Data Visualization

In this lesson, you’re expected to:
– understand the importance of presenting data in an effective manner
– learn about different types of visualization methods

Why is Data Visualization important in Business Statistics?

Data visualization is important in business statistics for two main reasons:

1) Exploratory Analysis

Once we have collected the data for our study, the first step is to explore the data.

Graphs help us understand our data: they reveal possible relationships between variables, expose outliers, and show how the data is distributed.

An early visualization of the data helps us gain intuition about it and formulate better hypotheses. Visually exploring our data is a must before we start performing more advanced statistics or modelling.

2) Communicating Results

A fashionable term in Data Science is storytelling. Communicating our results to others is as important as performing a good statistical analysis of our data.

If we have performed a great analysis and extracted useful conclusions but cannot communicate them to those making the decisions, our analysis will be worth nothing.

Good data visualizations can definitely help us tell stories, and communicate our results and ideas to others.

[Optional] How can data visualization help us illustrate world inequality?
Global Wealth Inequality – What you never knew you never knew
Watch this 4-minute video to learn more: https://www.youtube.com/watch?v=uWSxzjyMNpU
Bar Chart
A bar chart shows how a categorical variable is distributed, i.e. how many observations of each category we have in our sample.

A bar chart is made up of columns/rows plotted on a graph, where each column/row represents a different class or category.

How to Interpret a Bar Chart

• The columns are positioned over a label that represents the category.

• The height of each column indicates the number of observations in that category.
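As a minimal sketch of the idea above, the following Python snippet counts the observations in each category and draws one text bar per category. The job data and the function names are hypothetical and serve only as an illustration; in practice a plotting library would do the drawing.

```python
from collections import Counter

# Hypothetical sample: the job category of each surveyed person.
jobs = ["Lawyer", "Physician", "Lawyer", "Engineer", "Physician", "Lawyer"]

def category_counts(observations):
    """Return (category, count) pairs sorted from most to least frequent."""
    return Counter(observations).most_common()

def text_bar_chart(counts):
    """Render one row per category; the bar length equals the count."""
    width = max(len(label) for label, _ in counts)
    return "\n".join(f"{label:<{width}} | {'#' * n} {n}" for label, n in counts)

counts = category_counts(jobs)
print(text_bar_chart(counts))
```

Each row plays the role of a column in the chart: the label names the category and the bar length shows how many observations fall in it.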

The figure below shows a bar chart of the highest paid jobs in the US.

One can easily see that Physicians were the highest paid, followed by Lawyers and Research & Development Managers.

Enlarged version: http://bit.ly/2mjAlBD
Histograms
When we are analyzing data, a common question is how a given variable is distributed.

For example, a bank manager who knows that the total debt of the bank's customers amounts to $1000M might want to know how this debt is distributed.

Is it highly concentrated in a minority of accounts? Or do many accounts each hold a small debt?

Enlarged version: http://bit.ly/2nqSCSr
Characteristics of Histograms

A histogram is a useful graph that helps us answer these kinds of questions.

• Histograms show which values occur most often.
• Histograms show the minimum and maximum values observed in a dataset.
• Histograms show how the data is spread.
• The columns are positioned over a label that represents a quantitative variable.
• The column label can be a single value or a range of values.
• The height of the column indicates the size of the group defined by the column label.
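The grouping behind a histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the debt figures are invented, the function name is hypothetical, and the bins are equal-width ranges that include their lower bound and exclude their upper bound.

```python
def histogram_bins(values, bin_width):
    """Group values into equal-width ranges and count how many fall in each.

    Returns a list of ((low, high), count) pairs; each range is [low, high).
    """
    start = (min(values) // bin_width) * bin_width
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [((start + i * bin_width, start + (i + 1) * bin_width), c)
            for i, c in enumerate(counts)]

# Hypothetical debts (in thousands of dollars) for ten bank accounts.
debts = [2, 5, 7, 12, 14, 15, 18, 31, 33, 95]
bins = histogram_bins(debts, 20)
for (low, high), count in bins:
    print(f"{low:>3}-{high:<3} | {'#' * count}")
```

In this made-up sample most accounts fall in the lowest range while a single account sits far to the right, which is the kind of pattern the bank manager's question is about.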

[Optional] Histogram – What Is It?
Line Graphs
Line graphs are most useful for representing time series, that is, variables that change over time.

When visualizing a time series as a line chart, the horizontal axis represents time and the vertical axis measures the variable. Each point of the plot corresponds to a timestamp, and points are joined by a continuous line.
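As a rough text approximation of a line chart, the sketch below orders the observations by timestamp and maps each value to a block character whose height reflects the value. The prices, the dates, and the function name are hypothetical; a real line chart would join the points with a line instead of using blocks.

```python
from datetime import date

BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(series):
    """Draw a time series as one block character per observation, in time order."""
    ordered = sorted(series)      # sort (timestamp, value) pairs by timestamp
    values = [v for _, v in ordered]
    low, high = min(values), max(values)
    span = (high - low) or 1      # avoid dividing by zero for a flat series
    return "".join(BLOCKS[int((v - low) / span * (len(BLOCKS) - 1))] for v in values)

# Hypothetical closing prices, deliberately listed out of order.
prices = [
    (date(2024, 1, 3), 101.0),
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 2), 104.0),
    (date(2024, 1, 4), 108.0),
]
print(sparkline(prices))
```

Sorting by timestamp first matters: time runs left to right on the horizontal axis, so the observations must be placed in chronological order before they are drawn.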

For example, line graphs are widely used to represent the variation of stock prices. These visualizations are usually the first step in stock price forecasting.

Enlarged version: http://bit.ly/2mSg6OY
Pie Charts
A pie chart is a circular statistical graphic divided into slices, where the area of each slice is proportional to the quantity it represents.

These charts are used to summarize categorical data.

For example, they are widely used to visualize electoral results, and the number of votes received by each party.
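The proportionality between a slice and its quantity can be made concrete with a short Python sketch. The vote counts and party names are invented for illustration, and the function name is hypothetical. Because every slice shares the same radius, a slice's area is proportional to its central angle, so the angle is simply the category's share of 360 degrees.

```python
def pie_slices(counts):
    """Return {category: (share, angle_in_degrees)} for a pie chart."""
    total = sum(counts.values())
    return {label: (n / total, 360 * n / total) for label, n in counts.items()}

# Hypothetical votes per party, in millions.
votes = {"Party A": 6, "Party B": 3, "Party C": 3}
slices = pie_slices(votes)
for label, (share, angle) in slices.items():
    print(f"{label}: {share:.0%} of the pie, {angle:.0f} degrees")
```

Note that the output contains only shares and angles: once the chart is drawn, the absolute counts are no longer visible.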

In the figure below, we can see the percentage of fans that Donald Trump and Hillary Clinton had on social media in August 2016.

Just by looking at the pie chart, one cannot tell the number of fans each candidate has. However, we can quickly approximate the proportions.

Left panel: Facebook, Right panel: Twitter
Enlarged version: http://bit.ly/2nIi1D7
Criticism of Pie Charts

Although pie charts are widely used in business, experts have criticized them because it is difficult to compare the sections of a given pie chart (especially when it represents many categories) and to compare data across different pie charts.

They can be replaced by bar charts (which we explained earlier) or boxplots.*

* A boxplot is a graphical summary of the distribution of a sample that shows its shape, central tendency, and variability.
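To make the footnote concrete, here is a minimal sketch of the five values a boxplot is usually built from. The sample is invented and the function name is hypothetical. It assumes the simplest form of boxplot, in which the whiskers extend to the minimum and maximum; many tools instead cut the whiskers short and mark distant points as outliers.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the values a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Hypothetical sample of nine order values.
orders = [12, 15, 17, 20, 22, 25, 29, 34, 60]
summary = five_number_summary(orders)
print(summary)
```

The box spans Q1 to Q3 (variability), the line inside it marks the median (central tendency), and the relative lengths of the two whiskers hint at the shape of the distribution.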
Scatter Plots
What are Scatter Plots?

Scatter plots are used to represent the relationship between two numerical variables. These plots show how one variable tends to change as the other changes.

• If the variables are positively related, then when one of them increases, the other also tends to increase, and vice versa.

• If the variables are negatively related, then when one of them increases, the other tends to decrease.

• Finally, if the variables are not related, we will not find either of the previous patterns on the graph.

Sometimes the relationship is not easy to see, so scatter plots often include a trend line that summarizes the relationship between the variables.
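One common way to obtain such a line is an ordinary least-squares fit, sketched below. The lesson does not say which method its figure uses, so treat this as an assumption; the data and the function names are also hypothetical.

```python
def trend_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def direction(slope):
    """Describe the sign of the fitted linear relationship."""
    if slope > 0:
        return "positive"
    if slope < 0:
        return "negative"
    return "none"

# Hypothetical data: advertising spend and resulting sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = trend_line(ad_spend, sales)
print(f"sales = {slope} * ad_spend + {intercept} ({direction(slope)} relationship)")
```

A positive slope corresponds to a positive relationship and a negative slope to a negative one. With real, noisy data, a slope close to zero suggests that there is no linear relationship, although the variables could still be related in a non-linear way.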

When performing an initial exploratory analysis, it’s very common to plot multiple scatter plots, visualizing all the possible pairs of numeric variables in the dataset.
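Listing all the pairs is straightforward to sketch in Python. The column names below are hypothetical; each pair printed would correspond to one scatter plot.

```python
from itertools import combinations

# Hypothetical numeric columns of a dataset.
columns = ["price", "units_sold", "ad_spend", "rating"]

# Every unordered pair appears exactly once.
pairs = list(combinations(columns, 2))
for x_name, y_name in pairs:
    print(f"scatter plot: {x_name} vs {y_name}")
```

With k numeric variables there are k(k-1)/2 such pairs, so four variables already require six plots.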

Enlarged version: http://bit.ly/2njuUrt
Limits of Scatter Plots

What can cause a scatterplot to not be informative? When we have lots of data.

In such cases, it is common for several observations to fall on the same point of the graph. As a result, we cannot tell which regions of the scatter plot hold a higher density of observations, and we may misjudge the trend of the data.

Contour Plots

A possible solution would be a contour plot, which is a graph that you can use to explore the potential relationship between three variables.

Contour plots display a 3-dimensional relationship in two dimensions, with x- and y-factors (predictors) plotted on the x- and y-scales and response values represented by contours.  
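When a contour plot is used to address overlapping points, one reasonable choice for the third variable is the density of observations. The sketch below assumes that interpretation: it divides the plane into square cells and counts the points in each one, and a plotting tool could then draw contours over those counts. The points, the cell size, and the function name are hypothetical.

```python
from collections import Counter

def density_grid(points, cell_size):
    """Count how many (x, y) points fall in each square cell of the plane."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

# Hypothetical observations; four of them crowd into the same small region.
points = [(1.0, 1.0), (1.2, 1.1), (1.1, 1.3), (1.0, 1.0), (4.5, 4.2), (8.9, 0.3)]
grid = density_grid(points, cell_size=2)
for cell, count in grid.most_common():
    print(f"cell {cell}: {count} observation(s)")
```

The counts recover what the raw scatter plot hides: two of the points coincide exactly and would be drawn as one, yet the grid still shows that the first cell holds four observations.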

[Optional] Data Visualization – What it is and why it matters