Data Visualization
In this lesson, you’re expected to:
– understand the importance of presenting data in an effective manner
– learn about different types of visualization methods
Why is Data Visualization important in Business Statistics?
Data visualization matters in business statistics for two main reasons:
1) Exploratory Analysis
Once we have collected the data for our study, the first step is to explore the data.
Graphs help us understand our data: they reveal possible relationships between variables, expose outliers, and show how the data is distributed.
An early visualization of the data helps us gain intuition about it and formulate better hypotheses. Visually exploring our data is a must before we start performing more advanced statistics or modelling.
2) Communicating Results
A fashionable term in Data Science is storytelling. Being able to communicate our results to others is as important as doing a good statistical analysis of our data.
If we have performed a great analysis and extracted useful conclusions but cannot communicate them to the people who make the decisions, our analysis will be worth nothing.
Good data visualizations can definitely help us tell stories, and communicate our results and ideas to others.
Watch this 4-minute video to learn more: https://www.youtube.com/watch?v=uWSxzjyMNpU
A bar chart is made up of columns/rows plotted on a graph, where each column/row represents a different class or category.
• The columns are positioned over a label that represents the category.
• The height of each column indicates the number of observations in each group.
One can easily see that Physicians had the highest-paid job, followed by Lawyers and Research & Development Managers.
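As a minimal sketch of the counting behind a bar chart, the snippet below tallies the observations in each category and draws one text bar per category. The job list is hypothetical data invented for illustration, and `text_bar_chart` is a helper name we made up; a plotting library would draw real columns from the same counts.

```python
from collections import Counter

# Hypothetical data: the job category of each surveyed employee.
jobs = ["Lawyer", "Physician", "Lawyer", "Manager", "Physician", "Lawyer"]

# Count the observations in each category; each count is one bar's height.
counts = Counter(jobs)

def text_bar_chart(counts):
    """Return one text line per category, with a bar as long as its count."""
    width = max(len(label) for label in counts)
    return [
        f"{label:<{width}} | {'#' * n} ({n})"
        for label, n in counts.most_common()  # largest category first
    ]

for line in text_bar_chart(counts):
    print(line)
```

Sorting the bars from largest to smallest is a design choice that makes the ranking of categories easy to read.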
For example, a manager of a bank who knows that the total debt of their customers amounts to $1000M might want to know how this debt is distributed.
Is it highly concentrated in a minority of accounts? Or is it that many accounts have a low debt?
A histogram is a useful graph that helps us answer these kinds of questions.
• Histograms show which values occur most often.
• Histograms show the minimum and maximum values observed in a dataset.
• Histograms show how the data is spread.
• The columns are positioned over a label that represents a quantitative variable.
• The column label can be a single value or a range of values.
• The height of the column indicates the size of the group defined by the column label.
https://www.spss-tutorials.com/histogram-what-is-it/
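The grouping step behind a histogram can be sketched as follows. The debt values are hypothetical, the bin width of 25 is an arbitrary choice, and the `histogram` helper is our own illustration that assumes whole-number values; it is not a library function.

```python
# Hypothetical data: debt per account, in thousands of dollars.
debts = [2, 5, 7, 12, 14, 18, 21, 35, 48, 95]

def histogram(values, bin_width):
    """Count how many integer values fall in each range [start, start + bin_width)."""
    lowest = (min(values) // bin_width) * bin_width
    highest = (max(values) // bin_width) * bin_width
    # Create every bin up front so that empty ranges still appear in the result.
    counts = {start: 0 for start in range(lowest, highest + bin_width, bin_width)}
    for v in values:
        counts[(v // bin_width) * bin_width] += 1
    return counts

bins = histogram(debts, bin_width=25)
for start, n in bins.items():
    print(f"[{start}, {start + 25}) | {'#' * n}")
```

With this made-up data most accounts land in the lowest range, which is the kind of pattern the bank manager in the example would be looking for. Changing the bin width changes how coarse or detailed that picture is.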
When visualizing a time series as a line chart, the horizontal axis represents time and the vertical axis measures the variable. Each point of the plot corresponds to a timestamp, and consecutive points are joined by a continuous line.
For example, line graphs are widely used to represent the variation of stock prices. These visualizations are usually the first step for Stock Price Forecasting.
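A small sketch of how the points of a line chart are prepared: the observations are ordered by time, and each pair of consecutive points defines one segment of the line. The dates and prices below are invented for illustration and do not come from any real stock.

```python
from datetime import date

# Hypothetical closing prices, deliberately stored out of order.
prices = {
    date(2024, 1, 3): 101.5,
    date(2024, 1, 1): 100.0,
    date(2024, 1, 2): 102.0,
    date(2024, 1, 4): 99.0,
}

# A line chart needs points ordered by time: x is the timestamp, y the value.
points = sorted(prices.items())

# Each segment joins two consecutive points; it rises or falls by the change in y.
changes = [y2 - y1 for (_, y1), (_, y2) in zip(points, points[1:])]

for day, price in points:
    print(day.isoformat(), price)
print("change between consecutive days:", changes)
```

Sorting first matters: joining the points in storage order instead of time order would draw a line that doubles back on itself.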
These charts are used to summarize categorical data.
In the figure below, we can see the percentage of fans that Donald Trump and Hillary Clinton had on social media in August 2016.
Just by looking at the pie chart, one cannot tell the number of fans each candidate has. However, we can quickly approximate the proportions.
Enlarged version: http://bit.ly/2nIi1D7
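The reason a pie chart shows proportions but not absolute numbers can be sketched in a few lines. The fan counts below are hypothetical (they are not the real 2016 figures), and `pie_slices` is an illustrative helper of our own.

```python
# Hypothetical fan counts, made up for illustration.
fans = {"Candidate A": 30_000, "Candidate B": 10_000}

def pie_slices(counts):
    """Map each category to (share of the total, slice angle in degrees)."""
    total = sum(counts.values())
    return {name: (n / total, 360 * n / total) for name, n in counts.items()}

for name, (share, angle) in pie_slices(fans).items():
    print(f"{name}: {share:.0%} of the pie, {angle:.0f} degrees")

# Ten times as many fans for everyone produces exactly the same slices,
# so the chart alone cannot reveal the absolute numbers.
scaled = {name: n * 10 for name, n in fans.items()}
print(pie_slices(scaled) == pie_slices(fans))
```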
Although pie charts are widely used in business, experts have criticized them because it is difficult to compare the sections of a given pie chart (especially when it represents many categories) and to compare data across different pie charts.
They can be replaced by bar charts (which we explained earlier) or boxplots.
Scatter plots are used to represent the relationship between two numerical variables. These plots show how strongly one variable is associated with another.
• If the variables are positively related, then when one of them increases, the other also tends to increase, and vice versa.
• If the variables are negatively related, then when one of them increases, the other tends to decrease.
• Finally, if the variables are not related, we will not find any of the previous patterns on the graph.
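The direction of the relationship we see in a scatter plot can also be summarized with a number. The sketch below uses the Pearson correlation coefficient, which is our own choice of summary and is not introduced in this lesson; the data is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements, made up for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 22, 28]    # tends to rise as ad_spend rises
returns = [9, 8, 6, 5, 2]       # tends to fall as ad_spend rises

print("ad_spend vs sales:  ", round(pearson(ad_spend, sales), 2))
print("ad_spend vs returns:", round(pearson(ad_spend, returns), 2))
```

A value near +1 corresponds to an upward-sloping cloud of points, a value near -1 to a downward-sloping one, and a value near 0 to no clear linear pattern. The coefficient only captures linear association, so it complements the plot rather than replacing it.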
When performing an initial exploratory analysis, it’s very common to plot multiple scatter plots, visualizing all the possible pairs of numeric variables in the dataset.
In such cases, several observations often fall on the same point of the graph. As a result, we cannot tell which regions of the scatter plot hold a higher density of observations, and we may misjudge the trend in the data.
A possible solution would be a contour plot, which is a graph that you can use to explore the potential relationship between three variables.
Contour plots display a 3-dimensional relationship in two dimensions, with x- and y-factors (predictors) plotted on the x- and y-scales and response values represented by contours.
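One way to see how a third variable enters the picture is to count how many observations share each (x, y) position. The sketch below builds that grid of counts from hypothetical points; the `density_grid` helper is our own illustration, and a contour plot would then draw lines of equal count over such a grid.

```python
from collections import Counter

# Hypothetical observations with many repeated (x, y) pairs.
points = [(1, 1), (1, 1), (1, 1), (2, 1), (2, 2), (2, 2), (3, 3), (1, 1)]

# The number of observations on each point plays the role of the third
# variable, which contour levels would encode.
density = Counter(points)

def density_grid(density, size):
    """Return rows of counts for y = size..1 (top to bottom) and x = 1..size."""
    return [
        [density.get((x, y), 0) for x in range(1, size + 1)]
        for y in range(size, 0, -1)
    ]

for row in density_grid(density, size=3):
    print(" ".join(str(n) for n in row))
```

On a plain scatter plot the four observations at (1, 1) would look like a single dot. The grid makes that concentration visible.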
https://www.sas.com/en_us/insights/big-data/data-visualization.html