Measures of Central Tendency
In this lesson, you’re expected to:
– learn about the three main measures of central tendency: the mean, median, and mode.
– understand how finding the central value of a dataset can help with analysis.
Imagine that we have already collected, processed and cleaned our data.
Thus, at this point we have a table with millions of records.
The human brain cannot make an informed decision from this table alone, as it is far too big to read and summarize in our heads.
Hence, once we have collected the data and prepared it for our study, we need to compute some basic statistics in order to summarize the data.
Once we have summarized and characterized our dataset, we can start making decisions with it.
The simplest way to summarize the data is to compute the sample mean.
Knowing the center of a dataset is essential for making business decisions.
For example, a health insurance company that is going to launch a new product (the price of the product will vary with the age of the subscriber) needs to know the average cost of a policy per age range.
By the same token, the manager of the insurance company must know the average age of the policyholders who churn the most, since for a health insurance company it is vital to have a young portfolio.
There are several ways to calculate a mean, but the most widely used is the arithmetic mean.
If you have a sample of n values, the sample mean is the sum of the values divided by the number of values:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
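The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` and the `ages` values are made up for the example and do not come from real data.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# Hypothetical sample: ages of five policyholders (made-up numbers).
ages = [23, 35, 41, 29, 52]
print(sample_mean(ages))  # prints 36.0
```

In practice, Python's standard library offers `statistics.mean`, which performs the same calculation.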
We must note that there is a difference between the mean of a sample and the mean of a population.
A population includes all of the elements from a set of data, while a sample consists of a selection of observations chosen from the population.
For example, if we are analyzing the number of minutes that people aged 25 to 35 spend connected to Facebook, the population would consist of all the individuals in that age range.
On the other hand, a sample would be a set of randomly chosen individuals aged between 25 and 35.
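To make the distinction concrete, here is a minimal sketch that draws a random sample from a population and compares the two means. The population is synthetic (uniformly random minutes between 0 and 120), which is an assumption made purely for illustration. The sample size of 500 is also arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic "population": daily minutes online for 100,000 hypothetical users.
population = [random.uniform(0, 120) for _ in range(100_000)]

# A sample is a random selection of observations drawn from the population.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean_value = statistics.fmean(sample)

print(f"population mean: {population_mean:.1f} minutes")
print(f"sample mean:     {sample_mean_value:.1f} minutes")
```

The sample mean is usually close to the population mean but rarely identical to it, which is why the two are treated as different quantities.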
The median and the mode, the other two measures of central tendency, are especially useful when the mean gives misleading results. This can happen when the data contains outliers. Outliers are a few observations that differ significantly from the rest of the elements of a dataset.
Imagine the following scenario.
We want to know the average income of the typical US citizen. To do so, we randomly choose a shopping mall and ask a sample of 1,000 people about their salary. All of a sudden, Michael Jordan walks through the shopping mall and gets asked about his salary.
Before Jordan appeared, the mean salary was $50,000 per year. Jordan reports an annual income of $100M, which differs enormously from every other response in the sample.
Hence, the mean computed before Jordan walks into the shopping mall and the mean computed afterwards are very different: with 1,000 respondents averaging $50,000, adding Jordan raises the mean to about $149,850. The new mean represents neither the average US citizen nor Michael Jordan. In this case, the statistic was distorted by an outlier.
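The effect of the outlier can be checked with a short calculation. The sketch below assumes, for simplicity, that each of the 1,000 respondents earns exactly $50,000. The lesson only gives the mean, so the individual values are an assumption.

```python
import statistics

# Assumption for illustration: every one of the 1,000 respondents earns
# exactly the reported mean of $50,000.
salaries = [50_000] * 1_000
mean_before = statistics.fmean(salaries)

# Michael Jordan joins the sample with a reported income of $100M.
salaries_with_outlier = salaries + [100_000_000]
mean_after = statistics.fmean(salaries_with_outlier)

print(f"mean before: ${mean_before:,.0f}")
print(f"mean after:  ${mean_after:,.0f}")
```

A single observation out of 1,001 is enough to roughly triple the mean, even though no one else's salary changed.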
The simplest method to find the median of a dataset is to arrange the observations in order from smallest to largest value. The median is the middle value.
In case there is an even number of observations, the median corresponds to the average of the two middle values.
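The procedure just described (sort, then take the middle value or the average of the two middle values) might be sketched as follows. The function name `median` is chosen for the example. Python's `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of observations, return the average of the
    two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))     # odd count: prints 3
print(median([4, 1, 3, 2]))  # even count: prints 2.5

# Same simplifying assumption as in the salary scenario: 1,000 people
# earning $50,000 each, plus one $100M outlier.
print(median([50_000] * 1_000 + [100_000_000]))  # prints 50000
```

Unlike the mean, the median of the salary data does not move when the outlier is added.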
In the previous example, we saw that a single outlier made the mean much larger than the median, but what does this imply?
What can we learn about the relation between the mean and the median?
• If the distribution is negatively skewed, there is a left tail, and the mean < median.
• On the other hand, if the distribution is positively skewed, there is a tail to the right, and the mean > median.
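The two cases above can be illustrated with small made-up datasets. The numbers below are invented for the example. Note that the mean/median relation is a rule of thumb that holds for typical skewed data like these, not a guarantee for every possible dataset.

```python
import statistics

# Made-up data with a long tail to the right (one large value).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Made-up data with a long tail to the left (one small value).
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.fmean(data)
    med = statistics.median(data)
    relation = ">" if mean > med else "<"
    print(f"{name}: mean {mean} {relation} median {med}")
```

In both datasets the mean is dragged toward the tail, while the median stays near the bulk of the observations.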
However, as we saw in the previous example about salaries, when the data is skewed, the mean is not informative as it is affected by extreme values and outliers.
Thus, when our dataset has a skewed distribution, the median is much more representative of our sample or population as it is less affected by extreme observations.
If the manager of the company wants to characterize the average cost of the policies in that age segment, which measure should they use: the median or the mean? Why?
In this case, the median is more representative.
The mode is the value that appears most often in a dataset.
The mode of a discrete variable is the value x at which its histogram takes its maximum value.
Hence, the mode becomes very useful when dealing with categorical data.
Note that a dataset can have more than one mode. Suppose two categories share the highest frequency in the dataset. In that case, the dataset would have two modes.
By the same token, if no category is repeated in the dataset, there would be no mode.
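A minimal sketch of finding the mode, following this lesson's convention that a dataset with no repeated value has no mode. The function name `modes` and the plan names are hypothetical. Note that Python's own `statistics.multimode` uses a different convention and returns every value when none is repeated.

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values.

    The list has several entries when there is a tie for the highest
    frequency, and is empty when no value is repeated (the convention
    used in this lesson).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # no value is repeated, so there is no mode
    return [value for value, count in counts.items() if count == top]


# Hypothetical categorical data: the insurance plan chosen by each customer.
print(modes(["gold", "basic", "gold", "silver"]))           # ['gold']
print(modes(["gold", "basic", "gold", "basic", "silver"]))  # ['gold', 'basic']
print(modes(["gold", "basic", "silver"]))                   # []
```

Because it only counts occurrences, this works for categories such as plan names, where a mean or median cannot be computed.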