Measures of Central Tendency

In this lesson, you’re expected to:
– learn about the three main measures of central tendency: the mean, median, and mode.
– understand how finding the central value of a dataset can help with analysis.

Exploratory Statistics

Imagine that we have already collected, processed and cleaned our data.

Thus, at this point we have a table with millions of records.

The human brain cannot make an informed decision from this table, as it is far too big to read and summarize in our heads.

Hence, once we have collected the data and prepared it for our study, we need to compute some basic statistics in order to summarize the data.

Once we have summarized and characterized our dataset, we can start making decisions with it.

The simplest way to summarize the data is to compute the sample mean.

Finding the Center of Data

Knowing the center of a dataset is extremely important in business in order to make decisions.

For example, a health insurance company that is going to launch a new product (the price of the product will vary with the age of the subscriber) needs to know the average cost of a policy per age range.

By the same token, the manager of the insurance company must know the average age of the policyholders who churn* the most, since for a health insurance company it is vital to have a young portfolio.

* Churn Rate: the number of customers or subscribers who cut ties with your service or company during a given time period.
Calculating the Mean

There are several ways to calculate the mean, but the most widely used is the arithmetic mean.

If you have a sample of n values, the sample mean is the sum of the values divided by the number of values:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Here n is the number of values in the sample and x_i is a single value. The sample mean describes the central tendency of the sample.
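
The formula above can be sketched in a few lines of Python. The cost values are made up for illustration, and sample_mean is a hypothetical helper name; the standard library's statistics.mean gives the same result.

```python
from statistics import mean

# Hypothetical sample: annual policy costs in dollars (made-up values).
costs = [120.0, 340.0, 90.0, 410.0, 240.0]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    return sum(values) / n


x_bar = sample_mean(costs)
print(x_bar)        # 240.0
print(mean(costs))  # 240.0, same result from the standard library
```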
Sample vs. Population

We must note that there is a difference between the mean of a sample and the mean of a population.

A population includes all of the elements from a set of data, while a sample consists of a selection of observations chosen from the population.

For example, if we are analyzing the number of minutes that people between 25 and 35 years old spend connected to Facebook, the population would consist of all the individuals in that age range.

On the other hand, a sample would be a set of randomly chosen individuals aged between 25 and 35.
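
As a rough illustration of the difference, the sketch below builds a simulated population and draws a random sample from it. The population size, the sample size, and the distribution of minutes are all assumptions made for the example, not real usage data.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily minutes on Facebook for 100,000 people
# aged 25-35 (simulated values, clipped at zero).
population = [max(0.0, random.gauss(60, 15)) for _ in range(100_000)]

# A sample: 500 individuals chosen at random, without replacement.
sample = random.sample(population, k=500)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.1f} minutes")
print(f"sample mean:     {sample_mean:.1f} minutes")
```

The two means are close but generally not identical, because the sample contains only part of the population.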

Outliers
Although the mean is the most basic and important summary statistic, other measures such as the median and mode can be more informative in several scenarios.

These other methods are especially useful when the mean leads to misleading results. This can happen when the data contains outliers. Outliers are a few observations that significantly differ from the rest of the elements of a dataset.

Example of an Outlier
Enlarged version: http://bit.ly/2mxfCv9
Is the mean represented in the Figure above a good summary of the data? Would you rely on this value to make a decision in your business?

Definitely not.

How an Outlier Distorts Data

Imagine the following scenario. 

We want to know the average income of the typical US citizen. To do so, we randomly choose a shopping mall and ask a sample of 1,000 people about their salary. All of a sudden, Michael Jordan walks through the shopping mall and gets asked about his salary.

Before Jordan appeared, the mean salary was $50,000 per year. Jordan reports an annual income of $100M. This amount differs greatly from the rest of the sample.

Hence, if we compute the mean before and after Jordan walks into the shopping mall, the results are completely different. The new mean is representative neither of the average US citizen nor of Michael Jordan. In this case, the statistic was distorted by an outlier.
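
The size of the distortion can be worked out directly from the figures in the scenario. The sketch below assumes that the 1,000 respondents were surveyed before Jordan and that he is added as the 1,001st observation; the text does not say this explicitly.

```python
n_before = 1_000              # respondents before Jordan (assumption: he is the 1,001st)
mean_before = 50_000          # mean salary in dollars, from the scenario
jordan_income = 100_000_000   # Jordan's reported income, from the scenario

# The sum of the first 1,000 salaries follows from their mean.
total_before = n_before * mean_before

# New mean once Jordan's income is included.
mean_after = (total_before + jordan_income) / (n_before + 1)

print(f"mean before: ${mean_before:,.0f}")   # mean before: $50,000
print(f"mean after:  ${mean_after:,.0f}")    # mean after:  $149,850
```

Under these assumptions, one observation out of 1,001 roughly triples the mean.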

1) The Median

The median is another simple measure of central tendency. It divides the sample or population in half. This means that 50% of the observations are above the median, and the other 50% are below the median.

The simplest method to find the median of a dataset is to arrange the observations in order from smallest to largest value. The median is the middle value.

In case there is an even number of observations, the median corresponds to the average of the two middle values.
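
The procedure above (sort, then take the middle value or the average of the two middle values) can be sketched as follows. The function name find_median and the sample values are illustrative; the standard library's statistics.median follows the same rule.

```python
from statistics import median


def find_median(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the dataset must contain at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 1, 5, 3, 9]
even_sample = [7, 1, 5, 3]

print(find_median(odd_sample))   # 5
print(find_median(even_sample))  # 4.0
```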

Relation between the Mean and Median

In the previous example, a single extreme value pulled the mean far above what a typical respondent earns, whereas the median would barely move. What does this imply?

What can we learn about the relation between the mean and the median?

• If the distribution of the dataset is symmetrical (panel B), then the mean = median, since both tails are balanced.

• If the distribution is negatively skewed, there is a left tail, and the mean < median.

• On the other hand, if the distribution is positively skewed, there is a tail to the right, and the mean > median.

Enlarged version: http://bit.ly/2m4iymH
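
The three cases listed above can be checked on small, hand-made datasets. The values below are invented for illustration, and the relation is best read as a rule of thumb for typical single-peaked data rather than a guarantee for every dataset.

```python
from statistics import mean, median

# Tiny hypothetical datasets, one per case.
datasets = {
    "symmetric": [1, 2, 3, 4, 5],
    "negatively skewed": [1, 17, 18, 19, 20],   # long tail to the left
    "positively skewed": [1, 2, 3, 4, 20],      # long tail to the right
}

for name, values in datasets.items():
    print(f"{name}: mean={mean(values)}, median={median(values)}")
```

Here the symmetric data gives mean 3 and median 3, the negatively skewed data gives mean 15 and median 18, and the positively skewed data gives mean 6 and median 3.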
Should I use the Mean or the Median to characterize my Data?

The mean is the most widely known method to measure the center of a dataset. Hence, in business everyone will know what a mean is, whereas not everyone might understand the median.

However, as we saw in the previous example about salaries, when the data is skewed, the mean is not informative as it is affected by extreme values and outliers.

Thus, when our dataset has a skewed distribution, the median is much more representative of our sample or population as it is less affected by extreme observations.

The figure below shows the distribution of health costs of policies from a health insurance company. The sample corresponds to policies where the age of the policyholder ranges between 50 and 65.

If the manager of the company wants to characterize the average cost of the policies in that age segment, which measure should they use, the median or the mean? Why?

Enlarged version: http://bit.ly/2mMprGq
The distribution is positively skewed. This is typical in health costs where the cost of the majority of policies is very low, with a minority of policies having an enormous cost.

In this case, the median is more representative.

[Optional] What is a Skewed Distribution?
2) The Mode

The mode is the value that appears most often in a dataset.

The mode of a discrete variable is the value x at which its histogram takes its maximum value.

Use of the Mode

When we are analyzing categorical data (that is, non-numerical data), we cannot compute the mean, and unless the categories have a natural order we cannot compute the median either.

Hence, the mode becomes very useful when dealing with categorical data.

Another important characteristic of the mode is that we can have more than one mode in a dataset.

Why? Suppose there are two categories that appear with the same frequency in the dataset. In that case, the dataset would have two modes.

By the same token, if no category is repeated in the dataset, there would be no mode.
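
A minimal sketch of finding the mode of categorical data, covering the two-mode and no-mode cases described above. The plan names are made up, and modes is a hypothetical helper that follows this lesson's convention of reporting no mode when nothing is repeated.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or an empty list if no value is repeated."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Convention used in this lesson: nothing repeats, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == top]


# Hypothetical categorical data: the plan chosen by each policyholder.
plans = ["basic", "premium", "basic", "family", "premium"]

print(modes(plans))                           # ['basic', 'premium']
print(modes(["basic", "premium", "family"]))  # []
```

Note that the standard library's statistics.multimode treats the second case differently: when no value is repeated, it returns every value rather than an empty list.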

[Optional] Finding mean, median, and mode
Watch this 4-minute video to learn more: https://www.youtube.com/watch?v=k3aKKasOmIw