Confidence Intervals & Hypothesis Testing


In this lesson, you’re expected to:
– learn what confidence intervals are and how to identify them
– understand how to use hypothesis testing with confidence intervals

Let’s begin this lesson by revising the standard error and the Central Limit Theorem, as they will be useful to understand confidence intervals.

Standard Error of the Mean

The standard error (SE) of a sample mean is an estimate of how much that mean would vary from sample to sample if the sampling were repeated.

In other words, it is the standard deviation of the sampling distribution of the sample statistic.

The formula to calculate the standard error of the mean is the following:

SE = s / √n

* s is the sample standard deviation and n is the sample size.
* SE decreases as the size of the sample (n) increases.
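As a minimal sketch of the standard error of the mean, the snippet below computes s / √n with Python's standard library. The function name `standard_error` and the call counts are hypothetical, invented only for illustration.

```python
import math
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)

# Hypothetical daily call counts, invented for illustration only.
calls = [480, 510, 495, 530, 470, 505, 520, 490]
print(standard_error(calls))

# Repeating the same values four times mimics a larger sample:
# the standard error shrinks as n grows.
print(standard_error(calls * 4))
```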
Central Limit Theorem

If a sample consists of at least 30 independent observations and the distribution of the underlying population data is not strongly skewed, then the distribution of the sample means is well approximated by a normal distribution.
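The theorem can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be uniform on [0, 1] (symmetric, so not skewed), and the helper `sample_means` is a hypothetical name. If the approximation holds, the sample means should cluster around 0.5 with a spread close to σ / √n.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_means(draw, n, repetitions):
    """Draw `repetitions` samples of size n and return the mean of each."""
    return [
        statistics.fmean([draw() for _ in range(n)])
        for _ in range(repetitions)
    ]

# Assumed population: uniform on [0, 1], with mean 0.5 and
# standard deviation sqrt(1/12).
means = sample_means(random.random, n=30, repetitions=2000)

expected_spread = math.sqrt(1 / 12) / math.sqrt(30)
print(statistics.fmean(means))   # should be close to 0.5
print(statistics.stdev(means))   # should be close to expected_spread
print(expected_spread)
```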
What is a Confidence Interval?

Informally, we can say that a confidence interval is a range of numbers in which the true value of a parameter is likely to fall.

A confidence interval complements a point estimate, because the estimate alone gives no information about how much it would vary from one sample to another.

A point estimate is rarely perfect and usually contains some error or uncertainty.

Hence, instead of just estimating the value of a parameter, the next step is to provide a plausible range of values for the parameter. And this is what a confidence interval provides.

For example, imagine we want to know how many customers we expect to call customer service daily. 

The first step would be to estimate the parameter (the number of customers calling). For this purpose, we can compute the mean. Let’s say we have a mean of 500 customers calling daily.

Next, we need to know how much this quantity varies from day to day. The number of telephone operators the manager hires would be very different if they expect peaks of around 1,000 calls than if they expect a maximum of 600. Thus, we need to calculate a confidence interval. Once we have calculated it, we will know, for example, that we can expect between 300 and 700 calls per day.

If the distribution of a point estimate approximately follows a normal distribution, its standard error represents the standard deviation associated with the estimate.

In this case, we know that roughly 95% of the time, the estimate will fall within 2 standard errors of the parameter.

In other words, there is roughly a 95% chance that the sample mean falls between μ − 2 SE and μ + 2 SE.

Hence, in such a case, we can say that we are 95% confident that we have captured the true parameter.

What does being 95% confident mean?

Suppose we take lots of samples to estimate the mean of a population and build a confidence interval from each sample.

Then, 95% of those intervals will contain the actual value of the mean.
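This repeated-sampling interpretation can be illustrated with a short simulation. It is a sketch under assumptions that are not in the lesson: a normal population with a hypothetical mean of 500 and standard deviation of 100, samples of 50 observations, and intervals built as the sample mean ± 2 SE.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD = 500, 100   # hypothetical population values
N, REPETITIONS = 50, 1000       # sample size and number of samples

def interval_2se(sample):
    """Return (low, high) for the sample mean plus or minus 2 standard errors."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - 2 * se, mean + 2 * se

hits = 0
for _ in range(REPETITIONS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = interval_2se(sample)
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / REPETITIONS
print(coverage)  # fraction of intervals that contain the true mean
```

The printed fraction should land near 0.95, although it will not match it exactly because the number of simulated samples is finite.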

Creating more Confidence Intervals

Remember: why did we choose 2 standard errors to build the confidence interval? 

The decision was based on the general guideline that around 95% of the time, observations fall within two standard deviations of the mean.

However, for the normal distribution we can be more accurate, as the more precise value is 1.96 rather than 2.

CI (95%) = Estimate ± (1.96 × SE)

With the previous formula, we are limited to being 95% confident. 

What if the manager of the company wants to be 99% confident about the result?

How to Build Confidence Intervals

In general, if the point estimate follows the normal model, we can build a confidence interval at any confidence level for the estimated parameter with the following formula:

Estimate ± (z × SE)

* z corresponds to the confidence level selected.

The value of z can be obtained from a Z table. Commonly used confidence levels and their respective z-values are:

* 90%: z = 1.645
* 95%: z = 1.96
* 99%: z = 2.58

Hence, for a 99% confidence interval, the value of z corresponds to 2.58.
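Instead of looking z up in a table, it can be computed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper names `z_value` and `confidence_interval`, and the example estimate of 500 with SE of 100, are assumptions made for illustration.

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided z for a confidence level given as a fraction, e.g. 0.95."""
    # A two-sided interval leaves (1 - confidence) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def confidence_interval(estimate, se, confidence=0.95):
    """Return (low, high) for Estimate ± (z × SE)."""
    z = z_value(confidence)
    return estimate - z * se, estimate + z * se

for level in (0.90, 0.95, 0.99):
    print(level, round(z_value(level), 3))

# Hypothetical numbers: an estimate of 500 calls with a standard error of 100.
print(confidence_interval(500, 100, confidence=0.99))
```

A higher confidence level gives a larger z and therefore a wider interval for the same estimate and standard error.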

How do we interpret the 99% confidence interval?
If you repeatedly draw samples and compute a 99% confidence interval from each one, about 99% of those intervals will contain the true value of the population parameter.

In the figure below, we have computed the mean and its 95% confidence interval for 100 samples.

Green dots represent the average number of sales, and lines the confidence interval. Red lines are used to highlight those confidence intervals that do not contain the actual mean of the number of sales. The black horizontal line shows this value.

Does the figure make sense? Does the number of samples for which the confidence interval does not include the actual value seem reasonable?

A 95% confidence interval is built so that it contains the actual value of the mean for about 95% of samples. Hence, if we took a very large number of samples, we would expect about 5% of the confidence intervals to miss the actual value, while the remaining 95% would contain it.

In the figure, only 3 of the 100 confidence intervals, highlighted in red, do not contain the actual value. That is 3%, which is close to the expected 5%, so the result is reasonable.
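To judge how unusual 3 misses out of 100 would be, one can model the number of missing intervals as a binomial count. This is a sketch that assumes the 100 samples are independent and that each interval misses the actual value with probability 0.05; the helper `binomial_cdf` is a hypothetical name.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Probability of at most k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability that 3 or fewer of 100 intervals miss the actual value
# when each one misses with probability 0.05.
p_three_or_fewer = binomial_cdf(3, 100, 0.05)
print(p_three_or_fewer)
```

Under these assumptions the probability is roughly one in four, so observing only 3 misses is not surprising.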

Hypothesis Testing
Hypothesis Testing with Confidence Intervals

A bank is performing an analysis of their customers who are closing their account. In particular, the manager is interested in knowing whether those leaving the bank have on average more or less money in their account than those who stay.

How do we proceed? 
To answer this question, we need hypothesis testing.

1) First, we separate our clients into two populations. The first one includes all the account holders who closed their account last month, while the second group is composed of those who kept their account open.

2) Compute the mean for each group and compare the means of both groups. The result is that those leaving had on average $32,000 and those who stayed had only $30,000.

So there is a difference in means, but is the observed effect statistically significant?

So now we have two hypotheses:

• H0: The mean amount of money in the accounts of those leaving and those staying is the same. This is the null hypothesis, which states that the observed effect is due to chance or a bias in the sample.

• H1: The mean amount of money in the accounts of those leaving is higher than the mean amount of money in the accounts of those who stay. H1 is known as the alternative hypothesis, which is what you believe or hope to prove to be true.

Now, we can use confidence intervals to decide whether we can reject the null hypothesis in favor of H1.

To this end, we compute the confidence interval of the mean estimate for both populations.

Suppose the 95% confidence interval for the mean balance of those leaving the bank is [$29,000, $33,000]. Since the mean of those staying, $30,000, lies within this range of plausible values, we cannot reject the null hypothesis.
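The decision rule used here can be written as a simple check. This is a minimal sketch that follows the lesson's simplification of comparing one group's confidence interval with the other group's mean; the function name `within_interval` is hypothetical.

```python
def within_interval(value, interval):
    """Return True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

leaving_ci_95 = (29_000, 33_000)  # 95% CI for those leaving, from the example
staying_mean = 30_000             # mean balance of those who stayed

if within_interval(staying_mean, leaving_ci_95):
    decision = "cannot reject H0"
else:
    decision = "reject H0"

print(decision)
```

Had the mean of those staying fallen outside the interval, for example at $28,000, the same check would lead to rejecting the null hypothesis at the 95% confidence level.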
