Analysis of Variance (2/2)

In this lesson, you’re expected to:
– understand how to interpret an ANOVA table
– learn how and when to conduct a Two-Way ANOVA test

The table below shows the summary of the results of a standard ANOVA test.

In this case, we have performed a One-Way ANOVA test to evaluate whether the brand of a car affects the miles per gallon (mpg) of the model.

In other words, we test whether cars of different brands differ significantly in their fuel consumption. So our dependent variable is mpg and our factor is the brand of the car.

In total, we are analyzing 398 different cars from 37 different brands (such as Ford and Chevrolet).

Enlarged version: http://bit.ly/2nGHnWm
Let’s review all the rows and columns of the table.

1) Rows

i) Brand: In this row, we have the measures associated with the between-groups variability. This is the variability explained by the factor variable. In the current example, it is the variability in mpg that is explained by the brand of the car. The row takes its name from the factor variable.

ii) Residual: In this row, we find the information related to the within-group variability. This is the variability not explained by the factor variable. In the current example, it is the variability in mpg that is not explained by the brand of the car.

Enlarged version: http://bit.ly/2nGxvMa
2) Columns

i) sum_sq (Sum of the Squares)
The first row contains SSG*. In this case, it takes the value 10,562. This is the sum of the squares between groups. This is a measure of how much variation there is among the mean mpg of the different brands.

In the second row (residuals), SSE** is shown. It takes the value 13,690, which is the sum of the squares within each brand, added up over all brands. This is a measure of sampling error.

*SSG = the sum of squares between groups
**SSE = the sum of squared errors
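
To make the two sums of squares concrete, here is a minimal sketch that computes them by hand. The three brands and their mpg readings are made up for illustration (they are not the lesson's data set), and the variable names are our own.

```python
from statistics import mean

# Hypothetical mpg readings for three made-up brands (not the lesson's data).
mpg_by_brand = {
    "brand_a": [20.0, 22.0, 24.0],
    "brand_b": [30.0, 32.0, 34.0],
    "brand_c": [25.0, 27.0, 29.0],
}

all_values = [x for values in mpg_by_brand.values() for x in values]
grand_mean = mean(all_values)

# SSG: squared distance of each brand mean from the grand mean,
# weighted by the number of cars in that brand.
ssg = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in mpg_by_brand.values())

# SSE: squared distance of each car from the mean of its own brand.
sse = sum((x - mean(v)) ** 2 for v in mpg_by_brand.values() for x in v)

# Total variation around the grand mean; it splits exactly into SSG + SSE.
sst = sum((x - grand_mean) ** 2 for x in all_values)

print(f"SSG = {ssg:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
# prints: SSG = 150.0, SSE = 24.0, SST = 174.0
```

In this toy data most of the total variation sits between the brands (150 out of 174), which is the situation in which the factor explains a large share of the variability.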

ii) df (Degrees of Freedom)

For the between-groups variability row, this takes a value of 36. Where does this quantity come from? It’s given by k – 1, where k is the number of groups (here, brands).

Since there are 37 different brands, df = 37 – 1 = 36. For the within-groups variability row, df takes a value of 361. This quantity is computed as N – k, where N is the total number of observations. In this case, 398 – 37 = 361.

[Optional] Degrees of Freedom
iii) F (F-statistic)

This column only has a value for the first row. It shows the result of the F-test described in the previous lesson.

Remember that this measure is the ratio of MSG (which measures the between-group variability) to MSE (which measures the variability within each of the groups), where each mean square is the corresponding sum of squares divided by its degrees of freedom. Hence, the higher the value, the larger the share of variability explained by the factor variable relative to the unexplained variability, which strengthens the evidence against the null hypothesis.
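
As a rough check, the F-statistic can be rebuilt from the quantities already quoted in this lesson. The sketch below uses the rounded sums of squares from the text (10,562 and 13,690), so the result should only be expected to approximate the F column of the table; the variable names are our own.

```python
# Values quoted in this lesson for the one-way ANOVA table (rounded as printed).
ssg = 10_562.0  # sum_sq, Brand row
sse = 13_690.0  # sum_sq, Residual row
k = 37          # number of brands (groups)
n = 398         # number of cars (observations)

# Degrees of freedom for each row.
df_between = k - 1
df_within = n - k

# Mean squares: each sum of squares divided by its degrees of freedom.
msg = ssg / df_between
mse = sse / df_within

# F-statistic: between-group variability relative to within-group variability.
f_stat = msg / mse

print(f"df = ({df_between}, {df_within})")
print(f"MSG = {msg:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
# prints: df = (36, 361)
# prints: MSG = 293.39, MSE = 37.92, F = 7.74
```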

However, as we have seen, this quantity cannot be interpreted alone as it’s related to the number of observations and the number of groups in the factor variable. Thus, for direct interpretations, it is better to compute the associated p-value.

iv) PR (>F)

This column contains the p-value. For one-way ANOVA (as in this example), there will be just one p-value, in the row of the factor. In this case, the p-value is the probability of observing differences in mpg among brands at least as large as the ones in our sample if the null hypothesis were true, that is, if all brands had the same mean mpg.

Thus, a p-value below the chosen threshold means that we can reject the null hypothesis. Remember that we usually set this threshold to 0.05. In this case, the p-value is around 10⁻²⁷, which is extremely low. Therefore, we reject the null hypothesis and conclude that the brand of the car has a significant effect on its fuel consumption.

Two-Way ANOVA

An important limitation of the One-Way ANOVA method is that we are testing for one factor variable at a time.

For example, a manager at Inditex might be interested in the impact that both the color and the type of pants have on sales. The manager may also want to know whether the two factors interact (maybe beige chinos are in significantly higher demand than all the other possible combinations).

With One-Way ANOVA, if we want to test the effect of those two factor variables, we would need to conduct two separate and independent analyses. Thus, we are not able to test for both factors together.

Suppose we perform the two one-way ANOVA tests for the type and the color, and find no evidence of a difference between population means. There may still be a real pattern, namely that beige chinos are in higher demand, which we have failed to detect.

With Two-Way ANOVA, we can test for both effects at the same time and for their interaction.

Thus, by performing the latter analysis we would have detected the pattern and shown that significantly more beige chinos are sold.

Two-Way ANOVA is a more complex ANOVA model that considers more than one factor simultaneously. 

It analyzes the influence of two independent categorical variables on a continuous dependent variable. It can also consider the possible interaction (product) between the two factors.

For example, it may be that neither the effect of the color nor the effect of the type is significant, but the interaction term is significant because beige chinos have more sales. This method is a generalization of the ideas behind the one-way model. With this model, we can obtain the variance explained by each factor, and determine whether each factor and the interaction term are relevant or not.
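
The idea of an interaction without main effects can be sketched with a small balanced example. The sales figures below are invented for illustration and deliberately chosen so that the color and type averages cancel out; the decomposition shown is the textbook one for a balanced design (the same number of observations in every cell), not the output of any particular library.

```python
from statistics import mean

# Hypothetical weekly sales, 3 observations per cell (made-up numbers).
sales = {
    ("beige", "chinos"): [19.0, 20.0, 21.0],
    ("beige", "jeans"): [9.0, 10.0, 11.0],
    ("blue", "chinos"): [9.0, 10.0, 11.0],
    ("blue", "jeans"): [19.0, 20.0, 21.0],
}

colors = sorted({c for c, _ in sales})
types = sorted({t for _, t in sales})
n_rep = 3  # observations per cell; the formulas below assume a balanced design

grand = mean(x for cell in sales.values() for x in cell)
color_mean = {
    c: mean(x for (ci, _), cell in sales.items() if ci == c for x in cell)
    for c in colors
}
type_mean = {
    t: mean(x for (_, ti), cell in sales.items() if ti == t for x in cell)
    for t in types
}
cell_mean = {key: mean(cell) for key, cell in sales.items()}

# Main effects: how far each color (or type) average is from the grand mean.
ss_color = n_rep * len(types) * sum((color_mean[c] - grand) ** 2 for c in colors)
ss_type = n_rep * len(colors) * sum((type_mean[t] - grand) ** 2 for t in types)

# Interaction: what the cell means show beyond the sum of the two main effects.
ss_inter = n_rep * sum(
    (cell_mean[(c, t)] - color_mean[c] - type_mean[t] + grand) ** 2
    for c in colors
    for t in types
)

# Residual: variation of observations around their own cell mean.
ss_resid = sum((x - cell_mean[key]) ** 2 for key, cell in sales.items() for x in cell)

print(f"color = {ss_color:.1f}, type = {ss_type:.1f}")
print(f"interaction = {ss_inter:.1f}, residual = {ss_resid:.1f}")
# prints: color = 0.0, type = 0.0
# prints: interaction = 300.0, residual = 8.0
```

Two separate one-way analyses of this data would see identical averages for both colors and for both types, while the interaction row carries almost all of the variation.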

Two-Way ANOVA Table

The table of a two-way ANOVA is very similar to the one-way ANOVA. The main difference is the appearance of extra rows.

We can perform a two-way ANOVA with or without interaction. If we do not consider the interaction term among the factors, the table has three rows. The first two correspond to the factors and the last one to the residuals.

The Figure below shows a table for the car example. Remember that the dependent variable is mpg and in this case, the factor variables are the number of cylinders and the brand.

Enlarged version: http://bit.ly/2nOudmK
The easiest way to evaluate whether there is a significant effect of the factor variables on the dependent variable (mpg) is by looking at the p-value. This is the PR (>F) column.

Thus, we can see that both the brand and the number of cylinders of the engine have a significant effect on the mpg of a car.

In the table below, we have performed the same two-way ANOVA but included the interaction term. Now there is an extra row that shows the information for this new term.
Here’s a question:
Do the brand and number of cylinders have a significant influence on the mpg, and can we reject the null hypothesis? Is the interaction term significant?

Answer: Yes, both factors are still significant. The interaction term is also significant, as its p-value is approximately 4 × 10⁻¹⁵.

There are many situations in which an interaction term is significant even though one of the factors is not significant on its own. A simple one would be when evaluating the effect of the month and the day of the month on temperature.

Month by itself will definitely affect the temperature, as temperatures are higher in August than in December.

However, the day will not. Day 31 by itself will probably not affect the temperature, as the temperature expected on day 31 during winter and on day 31 during summer will be completely different. The interaction term, in contrast, will definitely influence temperature, as the temperature on the 31st of August will probably be higher than the temperature on the 31st of December.

[Optional] Box Plot: Display of Distribution
Check this link to learn more:
http://www.physics.csbsju.edu/stats/box2.html