Multiple Linear Regression
In this lesson, you’re expected to learn about multiple regression, an extension of simple linear regression. Discover how to perform an analysis when you want to predict the value of a variable based on the value of two or more other variables.
What is Multiple Regression?
Multiple regression analysis is used to estimate the relationship between a dependent variable (Y) and two or more independent variables (X1, X2, …).
Having more than one independent variable introduces some complications into multiple regression analysis.
For example, extra testing must be done to validate the results of a multiple regression model. Interpreting the results also becomes more difficult, and we need to be careful to not include lots of uninformative variables that overfit the model.
In the case of multiple linear regression, we extend this idea by fitting a p-dimensional hyperplane for our p predictors.
Thus, we can express our dependent variable as a linear combination of the predictors:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

where the βi are the coefficients to be estimated and ε is the error term.
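As a sketch, the coefficients of such a linear combination can be estimated by ordinary least squares. The data below is synthetic and the variable names are illustrative; this is not the lesson's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic "advertising" predictors (illustrative only)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Sales depend on TV and Radio; Newspaper has no real effect here
sales = 3.0 + 0.05 * tv + 0.18 * radio + rng.normal(0, 1, n)

# Design matrix with an intercept column, then solve by least squares
X = np.column_stack([np.ones(n), tv, radio, newspaper])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta)  # [intercept, b_tv, b_radio, b_newspaper]
```

With enough data, the estimated coefficients land close to the true ones, and the Newspaper coefficient comes out near zero.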
For example, in our dataset money was spent advertising the products in three media: TV, Newspaper, and Radio. It is also plausible that products that were advertised heavily on TV were advertised heavily on the Radio as well. So our predictor variables may be correlated.
What does this mean?
This means that although the effect of TV advertising is significant when considered alone, it may not be significant when considered together with Radio and Newspaper, and its coefficient can also change. It could be that the products with more sales received more advertising on both TV and Radio. In that case, the observed effect for TV was just coincidental: it was really the advertising on Radio that produced the extra sales for those products.
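This confounding can be sketched with synthetic data (illustrative numbers, not the lesson's dataset). Here sales are driven only by Radio, but because TV spend tracks Radio spend, TV looks effective in a one-variable regression and only the multiple regression reveals its true, near-zero effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
radio = rng.normal(0, 1, n)
tv = 0.9 * radio + 0.3 * rng.normal(0, 1, n)     # TV spend tracks Radio spend
sales = 2.0 + 1.0 * radio + rng.normal(0, 0.5, n)  # only Radio drives sales

# Simple regression of sales on TV alone: TV looks strongly effective
b_tv_alone = np.polyfit(tv, sales, 1)[0]

# Multiple regression with both predictors: the TV coefficient collapses
X = np.column_stack([np.ones(n), tv, radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b_tv_alone, beta[1], beta[2])
```

The slope on TV alone is close to 1, while in the joint model it drops near 0 and the Radio coefficient recovers its true value.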
We are going to express the number of sales as a function of these three variables, quantify their joint effect, and determine whether all three are significant or whether some of them are not.
As we did before, we need to check the Coefficient of Determination, or R-squared, which measures the goodness of fit.
For the multivariate model, R-squared increases to 0.897. This means that advertising in these three media explains almost 90% of the variance in sales.
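R-squared is simple to compute from the residuals; a minimal helper (not the lesson's code, names are illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    the share of the variance in y explained by the fit."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # perfect fit -> 1.0
print(r_squared(y, np.full(4, y.mean())))   # mean-only prediction -> 0.0
```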
What is the adjusted R-square, and why is it needed?
First, every time you add a predictor to a model, R-squared increases, even if the improvement is due only to chance; it never decreases. Consequently, a model with more terms will have a higher R-squared simply because it has more terms. In fact, with N observations you can always find a linear combination of N random predictors whose R-squared equals 1. Hence we cannot compare the R-squared of models with different numbers of predictors.
Thus, when a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model and it produces misleadingly high R-squared values and a lessened ability to make predictions.
Suppose we are comparing a one-predictor model with a lower R-squared to a four-predictor model (that includes the variable of the one-predictor model).
Does the four predictor model have a higher R-squared because it’s better? Or is the R-squared higher just because it has more predictors?
To answer this question, we have to compare the adjusted R-squared values. If we compare plain R-squared, the model with four variables will always win. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than would be expected by chance.
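A minimal sketch of the adjustment, using the lesson's R-squared of 0.897 with p = 3 predictors; the sample size n = 200 is an assumption for illustration, not stated in the lesson:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared: penalizes R-squared for each extra predictor.
    n = number of observations, p = number of predictors (without intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# A slightly higher R-squared bought with four extra predictors can still
# give a LOWER adjusted R-squared (n = 200 assumed):
print(adjusted_r_squared(0.897, 200, 3))
print(adjusted_r_squared(0.898, 200, 7))
```

This is exactly the comparison the lesson describes: the second model's raw R-squared is higher, but its adjusted R-squared is lower, so the extra predictors did not pay for themselves.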
Is our overall model significant? Or should we accept the null hypothesis?
The null hypothesis would be that our model is no better than estimating the number of sales of a product from its mean. Just as in simple linear regression, we can rely on the F-test of the model and its associated p-value to answer this question.
In this case, F takes a value of 570 and its p-value is on the order of 10⁻⁹⁶, so the model is significant. But in the multiple-variable case, what does this mean? That the effects of all variables are significant? No: if the effect of just one variable is significant, the model will be significant and the null hypothesis is rejected.
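The overall F statistic can be reconstructed from R-squared alone. Assuming n = 200 observations (an assumption, not stated in the lesson), the lesson's numbers are consistent:

```python
def overall_f(r2, n, p):
    """F statistic of the overall regression versus the mean-only null model:
    explained variance per predictor over residual variance per residual
    degree of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# R-squared = 0.897 with p = 3 predictors and n = 200 (assumed)
print(overall_f(0.897, 200, 3))  # close to the lesson's F of 570
```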
One sign that this may be happening is a gap between the R-squared and the adjusted R-squared.
To further evaluate this issue, and see which variables are not significant, we will need to rely on the coefficients table and evaluate the p-value associated with each parameter.
TV: 0.0458, Radio: 0.18, Newspaper: -0.001
This means that if you invest an extra $100 in each of the media, you will expect to increase the number of sales in the following way:
(100 x 0.0458) + (100 x 0.18) – (100 x 0.001) = 4.58 + 18 – 0.1 ≈ 22.5
Of these roughly 22.5 extra units, about 18 are explained by Radio and about 4.6 by TV, with a negligible negative contribution from Newspaper.
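The budget arithmetic above as a quick check, using the coefficients from the lesson's table:

```python
coefs = {"TV": 0.0458, "Radio": 0.18, "Newspaper": -0.001}
extra_spend = 100  # an extra $100 in each medium
effects = {medium: extra_spend * b for medium, b in coefs.items()}
total = sum(effects.values())
print(effects)
print(total)  # about 22.5 extra units of sales
```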