Univariate Linear Regression

In this lesson, you’re expected to:
– understand how to perform a simple regression analysis
– learn about the Ordinary Least Squares method
– find out how to interpret the results of a regression analysis
What is Regression?

Regression analysis is a statistical methodology to estimate the strength and direction of the relationship between a dependent variable and one or more independent variables.

We can distinguish between simple regression (with just one independent variable) and multiple regression analysis (with two or more independent variables).

Simple regression allows us to estimate how the typical value of the dependent variable (Y) changes as the independent variable, or predictor, varies. One of the most widespread uses of regression analysis is prediction and forecasting.

Use of Regression

With regression, we can estimate real-world variables, so it is widely used in business to solve problems and make predictions. Some examples of questions that can be addressed with regression are:

• How does a tweet’s hashtag affect the number of retweets it will receive? How does the number of followers a user has on Twitter influence the number of replies she will receive?
• How are sales affected by the price of a competitor?
• How does alcohol assimilation vary with the age of the driver?
• What will the accident rate of the customers of an insurance company be for the next year?
• How does temperature affect the pollution of a city? And the average rainfall?

Simple Linear Regression

Simple linear regression means that we’re assuming a linear relationship between the dependent variable and the predictor.

This means that we are drawing a straight line through the data, and using this line to predict additional values for yet-unobserved data points.

Thus the simple linear regression model, with a single predictor variable, can be expressed as:
y = a + bx

where:
• y is the dependent variable that we want to estimate and predict,
• a is the constant term or intercept,
• x is the predictor or independent variable, and
• b is the slope parameter (how much y changes in response to a change in x).
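
To make the model concrete, here is a minimal sketch in Python (the values of a and b are made up for illustration):

```python
def predict(x, a, b):
    """Linear model: estimate y from x given intercept a and slope b."""
    return a + b * x

# Illustrative values: intercept 2.0, slope 0.5.
# An observation with x = 10 is then predicted as y = 2.0 + 0.5 * 10 = 7.0.
print(predict(10, a=2.0, b=0.5))  # 7.0
```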

So the idea is: from all the possible lines that cross the data, find the one that best explains the relation between y and x. 

Ordinary Least Squares

There are different methods to find the optimum values of a and b. However, one of the most common ones is Ordinary Least Squares (OLS).

OLS is a simple, widely used estimator in which the two parameters, a and b, are chosen to minimize the sum of the squares of the residuals.

The residuals are defined as the difference between the observed value of the dependent variable and the predicted value.

Thus, given the set of samples (xᵢ, yᵢ), the objective is to find the parameters a and b that minimize the sum of squared residuals:

SSR = Σᵢ (yᵢ − (a + b·xᵢ))²

In the figure below:
• observations are marked by red dots,
• the residuals are represented by dotted lines, and
• the blue line represents the final linear regression.

Thus, this blue line will be used to estimate the value of new unseen observations.
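
For simple linear regression, the minimizing parameters have a well-known closed form: b is the covariance of x and y divided by the variance of x, and a = mean(y) − b · mean(x). A minimal NumPy sketch, using made-up data for illustration:

```python
import numpy as np

# Made-up data: roughly a line with intercept 1 and slope 2, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Closed-form OLS estimates for the simple (one-predictor) case.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)          # observed minus predicted values
print(a, b, np.sum(residuals ** 2))  # intercept, slope, sum of squared residuals
```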

Advantages of OLS

• It is computationally very “cheap” to calculate the coefficients. We do not need a big cluster or lots of computational time.

• It is easier to interpret than more sophisticated models.

• Hence, when understanding the model matters more than raw accuracy, it is a good solution.

[Optional] Ordinary Least Squares Regression
Check out this interactive explanation to learn more:
http://setosa.io/ev/ordinary-least-squares-regression/
Explanation of the Relevant Elements of the Table

The left panel of the first table provides basic information about the model fit:

• Dep. Variable: The response in the model. The quantity we want to estimate.
• No. Observations: The number of observations or sample size.
• DF Residuals: Degrees of freedom of the residuals (Number of observations – number of parameters).
• DF Model: The number of parameters in the model (not including the constant term, if present).

Enlarged version: http://bit.ly/2n5xAb4
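
The field names described in this section match the summary that Python’s statsmodels library prints for an OLS fit; assuming that is the tool behind the table, a minimal sketch to reproduce it is:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data for illustration (any paired x and y samples work the same way).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

X = sm.add_constant(x)        # adds a column of ones so the model includes the intercept a
results = sm.OLS(y, X).fit()  # ordinary least squares fit
print(results.summary())      # prints the tables discussed in this section
```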

The right panel of the first table shows the goodness of fit:
• R-squared: A coefficient that evaluates the quality of the model. It is a statistical measure of how well the regression line approximates the real data points. Ranges from 0 to 1.

• Adj. R-squared: The above value adjusted based on the number of observations and the degrees-of-freedom of the residuals.

• F-statistic: A measure of how significant the fit is; the mean squared error of the model divided by the mean squared error of the residuals. Larger values of F indicate that the model is more likely to be statistically significant. The interpretation of this number also depends on the number of samples and the number of parameters in the model. In the case of simple linear regression, we always have the same number of parameters.

• Prob (F-statistic): The probability of obtaining an F-statistic at least this large under the null hypothesis that the predictor and the response are unrelated. This is the p-value. Usually we accept the model if this value is below 0.05.

• Log-likelihood: The log of the likelihood function.
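
If the table comes from statsmodels, these goodness-of-fit quantities are also available programmatically; a sketch continuing from the `results` object fitted earlier:

```python
# Continuing from the fitted `results` object in the earlier statsmodels sketch.
print(results.rsquared)      # R-squared
print(results.rsquared_adj)  # Adj. R-squared
print(results.fvalue)        # F-statistic
print(results.f_pvalue)      # Prob (F-statistic)
print(results.llf)           # Log-likelihood
```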


The second table contains the information of the estimated parameters (the coefficients a and b):
• coef: The estimated value of the coefficient.

• std err: The standard error of the estimate of the coefficient.

• t: The t-statistic value. This is a measure of how statistically significant the coefficient is.

• P > |t|: The p-value for the null hypothesis that the coefficient equals 0. If it is less than 0.05, there is a statistically significant relationship between the term and the response; if it is larger, there is not, and we should consider removing that variable.

• [95.0% Conf. Interval]: The lower and upper bounds of the 95% confidence interval for the coefficient.
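
Again assuming statsmodels, each column of this second table maps onto an attribute of the same fitted `results` object:

```python
# Continuing from the fitted `results` object in the earlier statsmodels sketch.
print(results.params)      # coef: the estimated a (const) and b (slope)
print(results.bse)         # std err of each coefficient
print(results.tvalues)     # t: the t-statistic of each coefficient
print(results.pvalues)     # P > |t|: the p-value of each coefficient
print(results.conf_int())  # the 95% confidence intervals (default alpha = 0.05)
```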


[Optional] Simple Linear Regression
How good is our model?

Now let’s review the summary of our model and understand what is happening. Remember that we are trying to estimate the number of sales from the amount of dollars invested in advertising the products.

Interpreting the Results

How much of the variance is our model capable of explaining?

To answer this question we can look at the coefficient of determination, or R-squared, which measures the goodness of fit.

It shows how much variance of Y is explained by the explanatory variables. In this case, how much variance of sales is explained by TV advertising.

This measure ranges from 0 (great dispersion) to 1 (the data lies perfectly on the estimated line). The value for our example is 0.612; thus TV advertising explains over 60% of the variance in the number of sales.
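
R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up observed and predicted values:

```python
import numpy as np

# Made-up observed values and model predictions, for illustration only.
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # close to 1: the predictions track the data well
```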

Is our model significant? Or should we accept the null hypothesis?

In this case, the null hypothesis would be that our model is no better than just estimating the number of sales of a product from its mean. We can take a look at the F-test of the global model and its associated p-value to answer this question.

The F-statistic takes a value of 312. As discussed before, we can make no claim about significance from this number alone; we have to combine it with the degrees of freedom and compare it to the F-distribution.

However, it’s easier to make a claim by looking at its associated p-value, which takes a value of about 10E-47. This quantity is far below 0.05. Hence, we can reject the null hypothesis and claim that the model is significant.
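
The comparison itself is mechanical: compute the p-value of the F-statistic from the F-distribution and check it against 0.05. A sketch with SciPy, where the degrees of freedom (one slope parameter and 198 residual degrees of freedom, i.e. a sample of 200 observations) are assumed for illustration:

```python
from scipy import stats

f_stat = 312.0               # the F-statistic quoted above
df_model, df_resid = 1, 198  # assumed: one slope parameter, 200 observations

# Survival function = P(F > f_stat) under the null hypothesis.
p_value = stats.f.sf(f_stat, df_model, df_resid)
print(p_value, p_value < 0.05)  # far below 0.05: reject the null hypothesis
```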

Interpreting the Parameters

Is the effect of all variables significant? And how many extra sales will I get for every extra $100 that I spend on TV advertising?

In this case, we are only interested in a single parameter in the model because we only have one predictor variable. Let’s take a look at it. The coefficient takes a value of around 0.05. What does this mean? How can I explain this to the manager deciding about the campaign of the new products?

This means that for every $100 I spend on TV ads, I get about 5 more sales. So an increase of $100 in the budget should yield roughly 5 more products sold, and an increase of $1,000 about 50 more. But is the effect of TV advertising significant, or is it just a product of chance? The p-value associated with the coefficient is approximately zero, well below 0.05. Therefore the effect is significant.
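
The arithmetic behind these claims is simply the slope times the change in budget; a minimal sketch using the coefficient quoted above:

```python
coef = 0.05  # estimated slope: extra sales per extra advertising dollar

for extra_budget in (100, 1000):
    extra_sales = coef * extra_budget
    print(f"${extra_budget} more on TV ads -> about {extra_sales:.0f} more sales")
```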

[Optional] How Can Linear Regression Be Applied in Business Settings?