Simple linear regression seeks to find a linear relationship between two variables: the independent ‘predictor’ variable and the dependent ‘outcome’ variable. We represent this linear relationship with a line whose slope allows us to make inferences about the nature of the relationship: for every unit change in the predictor variable, the outcome variable changes by the amount of the slope.
The most commonly used method of fitting that line is called “Ordinary Least Squares” (OLS), so named because we choose the line that minimizes the sum of the squared residuals. Before this gets too in depth, we should define some terms.
- The mean of a variable is the sum of all observations of that variable, divided by the number of observations $n$; for example, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- $x_i$ is the $i$th observation of the variable $x$.
- $y_i$ is the $i$th observation of variable $y$.
- $\bar{x}$ is the mean of the predictor variable $x$.
- $\bar{y}$ is the mean of the outcome variable $y$.
- $\hat{y}_i$ is the predicted value for variable $y$ at observation $i$.
- $e_i = y_i - \hat{y}_i$ is the residual, or distance between the fitted value and the actual value at observation $i$ (each of these quantities is computed on a toy dataset in the sketch below).
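To ground the notation, here is a minimal Python sketch. The toy dataset and the candidate line are made up purely for illustration; the actual OLS line is derived in the next section.

```python
import numpy as np

# Toy data: x is the predictor, y is the outcome (values are made up).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar = x.mean()   # mean of the predictor, x-bar
y_bar = y.mean()   # mean of the outcome, y-bar

# An arbitrary candidate line y-hat = b0 + b1 * x (not the OLS fit yet).
b0, b1 = 0.1, 2.0
y_hat = b0 + b1 * x   # fitted values y-hat_i
e = y - y_hat         # residuals e_i = y_i - y-hat_i

print(x_bar, y_bar)
print(y_hat)
print(e)
```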
In order to use OLS, we must make a few assumptions about the nature of the residuals. We must assume that the residuals are normally distributed (follow the normal, or Gaussian, distribution) with a mean of 0 and a constant variance (homoscedasticity). There are more complex regression techniques that relax these assumptions, but we will not be delving into them here.
A number that scales a variable is called a ‘coefficient’. A number that stands alone is called a constant.
$\beta_0$ is a constant, and represents the y-intercept of the line. $\beta_1$ is a coefficient, and expresses the linear relationship between $x$ and $y$. We represent the line we intend to draw with the equation:

$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
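To make the model concrete, here is a small sketch that simulates data from such a line with errors satisfying the assumptions above (normal, mean 0, constant variance). The parameter values and sample size are arbitrary, chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" (in practice unknown) parameters of the line.
beta0_true = 1.5   # constant: the y-intercept
beta1_true = 0.8   # coefficient: change in y per unit change in x

n = 100
x = rng.uniform(0, 10, size=n)

# Errors satisfying the OLS assumptions: normal, mean 0, constant variance.
errors = rng.normal(loc=0.0, scale=2.0, size=n)

# Observed outcomes: the line plus the error term.
y = beta0_true + beta1_true * x + errors

print(x[:3], y[:3])
```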
Calculating the Coefficients
By using the ordinary least squares approach, we choose to minimize the sum of squared residuals:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

Using calculus, we can differentiate $SSE$ with respect to $\beta_0$ and $\beta_1$, set the results equal to 0, and solve for $\beta_0$ and $\beta_1$. This yields our resulting estimators:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
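Here is a minimal Python sketch of these closed-form estimators; the `ols_fit` helper and the toy data are my own, purely for illustration, and `np.polyfit` should agree with the result.

```python
import numpy as np

def ols_fit(x: np.ndarray, y: np.ndarray):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS estimators."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Made-up example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat)
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; should match
```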
Up until now, none of this has required any assumptions about the nature of the errors. Continuing on, we intend to make inferences and assess the validity of the model, and for that we will need the assumption of normality of the residuals.
Significance of the Regression
After generating a model, we test whether that model is significant; in simple linear regression this amounts to testing the null hypothesis $H_0: \beta_1 = 0$ against the two-sided alternative $H_1: \beta_1 \neq 0$.

The statistic for this test follows Student's t-distribution with $n - 2$ degrees of freedom:

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

Assuming the errors are normally distributed, this allows us to make several inferences. First, we can check whether the regression model is valid. To do that, we use the standard 2-sided hypothesis test detailed in the hypothesis testing section, with a t-distribution with $n - 2$ degrees of freedom.

So we calculate the $t$ statistic for $\hat{\beta}_1$, and find the $p$-value associated with that statistic. If the $p$-value is less than $\alpha$, then we reject the null hypothesis and declare the regression model valid. For completeness' sake, the standard error of $\hat{\beta}_1$ can be calculated as:

$$SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad \hat{\sigma}^2 = \frac{SSE}{n - 2}$$
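A minimal sketch of this test, assuming the data sit in NumPy arrays and using SciPy only for the t-distribution; the toy data and variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# OLS estimators (closed form, as derived above).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual variance estimate with n - 2 degrees of freedom.
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# Standard error of beta1_hat and the t statistic for H0: beta1 = 0.
se_beta1 = np.sqrt(sigma2_hat / sxx)
t_stat = beta1_hat / se_beta1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)

alpha = 0.05
print(t_stat, p_value, p_value < alpha)   # reject H0 if p_value < alpha
```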
Confidence Intervals
Using the derivations of the standard errors of the coefficients, and the fact that the corresponding statistics follow a t-distribution, we can generate confidence intervals around $\hat{\beta}_0$ and $\hat{\beta}_1$ such that we are $100(1-\alpha)\%$ sure the true regression coefficients lie within those bounds, e.g.

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, SE(\hat{\beta}_1)$$

Similarly, given the values, variances, and covariances of the beta coefficients, we can generate confidence intervals for the predicted values (or means) of new observations for a given $x$:

$$\hat{y}_0 \pm t_{\alpha/2,\, n-2}\, \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}, \qquad \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$$
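Continuing the same toy example, a minimal sketch that computes $100(1-\alpha)\%$ confidence intervals for both coefficients and for the mean response at a new value $x_0$; the standard error of $\hat{\beta}_0$ used here, $\sqrt{\hat{\sigma}^2(1/n + \bar{x}^2/\sum(x_i-\bar{x})^2)}$, is the standard simple-regression formula, not derived above.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
alpha = 0.05

# Fit and residual variance, as in the previous sketches.
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
sigma2_hat = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2) / (n - 2)

# Critical t value with n - 2 degrees of freedom.
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

# Confidence intervals for the coefficients.
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)

# Confidence interval for the mean response at a new x value x0.
x0 = 2.5
y0_hat = beta0_hat + beta1_hat * x0
se_mean = np.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / sxx))
ci_mean = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)

print(ci_beta0, ci_beta1, ci_mean)
```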
Prediction Intervals
While confidence intervals give boundaries for the parameters of the regression line, prediction intervals operate slightly differently. With prediction intervals, we are $100(1-\alpha)\%$ sure that new observations will fall within the prediction bounds. In other words, we're not just accounting for the variance of the estimated parameters, but also for the variance of the observation itself. This has a quick and rather handy interpretation: we simply add the variance underlying the confidence interval and the error variance of the observation itself to form a new variance term,

$$\hat{y}_0 \pm t_{\alpha/2,\, n-2}\, \sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}$$
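Continuing the same toy example, a minimal sketch of the prediction interval at a new value $x_0$; note the extra "1 +" inside the square root, which adds the observation's own error variance to the variance of the estimated mean.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
alpha = 0.05

# Fit and residual variance, as in the previous sketches.
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
sigma2_hat = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2) / (n - 2)

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

# Prediction interval at a new x value x0: the "1 +" term adds the variance
# of a single new observation to the variance of the estimated mean.
x0 = 2.5
y0_hat = beta0_hat + beta1_hat * x0
se_pred = np.sqrt(sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))
pred_interval = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)

print(pred_interval)
```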