Simple Linear Regression

Simple linear regression seeks to find a linear relationship between two variables: the independent ‘predictor’ variable and the dependent ‘outcome’ variable. We represent this linear relationship with a line, the slope of which allows us to make inferences as to the nature of the relationship: for every unit change in the predictor variable, the outcome variable changes by the amount of the slope.

The most commonly used method of fitting that line is called “Ordinary Least Squares” (OLS), so named because it minimizes the sum of the squared residuals. Before going further, we should define some terms.

  • The mean of a variable is the sum of all observations of that variable, divided by the number of observations.
  • x_i is the ith observation of the variable x.
  • y_i is the ith observation of variable y.
  • {\bar x} is the mean of the predictor variable x.
  • {\bar y} is the mean of the outcome variable y.
  • {\hat y}_i is the predicted value for variable y at observation i.
  • \varepsilon_i is the residual, or the distance between the fitted value {\hat y}_i and the actual value y_i at observation i.

In order to use OLS, we must make a few assumptions as to the nature of the residuals. We must assume that the residuals are normally distributed (follow the normal or Gaussian distribution) with a mean of 0 and a constant variance (homoscedasticity). There are more complex regression techniques that relax these assumptions, but we will not be delving into them here.

A number that scales a variable is called a ‘coefficient’. A number that stands alone is called a constant.

\beta_0 is a constant, and represents the y-intercept of the line. \beta_1 is a coefficient, and expresses the linear relationship between y and x. We represent the line we intend to draw with the equation:

    \begin{equation*} y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \end{equation*}
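
To make the model concrete, here is a minimal Python sketch that simulates observations from this equation; the parameter values, sample size, and variable names are arbitrary choices for illustration, not anything prescribed by the model.

    import numpy as np

    # Simulate data from y_i = beta_0 + beta_1 * x_i + eps_i
    # (illustrative parameter values, not a real dataset)
    rng = np.random.default_rng(0)
    beta_0, beta_1, sigma, n = 2.0, 0.5, 1.0, 50

    x = rng.uniform(0, 10, size=n)      # predictor observations x_i
    eps = rng.normal(0, sigma, size=n)  # residuals: mean 0, constant variance
    y = beta_0 + beta_1 * x + eps       # outcome observations y_i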

Calculating the Coefficients

By using the ordinary least squares approach, we choose to minimize the sum of squared residuals.

    \begin{equation*} \min(Q_{(\beta_0,\beta_1)}) \text{ where }Q_{(\beta_0,\beta_1)} = \sum_{i=1}^{n}\varepsilon_i^2 = \sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2 \end{equation*}

Using calculus, we can differentiate Q_{(\beta_0,\beta_1)} with respect to \beta_0 and \beta_1 and set the results equal to 0.
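
This gives the normal equations:

    \begin{align*} \frac{\partial Q}{\partial \beta_0} &= -2\sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i) = 0\\ \frac{\partial Q}{\partial \beta_1} &= -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0 \end{align*}

Solving these two equations for \beta_0 and \beta_1 yields our resulting estimators.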

    \begin{equation*} {\hat \beta}_0 = {\bar y} - {\hat \beta}_1 {\bar x} \hspace{1cm}{\hat \beta}_1 = \frac{\sum_{i=1}^n(y_i - {\bar y})(x_i - {\bar x})}{\sum_{i=1}^n(x_i - {\bar x})^2} \end{equation*}
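
These two formulas translate directly into code; below is a minimal numpy sketch (the function name ols_coefficients is an illustrative choice, not a library routine), which could be applied to the simulated x and y from earlier.

    import numpy as np

    def ols_coefficients(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
        """Estimate the intercept beta_0 and the slope beta_1 by ordinary least squares."""
        x_bar, y_bar = x.mean(), y.mean()
        # slope: sum (y_i - y_bar)(x_i - x_bar) / sum (x_i - x_bar)^2
        beta_1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
        # intercept: y_bar - beta_1_hat * x_bar
        beta_0_hat = y_bar - beta_1_hat * x_bar
        return beta_0_hat, beta_1_hat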

Up to this point, none of this has required any assumptions as to the nature of the errors. Continuing on, we intend to make inferences and assess the significance of the model, and for that we will need the assumption that the residuals are normally distributed.

Significance of the Regression

After generating a model, we test whether that model is significant.

    \begin{align*} &H_0\text{:  }\beta_1 = 0\text{ - The regression is not significant}\\ &H_1\text{:  }\beta_1 \neq 0 \text{ - The regression is significant}\\ \end{align*}

The statistic for this test follows Student’s t-distribution with n-2 degrees of freedom.

    \begin{equation*} T = \frac{{\hat \beta}_1 - 0}{SE({\hat \beta}_1)} \sim t_{n-2} \hspace{2cm} SE({\hat \beta}_1) = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^n\varepsilon_i^2}{\sum_{i=1}^n(x_i-{\bar x})^2}} \end{equation*}
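
As a sketch of these formulas in code (again assuming the illustrative ols_coefficients estimates from above; the helper name is mine, not a library API):

    import numpy as np

    def slope_se_and_t(x: np.ndarray, y: np.ndarray,
                       beta_0_hat: float, beta_1_hat: float) -> tuple[float, float]:
        """Return SE(beta_1_hat) and the t statistic for H0: beta_1 = 0."""
        n = len(x)
        residuals = y - (beta_0_hat + beta_1_hat * x)  # eps_i = y_i - y_hat_i
        sigma2_hat = np.sum(residuals ** 2) / (n - 2)  # (1/(n-2)) * sum eps_i^2
        se_beta_1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
        t_stat = (beta_1_hat - 0) / se_beta_1          # compare against t_{n-2}
        return se_beta_1, t_stat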

Assuming the errors are normally distributed, this allows us to make several inferences. First, we can check whether the regression model is significant. To do that, we use the standard two-sided hypothesis test detailed in the hypothesis testing section, with a t-distribution with n-2 degrees of freedom.

    \begin{align*} H_0\text{:  }\beta_1 = 0\\ H_1\text{:  }\beta_1 \neq 0\\ \end{align*}

So we calculate the t statistic for {\hat \beta}_1 and find the p value associated with that statistic. If the p value is less than \alpha, then we reject the null hypothesis and conclude the regression is significant. For completeness’ sake, the standard error of {\hat \beta}_0 can be calculated as:

    \begin{equation*} SE({\hat \beta}_0) = \sqrt{\sigma^2\left(\frac{1}{n} + \frac{{\bar x}^2}{\sum_{i=1}^n(x_i - {\bar x})^2}\right)} \end{equation*}
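
Putting the whole test together in code might look like the following sketch, where scipy.stats.t supplies the two-sided tail probability and the helper name is again illustrative:

    import numpy as np
    from scipy import stats

    def regression_significance(x, y, beta_0_hat, beta_1_hat, alpha=0.05):
        """Two-sided t test of H0: beta_1 = 0, plus SE(beta_0_hat) for completeness."""
        n = len(x)
        residuals = y - (beta_0_hat + beta_1_hat * x)
        sigma2_hat = np.sum(residuals ** 2) / (n - 2)    # estimated error variance
        sxx = np.sum((x - x.mean()) ** 2)

        se_beta_1 = np.sqrt(sigma2_hat / sxx)
        se_beta_0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))

        t_stat = beta_1_hat / se_beta_1
        p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p value
        reject_h0 = p_value < alpha                      # significant if True
        return se_beta_0, se_beta_1, t_stat, p_value, reject_h0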

Confidence Intervals

Using the standard errors of the coefficients derived above, and the fact that the standardized estimates follow a t-distribution, we can generate confidence intervals around \beta_0 and \beta_1 such that we are 100(1-\alpha)\% confident the true regression coefficients lie within those bounds.

    \begin{align*} 100(1-\alpha)\%\text{ CI }\beta_1 &= {\hat \beta}_1 \pm t_{\frac{\alpha}{2},n-2}*SE({\hat \beta}_1)\\ 100(1-\alpha)\%\text{ CI }\beta_0 &= {\hat \beta}_0 \pm t_{\frac{\alpha}{2},n-2}*SE({\hat \beta}_0)\\ \end{align*}
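
In code, these intervals follow directly from the estimates and standard errors computed above (a sketch with an illustrative helper name):

    from scipy import stats

    def coefficient_confidence_intervals(beta_0_hat, beta_1_hat,
                                         se_beta_0, se_beta_1, n, alpha=0.05):
        """100(1 - alpha)% confidence intervals for beta_0 and beta_1."""
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t_{alpha/2, n-2}
        ci_beta_1 = (beta_1_hat - t_crit * se_beta_1, beta_1_hat + t_crit * se_beta_1)
        ci_beta_0 = (beta_0_hat - t_crit * se_beta_0, beta_0_hat + t_crit * se_beta_0)
        return ci_beta_0, ci_beta_1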

Similarly, given the estimates, variances, and covariances of the coefficients, we can generate a confidence interval for the predicted value (or mean) of y at a given x_{new}.

    \begin{align*} 100(1-\alpha)\%\text{ CI }{\hat y}_{new} &= {\hat \beta}_0 + x_{new}{\hat \beta}_1 \pm t_{\frac{\alpha}{2},n-2}*SE({\hat y}_{new})\\ SE({\hat y}_{new}) &= {\hat \sigma}\sqrt{\frac{1}{n} + \frac{(x_{new}-{\bar x})^2}{\sum_{i=1}^{n}(x_i - {\bar x})^2}} \end{align*}

where {\hat \sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n\varepsilon_i^2 is the estimated residual variance.
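
A sketch of this interval in code for the mean response at a new x value (assuming the fitted coefficients from earlier; the helper name is illustrative):

    import numpy as np
    from scipy import stats

    def mean_response_ci(x, y, beta_0_hat, beta_1_hat, x_new, alpha=0.05):
        """100(1 - alpha)% confidence interval for the mean of y at x_new."""
        n = len(x)
        residuals = y - (beta_0_hat + beta_1_hat * x)
        sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual std. dev.
        sxx = np.sum((x - x.mean()) ** 2)

        y_hat_new = beta_0_hat + beta_1_hat * x_new
        se_mean = sigma_hat * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / sxx)
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
        return y_hat_new - t_crit * se_mean, y_hat_new + t_crit * se_mean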

Prediction Intervals

While confidence intervals give boundaries for the parameters of the regression line, prediction intervals operate slightly differently. With prediction intervals, we are 100(1-\alpha)\% confident that new observations will fall within the prediction bounds. In other words, we are not just accounting for the variance of the estimated parameters, but also for the variance of the new observation itself. This has a handy interpretation: we simply add the variance of the observation’s own error to the variance used for the confidence interval to form the new variance term.

    \begin{align*} 100(1-\alpha)\%\text{ PI } y_{new} &= {\hat \beta}_0 + x_{new}{\hat \beta}_1 \pm t_{\frac{\alpha}{2},n-2}SE(y_{new})\\ SE(y_{new}) &= {\hat \sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - {\bar x})^2}{\sum_{i=1}^n(x_i-{\bar x})^2}} \end{align*}
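
The corresponding sketch for a prediction interval on a single new observation differs only by the extra 1 inside the square root (helper name illustrative):

    import numpy as np
    from scipy import stats

    def prediction_interval(x, y, beta_0_hat, beta_1_hat, x_new, alpha=0.05):
        """100(1 - alpha)% prediction interval for a new observation of y at x_new."""
        n = len(x)
        residuals = y - (beta_0_hat + beta_1_hat * x)
        sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual std. dev.
        sxx = np.sum((x - x.mean()) ** 2)

        y_hat_new = beta_0_hat + beta_1_hat * x_new
        se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
        return y_hat_new - t_crit * se_pred, y_hat_new + t_crit * se_pred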