5  Simple Linear Regression

5.1 SLR

Definition 5.1 (A First-Order Model) \[ y=\beta_0+\beta_1x+\varepsilon \] where

  • \(y\): The response variable
  • \(x\): The independent variable (variable used as a predictor of \(y\))
  • \(\operatorname{E}(y)=\beta_0+\beta_1x\): Deterministic component
  • \(\varepsilon\): Random error component
  • \(\beta_0\): \(y\)-intercept
  • \(\beta_1\): Slope

For \(\varepsilon\), we have the following assumptions:

  • \(\varepsilon\) follows a normal distribution for every value of \(x\).
  • The mean of \(\varepsilon\) is \(0\).
  • The variance of \(\varepsilon\) is a constant \(\sigma^2\) (homoscedasticity).
  • The errors associated with any two observations are independent.

In other words, \(\varepsilon_i\sim N(0,\sigma^2)\) i.i.d.
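To make these assumptions concrete, here is a minimal Python sketch (assuming NumPy is available) that simulates data from this model; the parameter values are made up purely for illustration.

```python
import numpy as np

# Simulate the first-order model y = beta_0 + beta_1 * x + eps
# with made-up parameter values.
rng = np.random.default_rng(seed=42)

beta_0, beta_1, sigma = 2.0, 0.5, 1.0          # hypothetical true parameters
n = 50
x = rng.uniform(0, 10, size=n)                  # x treated as fixed once generated
eps = rng.normal(loc=0.0, scale=sigma, size=n)  # i.i.d. N(0, sigma^2) errors
y = beta_0 + beta_1 * x + eps                   # observed responses

print(y[:5])
```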

Definition 5.2 (Regression equation) Simple linear regression equation is \[ \operatorname{E}(y)=\beta_0+\beta_1x. \]

Definition 5.3 (Estimated regression equation) The estimated simple linear regression equation is \[ \hat y=\hat\beta_0+\hat\beta_1x. \] \(\hat\beta_0\) and \(\hat\beta_1\) are estimators of \(\beta_0\) and \(\beta_1\).

The purpose of this section is to estimate \(\beta_0\) and \(\beta_1\), as well as \(\sigma^2\).

5.2 Estimation

5.2.1 OLS estimators

The standard approach is known as ordinary least squares (OLS) estimation. In this method, model parameters are chosen to minimize the sum of squared residuals between the observed responses and the values predicted by the model. From an optimization perspective, OLS minimizes a loss function that measures the discrepancy between the data and the fitted model.

Here are more details.

Consider the linear regression model \[ y_i=\beta_0+\beta_1x_i+\varepsilon_i,\quad i=1,\ldots, n. \]

In order to find the best parameters, we need to minimize the sum of squared errors \[ L(\beta_0, \beta_1) = \sum_{i=1}^n\varepsilon_i^2=\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2. \] Since this loss can be treated as a function of \(\beta_0\) and \(\beta_1\), to minimize it, we could consider the critical point, which is

\[ \pdv{L}{\beta_0}=0,\quad \pdv{L}{\beta_1}=0. \] In other words, we have

\[ \begin{split} \pdv{L}{\beta_0}&=\sum_{i=1}^n2(y_i-\beta_0-\beta_1x_i)(-1)=-2\qty(\sum_{i=1}^ny_i-n\beta_0-\beta_1(\sum_{i=1}^nx_i))=-2n(\bar y-\beta_0-\beta_1\bar x)=0,\\ \pdv{L}{\beta_1}&=\sum_{i=1}^n2(y_i-\beta_0-\beta_1x_i)(-x_i). \end{split} \] The first equation gives us \(\beta_0=\bar y-\beta_1\bar x\). Then the second equation is \[ \begin{split} \pdv{L}{\beta_1}&=\sum_{i=1}^n2(y_i-\beta_0-\beta_1x_i)(-x_i)=-2\sum_{i=1}^n(y_i-\bar y+\beta_1\bar x-\beta_1x_i)(x_i)\\ &=-2\sum_{i=1}^n(y_i-\bar y-\beta_1(x_i-\bar x))(x_i)=-2\sum_{i=1}^n(y_i-\bar y-\beta_1(x_i-\bar x))(x_i-\bar x)\\ &=-2\qty(\sum_{i=1}^n(y_i-\bar y)(x_i-\bar x)-\beta_1\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x))\\ &=-2\qty(SS_{xy}-\beta_1SS_{xx})=0. \end{split} \]

In the computation above, replacing \(x_i\) with \(x_i-\bar x\) is valid because \(\sum_{i=1}^n\qty(y_i-\bar y-\beta_1(x_i-\bar x))\bar x=\bar x\qty(\sum_{i=1}^n(y_i-\bar y)-\beta_1\sum_{i=1}^n(x_i-\bar x))=0\). Therefore, the solution to these equations, i.e., the parameters that minimize the sum of squared errors, is

Theorem 5.1 \[ \hat{\beta}_1=\frac{SS_{xy}}{SS_{xx}},\quad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}. \]
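As a quick numerical companion to Theorem 5.1, the following Python sketch computes \(\hat\beta_0\) and \(\hat\beta_1\) from \(SS_{xy}\) and \(SS_{xx}\); the data values are made up for illustration.

```python
import numpy as np

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) via the closed-form OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    beta1_hat = ss_xy / ss_xx
    beta0_hat = y.mean() - beta1_hat * x.mean()
    return beta0_hat, beta1_hat

# toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(ols_fit(x, y))   # matches np.polyfit(x, y, 1)[::-1] up to floating point
```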

5.2.2 MLE estimators

Since we assume \(\varepsilon_i\sim \mathcal N(0,\sigma^2)\) i.i.d, we could actually compute the likelihood function explicitly. Recall that \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), then \[ \begin{split} p(y_i\mid x_i,\beta_0,\beta_1,\sigma^2)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\qty(-\frac{(y_i-\beta_0-\beta_1x_i)^2}{2\sigma^2}). \end{split} \] Take all data into consideration. Let \(\mathbf x=(x_1,\ldots,x_n)\) and \(\mathbf y=(y_1,\ldots,y_n)\). Assuming that all \(\varepsilon_i\)’s are independent, then the likelihood function is \[ L(\beta_0,\beta_1,\sigma^2\mid \mathbf x, \mathbf y)=\prod_{i=1}^n\frac1{\sqrt{2\pi\sigma^2}}\exp\qty(-\frac{(y_i-\beta_0-\beta_1x_i)^2}{2\sigma^2}). \] We could take negative log to make it simpler \[ \operatorname{nll}(\beta_0,\beta_1,\sigma^2\mid \mathbf x, \mathbf y)=\frac n2\ln\qty(2\pi\sigma^2)+\frac1{2\sigma^2}\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2. \] This derived loss function is called the negative log-likelihood (NLL). Maximizing the likelihood is equivalent to minimizing \(\operatorname{nll}\).

For fixed \(\sigma^2\), minimizing \(\operatorname{nll}\) over \((\beta_0,\beta_1)\) is equivalent to minimizing \(\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2\), so the MLE of \((\beta_0,\beta_1)\) is the same as the OLS estimator.
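This equivalence can also be checked numerically. The sketch below (assuming SciPy is available; the data are made up) minimizes the NLL directly and compares the result with the closed-form OLS estimates.

```python
import numpy as np
from scipy.optimize import minimize

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def nll(params):
    """Negative log-likelihood; params = (beta0, beta1, log_sigma2)."""
    b0, b1, log_s2 = params
    s2 = np.exp(log_s2)                    # parameterize by log(sigma^2) to keep it positive
    resid = y - b0 - b1 * x
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + np.sum(resid ** 2) / (2 * s2)

mle = minimize(nll, x0=np.array([0.0, 0.0, 0.0])).x

# closed-form OLS for comparison
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_xx = np.sum((x - x.mean()) ** 2)
b1_ols = ss_xy / ss_xx
b0_ols = y.mean() - b1_ols * x.mean()

print(mle[:2], (b0_ols, b1_ols))   # the two pairs agree up to optimizer tolerance
```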

To estimate \(\sigma^2\), treat \(\sigma^2\) as a variable and differentiate: \[ \pdv{\operatorname{nll}}{\sigma^2}=\frac{n}2\frac1{\sigma^2}-\frac12\frac{1}{(\sigma^2)^2}\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2=0. \] Setting this derivative to zero yields

Theorem 5.2 \[ \hat\sigma_{MLE}^2=\frac1n\sum_{i=1}^n(y_i-\hat\beta_0-\hat\beta_1x_i)^2=\frac{SSE}{n}. \]

As with the sample variance, the MLE of the variance is biased. An unbiased estimator of the error variance divides \(SSE\) by the residual degrees of freedom \(n-2\): \[ s^2=\frac{SSE}{n-2}. \]
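A short sketch comparing the two variance estimators on made-up data:

```python
import numpy as np

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_xx = np.sum((x - x.mean()) ** 2)
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - b0 - b1 * x) ** 2)
sigma2_mle = sse / n       # biased MLE (Theorem 5.2)
s2 = sse / (n - 2)         # unbiased estimator
print(sigma2_mle, s2)
```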

5.2.3 Properties of these estimators

Theorem 5.3 \[ \operatorname{E}(\hat{\beta}_0)=\beta_0, \quad \operatorname{E}(\hat{\beta}_1)=\beta_1,\quad \operatorname{E}(s^2)=\sigma^2. \]

Click for proof.

The key point is to treat all \(x_i\)’s as constants, and \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\) is a random variable due to \(\varepsilon_i\). Note that \(\operatorname{E}(y_i)=\beta_0+\beta_1x_i\) and \(\operatorname{E}(\bar y)=\frac1n\sum_{i=1}^n\operatorname{E}(y_i)=\beta_0+\beta_1\bar x\). \[ \begin{split} \operatorname{E}(SS_{xy})&=\operatorname{E}\qty(\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y))\\ &=\sum_{i=1}^n(x_i-\bar x)(\operatorname{E}(y_i)-\operatorname{E}(\bar y))\\ &=\sum_{i=1}^n(x_i-\bar x)(\beta_0+\beta_1x_i-\beta_0-\beta_1\bar x)\\ &=\sum_{i=1}^n(x_i-\bar x)\beta_1(x_i-\bar x)\\ &=\beta_1SS_{xx}. \end{split} \] Therefore \[ \operatorname{E}(\hat\beta_1)=\operatorname{E}\qty(\frac{SS_{xy}}{SS_{xx}})=\beta_1. \] We now show that \(\hat\beta_0\) is an unbiased estimator of \(\beta_0\). Recall that \[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x. \] Taking expectations and using linearity of expectation, \[ \begin{aligned} \operatorname{E}(\hat\beta_0) &= \operatorname{E}(\bar y) - \bar x\,\operatorname{E}(\hat\beta_1) \\ &= (\beta_0 + \beta_1 \bar x) - \bar x\,\beta_1 \\ &= \beta_0. \end{aligned} \] Therefore, \(\hat\beta_0\) is an unbiased estimator of \(\beta_0\).

For \(s^2\), the argument requires some facts about the \(\chi^2\) distribution.

  • \(\varepsilon_i\sim \mathcal N(0,\sigma^2)\) i.i.d, and \(x_i\)’s are all constants.
  • \(e_i=y_i-\hat y_i=\varepsilon_i-(\hat\beta_0-\beta_0)-(\hat\beta_1-\beta_1)x_i\). Note that \(\hat\beta_0-\beta_0\) and \(\hat\beta_1-\beta_1\) are functions of all \(\varepsilon_i\)’s, so each residual depends on the entire error vector.
  • Since \(SSE=\sum_{i=1}^ne_i^2\), by Theorem 2.3, one can show that \(\frac{SSE}{\sigma^2}\sim \chi_{n-2}^2\). The loss of 2 degrees of freedom comes from estimating \(\beta_0\) and \(\beta_1\).
  • Therefore \(\operatorname{E}\qty(\frac{SSE}{\sigma^2})=n-2\).
  • Therefore \(\operatorname{E}(s^2)=\frac1{n-2}\operatorname{E}(SSE)=\sigma^2\).

Theorem 5.4 \[ \operatorname{Var}(\hat{\beta}_1)=\frac{\sigma^2}{SS_{xx}}, \quad \operatorname{Var}(\hat{\beta}_0)=\qty(\frac{1}{n}+\frac{\bar x^2}{SS_{xx}})\sigma^2, \quad \operatorname{Var}(s^2)=\frac{2\sigma^4}{n-2}. \]

Click for proof.

\[ \operatorname{Var}(\hat{\beta}_1)=\frac{1}{SS_{xx}^2}\operatorname{Var}(SS_{xy})=\frac{1}{SS_{xx}^2}\operatorname{Var}\qty(\sum_{i=1}^n(x_i-\bar x)y_i)=\frac{1}{SS_{xx}^2}\sum_{i=1}^n(x_i-\bar x)^2\sigma^2=\frac{\sigma^2}{SS_{xx}}, \] where we used \(SS_{xy}=\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)=\sum_{i=1}^n(x_i-\bar x)y_i\) and the independence of the \(y_i\)’s.

Since \(\operatorname{Cov}(\bar y ,\hat\beta_1)=0\) (Lemma 5.1), we have \[ \operatorname{Var}(\hat{\beta}_0)=\operatorname{Var}\qty(\bar y-\hat{\beta}_1\bar x)=\operatorname{Var}(\bar y)+\bar{x}^2\operatorname{Var}(\hat{\beta}_1)=\frac{\sigma^2}{n}+\bar x^2\frac{\sigma^2}{SS_{xx}}=\qty(\frac{1}{n}+\frac{\bar x^2}{SS_{xx}})\sigma^2. \]

Since \(\frac{SSE}{\sigma^2}\sim\chi^2_{n-2}\), \(\operatorname{Var}\qty(\frac{SSE}{\sigma^2})=2(n-2)\). Therefore \[ \operatorname{Var}(s^2)=\operatorname{Var}\qty(\frac{SSE}{n-2})=\frac1{(n-2)^2}\operatorname{Var}(SSE)=\frac{(\sigma^2)^2\cdot 2(n-2)}{(n-2)^2}=\frac{2\sigma^4}{n-2}. \]

Note that these variance formulas involve \(\sigma^2\). They come from a theoretical (population) analysis, so they can depend on unknown parameters. If we want numerical estimates of these variances from data, we must replace the unknown quantities with estimators and rebuild the variance calculations accordingly (typically by plugging in \(s^2\) for \(\sigma^2\), and using the sample versions of any other unknown terms).
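For example, a minimal sketch of this plug-in step (made-up data; the formulas follow Theorem 5.4 with \(s^2\) in place of \(\sigma^2\)):

```python
import numpy as np

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)

ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)           # plug-in estimate of sigma^2

se_b1 = np.sqrt(s2 / ss_xx)                              # estimated sd of beta1_hat
se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / ss_xx))    # estimated sd of beta0_hat
print(se_b0, se_b1)
```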

5.3 Inference for \(\beta_1\)

The most important question for \(\beta_1\) is whether it is \(0\). In other words, we want to know whether \(X\) and \(Y\) have a linear relation.

  • Null hypothesis: \(H_0: \beta_1=0\).
  • Alternative hypothesis: \(H_a: \beta_1\neq0\).

Since \(\hat{\beta}_1=\beta_1+\sum_{i=1}^n\qty(\dfrac{x_i-\bar x}{SS_{xx}})\varepsilon_i\) is normally distributed, and \(\sigma\) must be estimated by \(s\), we use a t-test.

\[ t=\frac{\hat{\beta}_1-0}{s_{\hat{\beta}_1}}=\frac{\hat{\beta}_1}{s/\sqrt{SS_{xx}}},\quad s^2=\frac{SSE}{n-2} \]

We mainly use a two-tailed test. Therefore we compute the corresponding p-value, which is the probability of getting a statistic at least as extreme as the observed value, assuming \(H_0\) is true. Let \(t_c=\frac{\hat\beta_1}{s/\sqrt{SS_{xx}}}\) be the observed test statistic. Then

\[ p\text{-value} = \Pr(\abs{t}>\abs{t_c}\mid H_0), \] where \(t\sim t_{n-2}\) under \(H_0\). We reject \(H_0\) if the p-value is smaller than or equal to \(\alpha\).

  • If we reject \(H_0\), there is sufficient statistical evidence at level \(\alpha\) to conclude that \(\beta_1\neq 0\). In this case, we say the linear relationship between \(X\) and \(Y\) is statistically significant.
  • If we fail to reject \(H_0\), we do not have enough evidence to determine whether \(\beta_1\) differs from \(0\). In this case, we say the relationship is not statistically significant.
Important

Failing to reject \(H_0\) does not mean \(\beta_1=0\). It means there is insufficient evidence, given the data and model, to conclude that \(\beta_1\neq0\).

5.3.1 Confidence interval

Since \(\operatorname{Var}(\hat\beta_1)=\frac{\sigma^2}{SS+{xx}}\approx\frac{s^2}{SS_{xx}}\), the standard deviation of \(\hat\beta_1\) can be estimated by \(s_{\hat\beta_1}=s/\sqrt{SS_{xx}}\). Then the \((1-\alpha)\times 100\%\) confidence interval for \(\beta_1\) is \[ CI=\hat\beta_1\pm t_{\alpha/2,n-2}\cdot \frac{s}{\sqrt{SS_{xx}}}. \] In the hypothesis test described above, the null hypothesis is rejected if the observed \(t_c\) is sufficiently extreme, that is,

\[ \abs{t_c}=\abs{\frac{\hat\beta_1}{s/\sqrt{SS_{xx}}}}>t_{\alpha/2, n-2}. \] Equivalently, \[ \abs{\hat\beta_1}>t_{\alpha/2, n-2}\cdot \frac{s}{\sqrt{SS_{xx}}}. \] Therefore, rejecting the null hypothesis at level \(\alpha\) is equivalent to the \((1-\alpha)\times100\%\) confidence interval for \(\beta_1\) not containing \(0\).
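Putting the test and the interval together, here is a hedged Python sketch (SciPy assumed available; data made up) that computes \(t_c\), the two-tailed p-value, and the confidence interval for \(\beta_1\):

```python
import numpy as np
from scipy import stats

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)
alpha = 0.05

ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

t_c = b1 / (s / np.sqrt(ss_xx))                  # observed test statistic
p_value = 2 * stats.t.sf(abs(t_c), df=n - 2)     # two-tailed p-value
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_crit * s / np.sqrt(ss_xx), b1 + t_crit * s / np.sqrt(ss_xx))

print(t_c, p_value, ci)   # p <= alpha exactly when 0 lies outside the CI
```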

5.4 Sum of squares

Theorem 5.5 (The fundamental ANOVA identity) Define

  • \(SSE=\sum (y_i-\hat y_i)^2\): sum of squared errors
  • \(SSR=\sum (\hat y_i-\bar y)^2\): sum of squares regression
  • \(SST=\sum (y_i-\bar y)^2\): sum of squares total (also denoted \(SS_{yy}\))

These quantities satisfy \[ SST = SSR+SSE. \]

Click for proof. \[ \begin{split} SSR&=\sum (\hat y_i-\bar y)^2=\sum (\hat\beta_0+\hat\beta_1x_i-\bar y)^2=\sum (\bar y-\hat\beta_1\bar x+\hat\beta_1x_i-\bar y)^2\\ &=\sum (\hat\beta_1(x_i-\bar x))^2=(\hat\beta_1)^2\sum (x_i-\bar x)^2\\ &=\frac{SS_{xy}^2}{SS_{xx}^2}SS_{xx}=\frac{SS_{xy}^2}{SS_{xx}},\\ SSE&=\sum ( y_i-\hat y_i)^2=\sum (y_i-\hat\beta_0-\hat\beta_1x_i)^2=\sum (y_i-\bar y+\hat\beta_1\bar x-\hat\beta_1x_i)^2\\ &=\sum ((y_i-\bar y)-\hat\beta_1(x_i-\bar x))^2=\sum \qty[(y_i-\bar y)^2-2\hat\beta_1(x_i-\bar x)(y_i-\bar y)+\hat\beta_1^2(x_i-\bar x)^2]\\ &=\sum (y_i-\bar y)^2-2\hat\beta_1\sum(x_i-\bar x)(y_i-\bar y)+(\hat\beta_1)^2\sum(x_i-\bar x)^2\\ &=SST-2\frac{SS_{xy}}{SS_{xx}}SS_{xy}+\qty(\frac{SS_{xy}}{SS_{xx}})^2SS_{xx}\\ &=SST-2\frac{SS_{xy}^2}{SS_{xx}}+\frac{SS_{xy}^2}{SS_{xx}}=SST-\frac{SS_{xy}^2}{SS_{xx}}\\ &=SST-SSR. \end{split} \] Therefore \[ SST=SSR+SSE. \]

These three quantities are essential for interpreting regression models.

  • SST measures the total variability in the response variable.
  • SSR measures the variation explained by the regression model.
  • SSE measures the unexplained (residual) variation.

Since the variation measured by SSR is captured by the fitted regression line, it is considered explained by the model, while SST describes the total variation. We would therefore like the proportion explained by the model, \(SSR/SST\), to be as large as possible. Because this ratio is so important, we give it its own symbol. A small numerical check of the identity and of the ratio \(SSR/SST\) on made-up data is given below.
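```python
import numpy as np

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)

print(sst, ssr + sse)   # equal up to floating-point error (Theorem 5.5)
print(ssr / sst)        # proportion of variation explained
```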

5.5 Coefficient of Determination

The coefficient of determination is defined as \[ r^2=\frac{SST-SSE}{SST}=1-\frac{SSE}{SST}=\frac{SSR}{SST}. \] It represents the proportion of the total sample variability in \(Y\) that is explained by the linear relationship between \(Y\) and \(X\). Equivalently, we have the following statement.

Note: \(r^2\) interpretation

About \(100(r^2)\%\) of the total variation in the sample \(y\)-values (measured by the total sum of squares about \(\bar y\)) can be explained by using \(X\) to predict \(Y\) in the simple linear regression model.

5.6 Coefficient of Correlation

Recall that the Pearson coefficient of correlation is defined as follows. \[ r=\frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}. \]

Since \(\hat\beta_1=\frac{SS_{xy}}{SS_{xx}}\), we have

\[ r=\hat\beta_1\frac{\sqrt{SS_{xx}}}{\sqrt{SS_{yy}}}=\hat\beta_1\frac{s_x}{s_y}, \] where \(s_x\) and \(s_y\) are the sample standard deviations of \(X\) and \(Y\). This shows that \(r\) is a standardized slope, whereas \(\hat\beta_1\) depends on the units of measurement of \(X\) and \(Y\).

This \(r\) and the previous \(r^2\) are related. \[ \abs{r}=\frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}=\sqrt{\frac{SS_{xy}^2}{SS_{xx}SS_{yy}}}=\sqrt{\frac{SSR}{SST}}=\sqrt{r^2}. \] This is the reason we choose these notations. Note that the sign of \(r\) is lost in this computation: when using \(\sqrt{r^2}\) to recover \(r\), we have to choose the sign manually, and it matches the sign of \(\hat\beta_1\) (equivalently, of \(SS_{xy}\)).

Since \(r\) and \(\hat\beta_1\) essentially describe the same linear relationship, we can also use \(r\) to perform the hypothesis test. We have \[ t=\frac{\hat\beta_1}{s/\sqrt{SS_{xx}}}=\frac{\hat\beta_1 \sqrt{SS_{xx}}}{s}=r \sqrt{\frac{SS_{yy}}{SSE/(n-2)}}=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}. \]
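The following sketch (made-up data) computes \(r\) and verifies that the t statistic based on \(r\) matches the one based on \(\hat\beta_1\):

```python
import numpy as np

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)

ss_xx = np.sum((x - x.mean()) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))

r = ss_xy / np.sqrt(ss_xx * ss_yy)     # same as np.corrcoef(x, y)[0, 1]
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

t_from_slope = b1 / (s / np.sqrt(ss_xx))
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
print(r, r ** 2, t_from_slope, t_from_r)   # the two t values coincide
```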

5.7 The ANOVA table

Now that we have defined these sums of squares, they are typically organized into an ANOVA Table. This table provides a convenient summary for assessing the overall significance of the regression model using the F-test.

Source       df      Sum of Squares (SS)   Mean Square (MS)      F
Regression   1       SSR                   MSR = SSR / 1         F = MSR / MSE
Error        n − 2   SSE                   MSE = SSE / (n − 2)
Total        n − 1   SST

The F-test examines whether the regression model, as a whole, explains a significant portion of the variability in the response variable. In the case of simple linear regression, which contains only one independent variable, the F-test and the t-test for \(\beta_1\) are equivalent. In particular, the F-statistic is exactly the square of the corresponding t-statistic:

\[ F=t^2. \]

We will discuss the F-test in more details when we move on to multiple linear regression (MLR).
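As a sketch of the table's computations on made-up data (SciPy assumed available for the F distribution), the code below forms MSR, MSE, and F, and checks that \(F=t^2\):

```python
import numpy as np
from scipy import stats

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)

ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
msr = ssr / 1
mse = sse / (n - 2)

F = msr / mse
t_stat = b1 / np.sqrt(mse / ss_xx)
p_value = stats.f.sf(F, dfn=1, dfd=n - 2)

print(F, t_stat ** 2)   # F equals t^2
print(p_value)          # same p-value as the two-tailed t-test
```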

5.8 Prediction interval and Confidence interval

When using a regression model, we need to quantify how uncertain our predictions are. There are two different types of intervals, depending on what we wish to estimate:

  • a confidence interval for the mean response \(\operatorname{E}(Y\mid X=x)\)
  • a prediction interval for a new individual observation \(Y\mid X=x\).

5.8.1 Estimate the variance

First, a preliminary result: viewed as random variables, \(\bar y\) and \(\hat\beta_1\) are uncorrelated (their covariance is \(0\)).

Lemma 5.1 \(\operatorname{Cov}(\bar y, \hat\beta_1)=0\).

Click for proof.

\[ \begin{split} \bar y&=\beta_0+\beta_1\bar x+\bar\varepsilon,\\ \hat\beta_1&=\frac{1}{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)=\frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)y_i\\ &=\frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)(\beta_0+\beta_1x_i+\varepsilon_i)\\ &=\frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)(\beta_0+\beta_1x_i)+\frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)\varepsilon_i. \end{split} \] Since \(\beta_0+\beta_1\bar x\) and \(\frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)(\beta_0+\beta_1x_i)\) are constants, and \(\bar\varepsilon=\frac1n\sum_{j=1}^n\varepsilon_j\), we have \[ \begin{split} \operatorname{Cov}(\bar y,\hat\beta_1)&=\operatorname{Cov}\qty(\bar\varepsilon,\ \frac1{SS_{xx}}\sum_{i=1}^n(x_i-\bar x)\varepsilon_i)=\frac1{nSS_{xx}}\operatorname{Cov}\qty(\sum_{j=1}^n\varepsilon_j,\ \sum_{i=1}^n(x_i-\bar x)\varepsilon_i)\\ &=\frac1{nSS_{xx}}\sum_{i,j=1}^n(x_i-\bar x)\operatorname{Cov}(\varepsilon_j,\varepsilon_i)=\frac1{nSS_{xx}}\sum_{i=1}^n(x_i-\bar x)\sigma^2\\ &=\frac{\sigma^2}{nSS_{xx}}\sum_{i=1}^n(x_i-\bar x)=0. \end{split} \]
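A Monte Carlo sanity check of the lemma, under made-up parameter values, is sketched below: across many simulated samples, the empirical covariance of \(\bar y\) and \(\hat\beta_1\) is close to \(0\).

```python
import numpy as np

# Monte Carlo check of Lemma 5.1 with made-up parameter values.
rng = np.random.default_rng(0)
beta_0, beta_1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 30)                 # fixed design points
ss_xx = np.sum((x - x.mean()) ** 2)

ybars, b1s = [], []
for _ in range(20_000):
    y = beta_0 + beta_1 * x + rng.normal(0, sigma, size=x.size)
    b1s.append(np.sum((x - x.mean()) * (y - y.mean())) / ss_xx)
    ybars.append(y.mean())

print(np.cov(ybars, b1s)[0, 1])   # empirical covariance, close to 0
```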

Theorem 5.6 Let \(x\) be a fixed value.

  1. The variance of the estimated mean response \(\hat y=\hat\beta_0+\hat\beta_1x\) is \[ \operatorname{Var}(\hat y)=\sigma^2\qty[\frac1n+\frac{(x-\bar x)^2}{SS_{xx}}]. \]
  2. The variance of the prediction error \(y-\hat y\) for a new observation \(y\) at \(x\) is \[ \operatorname{Var}(y-\hat y)=\sigma^2\qty[1+\frac1n+\frac{(x-\bar x)^2}{SS_{xx}}]. \]
Click for proof.

Since \(\bar y\) and \(\hat{\beta}_1\) are uncorrelated, we have \[ \begin{split} \operatorname{Var}(\hat y)&=\operatorname{Var}(\hat\beta_0+\hat\beta_1x)=\operatorname{Var}(\bar y+\hat\beta_1(x-\bar x))=\operatorname{Var}(\bar y)+(x-\bar x)^2\operatorname{Var}(\hat\beta_1)\\ &=\frac1n\sigma^2+(x-\bar x)^2\frac{\sigma^2}{SS_{xx}}\\ &=\sigma^2\qty[\frac1n+\frac{(x-\bar x)^2}{SS_{xx}}]. \end{split} \] For a new observation \(y=\beta_0+\beta_1x+\varepsilon\), the new error \(\varepsilon\) is independent of \(\hat y\) (which depends only on the errors in the sample), so \[ \begin{split} \operatorname{Var}(y-\hat y)&=\operatorname{Var}(y)+\operatorname{Var}(\hat y)=\sigma^2+\sigma^2\qty[\frac1n+\frac{(x-\bar x)^2}{SS_{xx}}]\\ &=\sigma^2\qty[1+\frac1n+\frac{(x-\bar x)^2}{SS_{xx}}]. \end{split} \]

Therefore, these two variances lead to:

  • a confidence interval for \(\operatorname{E}(Y\mid X=x)\)
  • a prediction interval for a new \(Y\mid X=x\).

The prediction interval is always wider, because it includes both estimation uncertainty and irreducible error.
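To close, here is a hedged sketch (SciPy assumed; made-up data and a hypothetical new value \(x=3.5\)) computing both intervals at the same \(x\):

```python
import numpy as np
from scipy import stats

# toy data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(y)
alpha = 0.05
x_new = 3.5                                   # hypothetical value of x

ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_xx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

y_hat = b0 + b1 * x_new
se_mean = s * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / ss_xx)      # for E(Y | x)
se_pred = s * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / ss_xx)  # for a new Y

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
print(ci, pi)   # the prediction interval is wider
```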