8 Multiple Linear Regression
\[ \require{physics} \require{braket} \]
\[ \newcommand{\dl}[1]{{\hspace{#1mu}\mathrm d}} \newcommand{\me}{{\mathrm e}} \newcommand{\argmin}{\operatorname{argmin}} \]
\[ \newcommand{\pdfbinom}{{\tt binom}} \newcommand{\pdfbeta}{{\tt beta}} \newcommand{\pdfpois}{{\tt poisson}} \newcommand{\pdfgamma}{{\tt gamma}} \newcommand{\pdfnormal}{{\tt norm}} \newcommand{\pdfexp}{{\tt expon}} \]
\[ \newcommand{\distbinom}{\operatorname{B}} \newcommand{\distbeta}{\operatorname{Beta}} \newcommand{\distgamma}{\operatorname{Gamma}} \newcommand{\distexp}{\operatorname{Exp}} \newcommand{\distpois}{\operatorname{Poisson}} \newcommand{\distnormal}{\operatorname{\mathcal N}} \]
8.1 General form
Definition 8.1 \[ y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k+\varepsilon \] where
- \(y\) is the dependent variable (response variable)
- \(x_1,\ldots,x_k\) are the independent variables (predictors)
- \(E(y)=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k\) is the deterministic portion of the model
- \(\beta_i\) determines the contribution of the independent variable \(x_i\)
A variable is any quantity used to help explain or predict the response. A main effect refers to an original predictor, before any interactions or transformations are applied.
- If \(x_2=x_1^2\), \(x_2\) is called a higher-order term.
- If \(x_3=x_1x_2\), \(x_3\) is called an interaction term.
- If \(x_4=1\) in some cases and \(0\) in others, \(x_4\) is called a coded (dummy) variable. This is an example of a categorical variable in regression.
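In R, a categorical predictor is expanded into such coded columns automatically. A minimal sketch using the built-in iris data (the data set choice is illustrative, not from the text):

```r
# Sketch: how R codes a categorical predictor into 0/1 (dummy) columns;
# iris is a built-in data set used purely for illustration.
M <- model.matrix(~ Species, data = iris)
head(M)   # an intercept column plus one 0/1 column per non-baseline level
```

Each non-baseline level of the factor gets its own \(0\)/\(1\) column, exactly the coded variables described above.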
| Model | Name of the Model |
|---|---|
| \(y = \beta_0 + \beta_1 x_1 + \varepsilon\) | • Simple linear regression • Simple first-order model with one predictor |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\) | • Multiple linear regression with three predictors • First-order model with three predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon\) | • Interaction model with two predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon\) | • Second-order model with one predictor |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon\) | • First-order model with five predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon\) | • Second-order model with two predictors |
The steps are very similar to simple linear regression.
- Collect data.
- Hypothesize the form of the model.
- Estimate the unknown parameters \(\beta_0,\ldots,\beta_k\).
- Specify the distribution of \(\varepsilon\) and estimate its variance \(\sigma^2\).
- Evaluate the utility of the model, e.g., by checking p-values and \(R^2\).
- Verify the model by checking the assumptions on \(\varepsilon\).
- Use the model. Analyze the prediction and make inferences.
\[ y=\underbrace{\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k}_{\text{Deterministic portion of model}}+\underbrace{\varepsilon}_{\text{Random error}}. \]
- For any given \(x_i\), \(\varepsilon\sim \distnormal(0,\sigma^2)\).
- The random errors are independent.
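The steps above can be sketched with R's built-in lm() function. The data set (mtcars) and the chosen predictors below are illustrative stand-ins, not from the text:

```r
# Sketch of the modeling workflow with lm(); data and predictors are
# illustrative assumptions, not from the text.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # hypothesize a first-order model
summary(fit)     # individual t-tests, global F-test, R^2, estimate of sigma
confint(fit)     # confidence intervals for beta_0, beta_1, beta_2
predict(fit, newdata = data.frame(wt = 3, hp = 110))  # use the model
```

The residuals from resid(fit) can then be plotted to verify the assumptions on \(\varepsilon\).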
8.2 A first-order model with quantitative predictors
Definition 8.2 (A First-Order Model in \(k\) Quantitative Independent Variables) \[ E(y)=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k \] where \(x_1,\ldots,x_k\) are \(k\) quantitative independent variables.
Here independent means that no variable is a function of the other variables in the model.
After inputting the data, we have the following equations.
\[ y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_kx_{ik}+\epsilon_i,\quad i=1,\ldots,n. \]
Writing them in terms of matrices:
\[ \mathbf{y}=\mqty[y_1\\\vdots\\y_n],\quad \mathbf{X}=\mqty[1&x_{11}&x_{12}&\ldots&x_{1k}\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&x_{n1}&x_{n2}&\ldots&x_{nk}],\quad \boldsymbol{\beta}=\mqty[\beta_0\\\beta_1\\\vdots\\\beta_k],\quad \boldsymbol{\epsilon}=\mqty[\epsilon_1\\\vdots\\\epsilon_n]. \]
Here \(X\) is called a design matrix. It is simply the matrix that contains all predictor information:
- Rows: observations
- Columns: variables (including the intercept column of \(1\)'s)
Then the model can be expressed as \[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon} \]
The OLS estimator minimizes the sum of squared residuals \[ (\mathbf y-\mathbf X\boldsymbol{\beta})^T(\mathbf y-\mathbf X\boldsymbol{\beta}). \]
Therefore \[ \boldsymbol{\hat{\beta}}=\argmin_{\beta}(\mathbf y-\mathbf X\boldsymbol{\beta})^T(\mathbf y-\mathbf X\boldsymbol{\beta}). \]
Theorem 8.1 (The OLS estimator) The following estimator is an unbiased estimator of \(\boldsymbol{\beta}\): \[ \boldsymbol{\hat{\beta}}=\mathbf{(X^T X)^{-1}X^T y} \] In addition, \[ \operatorname{Var}(\boldsymbol{\hat{\beta}})=\mathbf{(X^T X)}^{-1}\sigma^2. \]
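A quick numerical check of Theorem 8.1, comparing the closed-form estimator with lm() on R's built-in mtcars data (the data set and predictors are illustrative):

```r
# Sketch: the closed-form OLS estimator of Theorem 8.1 checked against lm();
# mtcars and the chosen predictors are illustrative assumptions.
X <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(mpg ~ wt + hp, data = mtcars)))   # columns agree
```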
When \(k=1\), we have
\[ \mathbf X=\mqty[1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n],\quad \mathbf y=\mqty[y_1\\\vdots\\y_n]. \] Then
\[ \begin{split} \mathbf{X^T X}&=\mqty[1&1&\ldots&1\\x_1&x_2&\ldots&x_n]\mqty[1&x_1\\1&x_2\\\ldots&\\1&x_n]=\mqty[n&n\bar{x}\\n\bar x&\sum x_i^2],\\ \mathbf{X^T y}&=\mqty[1&1&\ldots&1\\x_1&x_2&\ldots&x_n]\mqty[y_1\\\ldots\\y_n]=\mqty[n\bar y\\\sum x_iy_i],\\ \mathbf{(X^T X)^{-1}X^T y}&=\mqty[n&n\bar{x}\\n\bar x&\sum x_i^2]^{-1}\mqty[n\bar y\\\sum x_iy_i]\\ &=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[\sum x_i^2&-n\bar x\\-n\bar x&n]\mqty[n\bar y\\\sum x_iy_i]\\ &=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[n(\sum x_i^2)\bar y-n\bar x(\sum x_iy_i)\\n(\sum x_iy_i)-n^2\bar x\bar y]\\ &=\frac{1}{(\sum x_i^2)-n\bar x^2}\mqty[(\sum x_i^2)\bar y-\bar x(\sum x_iy_i)\\(\sum x_iy_i)-n\bar x\bar y]\\ &=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y+n\bar x^2\bar y-\bar x(\sum x_iy_i)\\\sum x_iy_i-n\bar x\bar y]\\ &=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y-\bar x(\sum x_iy_i-n\bar x\bar y)\\\sum x_iy_i-n\bar x\bar y]. \end{split} \]
Note that
\[ \begin{split} S_{xx}&=\sum (x_i-\bar x)^2=\sum(x_i^2-2x_i\bar x+\bar x^2)=\sum x_i^2-2(\sum x_i)\bar x+n\bar x^2=\sum x_i^2-n\bar x^2,\\ % S_{yy}&=\sum (y_i-\bar y)^2=\sum(y_i^2-2y_i\bar y+\bar y^2)=\sum y_i^2-2(\sum y_i)\bar y+n\bar y^2=\sum y_i^2-n\bar y^2,\\ S_{xy}&=\sum (x_i-\bar x)(y_i-\bar y)=\sum(x_iy_i-x_i\bar y-y_i\bar x+\bar x\bar y)\\ &=\sum x_iy_i-(\sum x_i)\bar y-(\sum y_i)\bar x+n\bar x\bar y=\sum x_iy_i-n\bar x\bar y, \end{split} \]
We have
\[ \begin{split} \mathbf{(X^T X)^{-1}X^T y}&=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y-\bar x(\sum x_iy_i-n\bar x\bar y)\\\sum x_iy_i-n\bar x\bar y]\\ &=\frac{1}{S_{xx}}\mqty[S_{xx}\bar y-\bar x S_{xy}\\S_{xy}]=\mqty[\bar y-\bar x(S_{xy}/S_{xx})\\S_{xy}/S_{xx}]. \end{split} \]
Therefore in the case of \(k=1\), \(\hat{\beta}_1=S_{xy}/S_{xx}\) and \(\hat{\beta}_0=\bar y-\hat{\beta}_1\bar x\), matching the simple linear regression formulas.
In addition, the variance of \(\hat{\beta}_1\) is the (2,2)-entry of the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\). Since
\[ \begin{split} \operatorname{Var}(\boldsymbol{\hat{\beta}})=\mathbf{(X^T X)}^{-1}\sigma^2=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[\sum x_i^2&-n\bar x\\-n\bar x&n]\sigma^2, \end{split} \] the (2,2)-entry is \[ \begin{split} \operatorname{Var}(\hat{\beta_1})&=\frac{n}{n(\sum x_i^2-n\bar x^2)}\sigma^2=\frac{\sigma^2}{S_{xx}}. \end{split} \]
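These \(k=1\) formulas can be verified numerically on R's built-in cars data (the data set choice is illustrative):

```r
# Sketch: numerical check of the k = 1 formulas; the cars data set is an
# illustrative assumption.
x <- cars$speed; y <- cars$dist
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
fit <- lm(dist ~ speed, data = cars)
c(b0, b1)                 # matches coef(fit)
s <- summary(fit)$sigma   # estimate of sigma
s / sqrt(Sxx)             # matches the reported Std. Error of the slope
```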
8.3 Collinearity
As the estimation above shows, correlation among the predictors does not affect the unbiasedness of the OLS estimators; it does, however, affect their variance.
Lemma 8.1 \[ \operatorname{Var}(\hat\beta_j)\propto\frac{1}{1-R^2_j}. \]
Proof.
Let the model be \(y=X\beta+\varepsilon\). Let the j-th predictor column be \(x_j\) and \(X_{-j}\) be the matrix of all other predictor columns. Rewrite the design matrix as \[ X=\mqty[x_j&X_{-j}]. \]
Then \[ X^TX=\mqty[x_j^Tx_j&x_j^TX_{-j}\\X^T_{-j}x_j&X^T_{-j}X_{-j}]. \]
Consider \[ \mqty[A&B\\C&D]=\underbrace{\mqty[I&BD^{-1}\\0&I]}_{(I)}\underbrace{\mqty[A-BD^{-1}C&0\\C&D]}_{(II)}. \] Let \(A-BD^{-1}C=S\). Then \[ (I)^{-1}=\mqty[I&-BD^{-1}\\0&I],\quad (II)^{-1}=\mqty[S^{-1}&0\\-D^{-1}CS^{-1}&D^{-1}], \] so \[ \mqty[A&B\\C&D]^{-1}=(II)^{-1}(I)^{-1}=\mqty[S^{-1}&0\\-D^{-1}CS^{-1}&D^{-1}]\mqty[I&-BD^{-1}\\0&I]=\mqty[S^{-1}&-S^{-1}BD^{-1}\\-D^{-1}CS^{-1}&D^{-1}(CS^{-1}BD^{-1}+I)] \] Therefore the upper-left corner of the inverse matrix is \(S^{-1}=(A-BD^{-1}C)^{-1}\). Then \[ (X^TX)^{-1}_{jj}=(x_j^Tx_j-x^T_jX_{-j}(X^T_{-j}X_{-j})^{-1}X^T_{-j}x_j)^{-1}. \] Let \(P_j=X_{-j}(X^T_{-j}X_{-j})^{-1}X^T_{-j}\). We have \[ \begin{split} \operatorname{Var}(\hat\beta_j)&=\mqty[(X^TX)^{-1}\sigma^2]_{jj}=(X^TX)^{-1}_{jj}\sigma^2\\ &=\frac{\sigma^2}{x_j^Tx_j-x^T_jP_jx_j}=\frac{\sigma^2}{x_j^T(I-P_j)x_j}. \end{split} \]
On the other hand, consider regressing \(x_j\) on the other predictors, \(x_j=X_{-j}\boldsymbol\gamma+\varepsilon_j\). The residual sum of squares of this auxiliary regression is \(SSE_j=x_j^T(I-P_j)x_j\). Since \(1-\frac{SSE_j}{SST_j}=R_j^2\), we have \[ x_j^T(I-P_j)x_j = SST_j(1-R_j^2). \] So \[ \begin{split} \operatorname{Var}(\hat\beta_j)&=\frac{\sigma^2}{x_j^T(I-P_j)x_j}=\frac{\sigma^2}{SST_j}\frac{1}{1-R_j^2}\propto\frac{1}{1-R_j^2}. \end{split} \]
Definition 8.3 (The variance inflation factor) The variance inflation factor for the jth coefficient is \[ VIF_j=\frac{1}{1-R^2_j} \] where \(R_j^2\) is the \(R^2\) of the regression of \(x_j\) on the other predictors \(x_i\).
In other words, we try to explain \(x_j\) using the other variables. From the proof of Lemma 8.1, this is the auxiliary regression of \(x_j\) on \(X_{-j}\). If it fits well, \(R_j^2\) is close to \(1\); if not, \(R_j^2\) is close to \(0\). Accordingly, if one variable is collinear with the others, its VIF approaches \(\infty\); if it is unrelated to them, its VIF is close to \(1\).
By Lemma 8.1, when \(R_j^2\) is large, \(VIF_j\) is large. In other words, if one variable depends heavily on the others (so \(R_j^2\) is high), the variance of \(\hat\beta_j\) is also very high. The standard error of \(\hat\beta_j\) is then large, the corresponding t-statistic is small, and the p-value is large, so the variable may well appear non-significant. In short, when there is redundant information, an individual variable can be non-significant even though the group as a whole is significant.
Usually if VIF is more than 5 or 10, multicollinearity may be a problem.
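A sketch of computing \(VIF_j\) by hand, regressing each predictor on the others; the mtcars predictors are illustrative assumptions (the car package's vif() returns the same quantities):

```r
# Sketch: VIF_j computed by hand via the auxiliary regressions;
# the mtcars predictors are illustrative assumptions.
r2 <- function(f) summary(f)$r.squared
vif_wt   <- 1 / (1 - r2(lm(wt ~ hp + disp, data = mtcars)))
vif_hp   <- 1 / (1 - r2(lm(hp ~ wt + disp, data = mtcars)))
vif_disp <- 1 / (1 - r2(lm(disp ~ wt + hp, data = mtcars)))
c(wt = vif_wt, hp = vif_hp, disp = vif_disp)  # wt and disp are strongly correlated
```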
Also the interpretation of coefficients breaks down. The basic interpretation of \(\beta_j\) is that "holding the others fixed, a 1-unit increase in \(x_j\) changes \(y\) by \(\beta_j\)". When strong multicollinearity exists, the interpretation "holding other variables fixed" becomes unrealistic because the predictors cannot vary independently in the observed data. As a result, individual regression coefficients no longer represent meaningful causal or marginal effects, even if the model still predicts well.
In MLR, multicollinearity among the independent variables does not affect the unbiasedness of the OLS estimators. In other words, even if the independent variables are correlated, the OLS estimator is still BLUE (the Best Linear Unbiased Estimator) of \(\boldsymbol{\beta}\), as long as the collinearity is not perfect and the other standard assumptions hold. However, the variance and the stability of the estimates may be affected, which in turn affects later inference.
- Multicollinearity inflates the variance of the estimated coefficients.
- Even if a variable is genuinely important, its coefficient might not be statistically significant due to high standard errors.
- The model might struggle to separate the effects of correlated predictors.
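A tiny simulation illustrating these effects; all settings (sample size, coefficients, noise levels) are illustrative assumptions:

```r
# Sketch: simulation showing how near-collinearity inflates the standard
# error of a coefficient; all settings are illustrative assumptions.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2_indep  <- rnorm(n)                  # unrelated to x1
x2_collin <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
y_i <- 1 + 2 * x1 + 3 * x2_indep  + rnorm(n)
y_c <- 1 + 2 * x1 + 3 * x2_collin + rnorm(n)
se_i <- summary(lm(y_i ~ x1 + x2_indep ))$coefficients["x1", "Std. Error"]
se_c <- summary(lm(y_c ~ x1 + x2_collin))$coefficients["x1", "Std. Error"]
c(independent = se_i, collinear = se_c)  # the collinear fit has a much larger SE
```

Both models are correctly specified, yet the standard error of \(\hat\beta_1\) is far larger when \(x_2\) nearly duplicates \(x_1\).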
8.4 The analysis of variance
- \(SSE=\sum (y_i-\hat{y}_i)^2\).
- \(s^2=MSE=\frac{SSE}{n-(k+1)}\).
Hypothesis tests:
Definition 8.4 (The Global \(F\)-Test)
- \(H_0\): \(\beta_1=\beta_2=\ldots=\beta_k=0\).
- \(H_a\): at least one of \(\beta_1,\ldots,\beta_k\) is not \(0\).
- \(F=\frac{MSR}{MSE}=\frac{(SST-SSE)/k}{SSE/[n-(k+1)]}\).
- \(p=\Pr(F_{k,\,n-(k+1)}>F)\), where \(F_{k,\,n-(k+1)}\) denotes the \(F\)-distribution with \(k\) and \(n-(k+1)\) degrees of freedom.
Note that since the \(F\)-statistic is always nonnegative, we only need to compute the upper tail.
Definition 8.5 (The individual \(t\)-Test for coefficient \(\beta_i\))
- \(H_0\): \(\beta_i=0\).
- \(H_a\): \(\beta_i\neq 0\).
- \(t=\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}\).
- \(p=2\Pr(T_{n-(k+1)}>\abs{t})\), where \(T_{n-(k+1)}\) follows the \(t\)-distribution with \(n-(k+1)\) degrees of freedom.
Note that \(\hat{\beta}_i\) and \(s_{\hat{\beta}_i}\) come directly from the OLS estimator and its variance-covariance matrix.
- The global \(F\)-test must be passed before proceeding to individual \(t\)-tests. If it fails, stop and modify the model.
- Try to minimize the number of \(t\)-tests: too many \(t\)-tests lead to a high overall Type I error rate.
- When a parameter is not significant, there are some possibilities:
- \(y\) and \(x_i\) have no relationship.
- They have a linear relationship, but a Type II error occurred.
- They have a non-linear relationship.
Example 8.1 (a simple R code example)
In base R, we need the following commands to perform the manual operations:

- `as.matrix` to convert a `data.frame` into a `matrix`.
- `%*%` for matrix multiplication.
- `t` for the transpose of a matrix.
- `solve` to compute the inverse of a matrix.
- `cbind` to combine two matrices horizontally. (`rbind` combines matrices vertically, but we don't need it here.)
Then once you have two data frames dfX and dfY, the following code can be used to compute the OLS estimators.

X <- as.matrix(dfX)
X <- cbind(1, X)
y <- as.matrix(dfY)
beta <- solve(t(X) %*% X) %*% t(X) %*% y
To compute the variance, we should first estimate \(\sigma^2\).
yhat <- X %*% beta
SSE <- sum((y-yhat)^2)
n <- dim(X)[1]
k <- dim(X)[2] - 1
s2 <- SSE/(n-(k+1))
var_beta <- s2 * solve(t(X) %*% X)

We could use diag to extract the diagonal to get the variances, ignoring all the covariance terms, and then apply sqrt to get the standard errors.
se_beta <- sqrt(diag(var_beta))

From here, we could compute the corresponding \(t\)-tests and \(F\)-test, by the formulas \(t=\frac{\hat{\beta}}{s_{\hat{\beta}}}\) and \(F=\frac{MSR}{MSE}\).
t <- beta / se_beta
MSR <- sum((yhat-mean(y))^2) / k
MSE <- s2
F <- MSR/MSE

After we get the statistics, we could conduct the corresponding tests by checking the p-values.
p_t <- pt(abs(t), n-(k+1), lower.tail = FALSE) + pt(-abs(t), n-(k+1))
p_f <- pf(F, k, n-(k+1), lower.tail = FALSE)

8.5 Evaluation
8.5.1 \(R^2\)
Definition 8.6 (The multiple coefficient of determination) \[ R^2=1-\frac{SSE}{SST}. \] \(R^2\) represents the fraction of the sample variation that is explained by the model.
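Definition 8.6 can be checked directly against the value lm() reports; the model and data below are illustrative assumptions:

```r
# Sketch: R^2 computed from its definition, compared with summary();
# the model and data are illustrative assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
SSE <- sum(resid(fit)^2)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
R2 <- 1 - SSE / SST
c(R2, summary(fit)$r.squared)   # identical
```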
8.5.2 \(R_a^2\)
\(R^2\) is a "hard" measure of model performance. However, since adding more predictors never decreases \(R^2\), even if they are useless, we need a way to adjust for the number of predictors. This is the adjusted \(R^2\).
Definition 8.7 (The adjusted multiple coefficient of determination) \[ R^2_a=1-\qty[\frac{(n-1)}{n-(k+1)}]\qty(\frac{SSE}{SST})=1-\qty[\frac{(n-1)}{n-(k+1)}]\qty(1-R^2). \]
- \(R_a^2\leq R^2\).
- For poor-fitting models \(R_a^2\) can be negative.
- The practical interpretation for \(R_a^2\) is the fraction of the variance explained by the model adjusted for the sample size and the number of parameters.
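The formula in Definition 8.7 can likewise be checked against lm(); the model and data are illustrative assumptions:

```r
# Sketch: the adjusted R^2 formula compared with summary();
# the model and data are illustrative assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2                                   # number of predictors
R2  <- summary(fit)$r.squared
R2a <- 1 - ((n - 1) / (n - (k + 1))) * (1 - R2)
c(R2a, summary(fit)$adj.r.squared)       # identical
```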