8 Multiple Linear Regression
\[ \require{physics} \require{braket} \]
\[ \newcommand{\dl}[1]{{\hspace{#1mu}\mathrm d}} \newcommand{\me}{{\mathrm e}} \newcommand{\argmin}{\operatorname{argmin}} \]
\[ \newcommand{\pdfbinom}{{\tt binom}} \newcommand{\pdfbeta}{{\tt beta}} \newcommand{\pdfpois}{{\tt poisson}} \newcommand{\pdfgamma}{{\tt gamma}} \newcommand{\pdfnormal}{{\tt norm}} \newcommand{\pdfexp}{{\tt expon}} \]
\[ \newcommand{\distbinom}{\operatorname{B}} \newcommand{\distbeta}{\operatorname{Beta}} \newcommand{\distgamma}{\operatorname{Gamma}} \newcommand{\distexp}{\operatorname{Exp}} \newcommand{\distpois}{\operatorname{Poisson}} \newcommand{\distnormal}{\operatorname{\mathcal N}} \]
8.1 General form
Definition 8.1 \[ y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k+\varepsilon \] where
- \(y\) is the dependent variable (response variable)
- \(x_1,\ldots,x_k\) are the independent variables (predictors)
- \(E(y)=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k\) is the deterministic portion of the model
- \(\beta_i\) determines the contribution of the independent variable \(x_i\)
A variable is any quantity used to help explain or predict the response. A main effect refers to an original predictor, before any interactions or transformations are applied.
- If \(x_2=x_1^2\), \(x_2\) is called a higher-order term.
- If \(x_3=x_1x_2\), \(x_3\) is called an interaction term.
- If \(x_4=1\) in some cases and \(0\) in others, \(x_4\) is called a coded (dummy) variable. This is an example of a categorical variable in regression.
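In R, a categorical predictor is expanded into such coded columns automatically. A minimal sketch using the built-in iris data (the data set choice is illustrative, not from the text):

```r
# Sketch: how R codes a categorical predictor into 0/1 (dummy) columns;
# iris is a built-in data set used purely for illustration.
M <- model.matrix(~ Species, data = iris)
head(M)   # an intercept column plus one 0/1 column per non-baseline level
```

Each non-baseline level of the factor gets its own \(0\)/\(1\) column, exactly the coded variables described above.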
| Model | Name of the Model |
|---|---|
| \(y = \beta_0 + \beta_1 x_1 + \varepsilon\) | • Simple linear regression • Simple first-order model with one predictor |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\) | • Multiple linear regression with three predictors • First-order model with three predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon\) | • Interaction model with two predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon\) | • Second-order model with one predictor |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon\) | • First-order model with five predictors |
| \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon\) | • Second-order model with two predictors |
The steps are very similar to simple linear regression.
- Collect data.
- Hypothesize the form of the model.
- Estimate the unknown parameters \(\beta_0,\ldots,\beta_k\).
- Specify the distribution of \(\varepsilon\) and estimate its variance \(\sigma^2\).
- Evaluate the utility of the model, e.g., by checking p-values and \(R^2\).
- Verify the model by checking the assumptions on \(\varepsilon\).
- Use the model. Analyze the prediction and make inferences.
\[ y=\underbrace{\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k}_{\text{Deterministic portion of model}}+\underbrace{\varepsilon}_{\text{Random error}}. \]
- For any given \(x_i\), \(\varepsilon\sim \distnormal(0,\sigma^2)\).
- The random errors are independent.
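The steps above can be sketched with R's built-in lm() function. The data set (mtcars) and the chosen predictors below are illustrative stand-ins, not from the text:

```r
# Sketch of the modeling workflow with lm(); data and predictors are
# illustrative assumptions, not from the text.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # hypothesize a first-order model
summary(fit)     # individual t-tests, global F-test, R^2, estimate of sigma
confint(fit)     # confidence intervals for beta_0, beta_1, beta_2
predict(fit, newdata = data.frame(wt = 3, hp = 110))  # use the model
```

The residuals from resid(fit) can then be plotted to verify the assumptions on \(\varepsilon\).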
8.2 A first-order model with quantitative predictors
Definition 8.2 (A First-Order Model in \(k\) Quantitative Independent Variables) \[ E(y)=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_kx_k \] where \(x_1,\ldots,x_k\) are \(k\) quantitative independent variables.
Here independent means that no variable is a function of the other variables in the model.
After inputting the data, we have the following equations.
\[ y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_kx_{ik}+\epsilon_i,\quad i=1,\ldots,n. \]
Writing them in terms of matrices:
\[ \mathbf{y}=\mqty[y_1\\\vdots\\y_n],\quad \mathbf{X}=\mqty[1&x_{11}&x_{12}&\ldots&x_{1k}\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&x_{n1}&x_{n2}&\ldots&x_{nk}],\quad \boldsymbol{\beta}=\mqty[\beta_0\\\beta_1\\\vdots\\\beta_k],\quad \boldsymbol{\epsilon}=\mqty[\epsilon_1\\\vdots\\\epsilon_n]. \]
Here \(X\) is called a design matrix. It is simply the matrix that contains all predictor information:
- Rows: observations
- Columns: variables (including the intercept column of \(1\)'s)
Then the model can be expressed as \[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon} \]
The OLS estimator minimizes the sum of squared residuals \[ (\mathbf y-\mathbf X\boldsymbol{\beta})^T(\mathbf y-\mathbf X\boldsymbol{\beta}). \]
Therefore \[ \boldsymbol{\hat{\beta}}=\argmin_{\beta}(\mathbf y-\mathbf X\boldsymbol{\beta})^T(\mathbf y-\mathbf X\boldsymbol{\beta}). \]
Theorem 8.1 (The OLS estimator) The following estimator is an unbiased estimator of \(\boldsymbol{\beta}\): \[ \boldsymbol{\hat{\beta}}=\mathbf{(X^T X)^{-1}X^T y} \] In addition, \[ \operatorname{Var}(\boldsymbol{\hat{\beta}})=\mathbf{(X^T X)}^{-1}\sigma^2. \]
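A quick numerical check of Theorem 8.1, comparing the closed-form estimator with lm() on R's built-in mtcars data (the data set and predictors are illustrative):

```r
# Sketch: the closed-form OLS estimator of Theorem 8.1 checked against lm();
# mtcars and the chosen predictors are illustrative assumptions.
X <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(mpg ~ wt + hp, data = mtcars)))   # columns agree
```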
When \(k=1\), we have
\[ \mathbf X=\mqty[1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n],\quad \mathbf y=\mqty[y_1\\\vdots\\y_n]. \] Then
\[ \begin{split} \mathbf{X^T X}&=\mqty[1&1&\ldots&1\\x_1&x_2&\ldots&x_n]\mqty[1&x_1\\1&x_2\\\ldots&\\1&x_n]=\mqty[n&n\bar{x}\\n\bar x&\sum x_i^2],\\ \mathbf{X^T y}&=\mqty[1&1&\ldots&1\\x_1&x_2&\ldots&x_n]\mqty[y_1\\\ldots\\y_n]=\mqty[n\bar y\\\sum x_iy_i],\\ \mathbf{(X^T X)^{-1}X^T y}&=\mqty[n&n\bar{x}\\n\bar x&\sum x_i^2]^{-1}\mqty[n\bar y\\\sum x_iy_i]\\ &=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[\sum x_i^2&-n\bar x\\-n\bar x&n]\mqty[n\bar y\\\sum x_iy_i]\\ &=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[n(\sum x_i^2)\bar y-n\bar x(\sum x_iy_i)\\n(\sum x_iy_i)-n^2\bar x\bar y]\\ &=\frac{1}{(\sum x_i^2)-n\bar x^2}\mqty[(\sum x_i^2)\bar y-\bar x(\sum x_iy_i)\\(\sum x_iy_i)-n\bar x\bar y]\\ &=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y+n\bar x^2\bar y-\bar x(\sum x_iy_i)\\\sum x_iy_i-n\bar x\bar y]\\ &=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y-\bar x(\sum x_iy_i-n\bar x\bar y)\\\sum x_iy_i-n\bar x\bar y]. \end{split} \]
Note that
\[ \begin{split} S_{xx}&=\sum (x_i-\bar x)^2=\sum(x_i^2-2x_i\bar x+\bar x^2)=\sum x_i^2-2(\sum x_i)\bar x+n\bar x^2=\sum x_i^2-n\bar x^2,\\ % S_{yy}&=\sum (y_i-\bar y)^2=\sum(y_i^2-2y_i\bar y+\bar y^2)=\sum y_i^2-2(\sum y_i)\bar y+n\bar y^2=\sum y_i^2-n\bar y^2,\\ S_{xy}&=\sum (x_i-\bar x)(y_i-\bar y)=\sum(x_iy_i-x_i\bar y-y_i\bar x+\bar x\bar y)\\ &=\sum x_iy_i-(\sum x_i)\bar y-(\sum y_i)\bar x+n\bar x\bar y=\sum x_iy_i-n\bar x\bar y, \end{split} \]
We have
\[ \begin{split} \mathbf{(X^T X)^{-1}X^T y}&=\frac{1}{\sum x_i^2-n\bar x^2}\mqty[(\sum x_i^2-n\bar x^2)\bar y-\bar x(\sum x_iy_i-n\bar x\bar y)\\\sum x_iy_i-n\bar x\bar y]\\ &=\frac{1}{S_{xx}}\mqty[S_{xx}\bar y-\bar x S_{xy}\\S_{xy}]=\mqty[\bar y-\bar x(S_{xy}/S_{xx})\\S_{xy}/S_{xx}]. \end{split} \]
Therefore in the case of \(k=1\), \(\hat{\beta}_1=S_{xy}/S_{xx}\) and \(\hat{\beta}_0=\bar y-\hat{\beta}_1\bar x\), matching the simple linear regression formulas.
In addition, the variance of \(\hat{\beta}_1\) is the (2,2)-entry of the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\). Since
\[ \begin{split} \operatorname{Var}(\boldsymbol{\hat{\beta}})=\mathbf{(X^T X)}^{-1}\sigma^2=\frac{1}{n(\sum x_i^2)-n^2\bar x^2}\mqty[\sum x_i^2&-n\bar x\\-n\bar x&n]\sigma^2, \end{split} \] the (2,2)-entry is \[ \begin{split} \operatorname{Var}(\hat{\beta_1})&=\frac{n}{n(\sum x_i^2-n\bar x^2)}\sigma^2=\frac{\sigma^2}{S_{xx}}. \end{split} \]
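These \(k=1\) formulas can be verified numerically on R's built-in cars data (the data set choice is illustrative):

```r
# Sketch: numerical check of the k = 1 formulas; the cars data set is an
# illustrative assumption.
x <- cars$speed; y <- cars$dist
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
fit <- lm(dist ~ speed, data = cars)
c(b0, b1)                 # matches coef(fit)
s <- summary(fit)$sigma   # estimate of sigma
s / sqrt(Sxx)             # matches the reported Std. Error of the slope
```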
8.3 Collinearity
As the estimation above shows, correlation among the predictors does not affect the unbiasedness of the OLS estimators; it does, however, affect their variance.
Lemma 8.1 \[ \operatorname{Var}(\hat\beta_j)\propto\frac{1}{1-R^2_j}. \]
Proof.
Let the model be \(y=X\beta+\varepsilon\). Let the j-th predictor column be \(x_j\) and \(X_{-j}\) be the matrix of all other predictor columns. Rewrite the design matrix as \[ X=\mqty[x_j&X_{-j}]. \]
Then \[ X^TX=\mqty[x_j^Tx_j&x_j^TX_{-j}\\X^T_{-j}x_j&X^T_{-j}X_{-j}]. \]
Consider \[ \mqty[A&B\\C&D]=\underbrace{\mqty[I&BD^{-1}\\0&I]}_{(I)}\underbrace{\mqty[A-BD^{-1}C&0\\C&D]}_{(II)}. \] Let \(A-BD^{-1}C=S\). Then \[ (I)^{-1}=\mqty[I&-BD^{-1}\\0&I],\quad (II)^{-1}=\mqty[S^{-1}&0\\-D^{-1}CS^{-1}&D^{-1}], \] so \[ \mqty[A&B\\C&D]^{-1}=(II)^{-1}(I)^{-1}=\mqty[S^{-1}&0\\-D^{-1}CS^{-1}&D^{-1}]\mqty[I&-BD^{-1}\\0&I]=\mqty[S^{-1}&-S^{-1}BD^{-1}\\-D^{-1}CS^{-1}&D^{-1}(CS^{-1}BD^{-1}+I)] \] Therefore the upper-left corner of the inverse matrix is \(S^{-1}=(A-BD^{-1}C)^{-1}\). Then \[ (X^TX)^{-1}_{jj}=(x_j^Tx_j-x^T_jX_{-j}(X^T_{-j}X_{-j})^{-1}X^T_{-j}x_j)^{-1}. \] Let \(P_j=X_{-j}(X^T_{-j}X_{-j})^{-1}X^T_{-j}\). We have \[ \begin{split} \operatorname{Var}(\hat\beta_j)&=\mqty[(X^TX)^{-1}\sigma^2]_{jj}=(X^TX)^{-1}_{jj}\sigma^2\\ &=\frac{\sigma^2}{x_j^Tx_j-x^T_jP_jx_j}=\frac{\sigma^2}{x_j^T(I-P_j)x_j}. \end{split} \]
On the other hand, consider regressing \(x_j\) on the other predictors, \(x_j=X_{-j}\boldsymbol\gamma+\varepsilon_j\). The residual sum of squares of this auxiliary regression is \(SSE_j=x_j^T(I-P_j)x_j\). Since \(1-\frac{SSE_j}{SST_j}=R_j^2\), we have \[ x_j^T(I-P_j)x_j = SST_j(1-R_j^2). \] So \[ \begin{split} \operatorname{Var}(\hat\beta_j)&=\frac{\sigma^2}{x_j^T(I-P_j)x_j}=\frac{\sigma^2}{SST_j}\frac{1}{1-R_j^2}\propto\frac{1}{1-R_j^2}. \end{split} \]
Definition 8.3 (The variance inflation factor) The variance inflation factor for the jth coefficient is \[ VIF_j=\frac{1}{1-R^2_j} \] where \(R_j^2\) is the \(R^2\) of the regression of \(x_j\) on the other predictors \(x_i\).
In other words, we try to explain \(x_j\) using the other variables. From the proof of Lemma 8.1, this is the auxiliary regression of \(x_j\) on \(X_{-j}\). If it fits well, \(R_j^2\) is close to \(1\); if not, \(R_j^2\) is close to \(0\). Accordingly, if one variable is collinear with the others, its VIF approaches \(\infty\); if it is unrelated to them, its VIF is close to \(1\).
By Lemma 8.1, when \(R_j^2\) is large, \(VIF_j\) is large. In other words, if one variable depends heavily on the others (so \(R_j^2\) is high), the variance of \(\hat\beta_j\) is also very high. The standard error of \(\hat\beta_j\) is then large, the corresponding t-statistic is small, and the p-value is large, so the variable may well appear non-significant. In short, when there is redundant information, an individual variable can be non-significant even though the group as a whole is significant.
Usually if VIF is more than 5 or 10, multicollinearity may be a problem.
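A sketch of computing \(VIF_j\) by hand, regressing each predictor on the others; the mtcars predictors are illustrative assumptions (the car package's vif() returns the same quantities):

```r
# Sketch: VIF_j computed by hand via the auxiliary regressions;
# the mtcars predictors are illustrative assumptions.
r2 <- function(f) summary(f)$r.squared
vif_wt   <- 1 / (1 - r2(lm(wt ~ hp + disp, data = mtcars)))
vif_hp   <- 1 / (1 - r2(lm(hp ~ wt + disp, data = mtcars)))
vif_disp <- 1 / (1 - r2(lm(disp ~ wt + hp, data = mtcars)))
c(wt = vif_wt, hp = vif_hp, disp = vif_disp)  # wt and disp are strongly correlated
```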
Also the interpretation of coefficients breaks down. The basic interpretation of \(\beta_j\) is that "holding the others fixed, a 1-unit increase in \(x_j\) changes \(y\) by \(\beta_j\)". When strong multicollinearity exists, the interpretation "holding other variables fixed" becomes unrealistic because the predictors cannot vary independently in the observed data. As a result, individual regression coefficients no longer represent meaningful causal or marginal effects, even if the model still predicts well.
In MLR, multicollinearity among the independent variables does not affect the unbiasedness of the OLS estimators. In other words, even if the independent variables are correlated, the OLS estimator is still BLUE (the Best Linear Unbiased Estimator) of \(\boldsymbol{\beta}\), as long as the collinearity is not perfect and the other standard assumptions hold. However, the variance and the stability of the estimates may be affected, which in turn affects later inference.
- Multicollinearity inflates the variance of the estimated coefficients.
- Even if a variable is genuinely important, its coefficient might not be statistically significant due to high standard errors.
- The model might struggle to separate the effects of correlated predictors.
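A tiny simulation illustrating these effects; all settings (sample size, coefficients, noise levels) are illustrative assumptions:

```r
# Sketch: simulation showing how near-collinearity inflates the standard
# error of a coefficient; all settings are illustrative assumptions.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2_indep  <- rnorm(n)                  # unrelated to x1
x2_collin <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
y_i <- 1 + 2 * x1 + 3 * x2_indep  + rnorm(n)
y_c <- 1 + 2 * x1 + 3 * x2_collin + rnorm(n)
se_i <- summary(lm(y_i ~ x1 + x2_indep ))$coefficients["x1", "Std. Error"]
se_c <- summary(lm(y_c ~ x1 + x2_collin))$coefficients["x1", "Std. Error"]
c(independent = se_i, collinear = se_c)  # the collinear fit has a much larger SE
```

Both models are correctly specified, yet the standard error of \(\hat\beta_1\) is far larger when \(x_2\) nearly duplicates \(x_1\).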
8.4 The analysis of variance
- \(SSE=\sum (y_i-\hat{y}_i)^2\).
- \(s^2=MSE=\frac{SSE}{n-(k+1)}\).
Hypothesis tests:
Definition 8.4 (The Global \(F\)-Test)
- \(H_0\): \(\beta_1=\beta_2=\ldots=\beta_k=0\).
- \(H_a\): at least one of \(\beta_1,\ldots,\beta_k\) is not \(0\).
- \(F=\frac{MSR}{MSE}=\frac{(SST-SSE)/k}{SSE/[n-(k+1)]}\).
- \(p=\Pr(F_{k,\,n-(k+1)}>F)\), where \(F_{k,\,n-(k+1)}\) denotes the \(F\)-distribution with \(k\) and \(n-(k+1)\) degrees of freedom.
Note that since the \(F\)-statistic is always nonnegative, we only need to compute the upper tail.
Definition 8.5 (The individual \(t\)-Test for coefficient \(\beta_i\))
- \(H_0\): \(\beta_i=0\).
- \(H_a\): \(\beta_i\neq 0\).
- \(t=\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}\).
- \(p=2\Pr(T_{n-(k+1)}>\abs{t})\), where \(T_{n-(k+1)}\) follows the \(t\)-distribution with \(n-(k+1)\) degrees of freedom.
Note that \(\hat{\beta}_i\) and \(s_{\hat{\beta}_i}\) come directly from the OLS estimator and its variance-covariance matrix.
- The global \(F\)-test must be passed before proceeding to individual \(t\)-tests. If it fails, stop and modify the model.
- Try to minimize the number of \(t\)-tests: too many \(t\)-tests lead to a high overall Type I error rate.
- When a parameter is not significant, there are some possibilities:
- \(y\) and \(x_i\) have no relationship.
- They have a linear relationship, but a Type II error occurred.
- They have a non-linear relationship.
Example 8.1 (a simple R code example)
In base R, we need the following commands to perform the manual operations:

- `as.matrix` to convert a `data.frame` into a `matrix`.
- `%*%` for matrix multiplication.
- `t` for the transpose of a matrix.
- `solve` to compute the inverse of a matrix.
- `cbind` to combine two matrices horizontally. (`rbind` combines matrices vertically, but we don't need it here.)
Then once you have two data frames dfX and dfY, the following code can be used to compute the OLS estimators.

X <- as.matrix(dfX)
X <- cbind(1, X)
y <- as.matrix(dfY)
beta <- solve(t(X) %*% X) %*% t(X) %*% y
To compute the variance, we should first estimate \(\sigma^2\).
yhat <- X %*% beta
SSE <- sum((y-yhat)^2)
n <- dim(X)[1]
k <- dim(X)[2] - 1
s2 <- SSE/(n-(k+1))
var_beta <- s2 * solve(t(X) %*% X)

We could use diag to extract the diagonal to get the variances, ignoring all the covariance terms, and then apply sqrt to get the standard errors.
se_beta <- sqrt(diag(var_beta))

From here, we could compute the corresponding \(t\)-tests and \(F\)-test, by the formulas \(t=\frac{\hat{\beta}}{s_{\hat{\beta}}}\) and \(F=\frac{MSR}{MSE}\).
t <- beta / se_beta
MSR <- sum((yhat-mean(y))^2) / k
MSE <- s2
F <- MSR/MSE

After we get the statistics, we could conduct the corresponding tests by checking the p-values.
p_t <- pt(abs(t), n-(k+1), lower.tail = FALSE) + pt(-abs(t), n-(k+1))
p_f <- pf(F, k, n-(k+1), lower.tail = FALSE)

8.5 Evaluation
8.5.1 \(R^2\)
Definition 8.6 (The multiple coefficient of determination) \[ R^2=1-\frac{SSE}{SST}. \] \(R^2\) represents the fraction of the sample variation that is explained by the model.
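Definition 8.6 can be checked directly against the value lm() reports; the model and data below are illustrative assumptions:

```r
# Sketch: R^2 computed from its definition, compared with summary();
# the model and data are illustrative assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
SSE <- sum(resid(fit)^2)
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
R2 <- 1 - SSE / SST
c(R2, summary(fit)$r.squared)   # identical
```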
8.5.2 \(R_a^2\)
\(R^2\) is a "hard" measure of model performance. However, since adding more predictors never decreases \(R^2\), even if they are useless, we need a way to adjust for the number of predictors. This is the adjusted \(R^2\).
Definition 8.7 (The adjusted multiple coefficient of determination) \[ R^2_a=1-\qty[\frac{(n-1)}{n-(k+1)}]\qty(\frac{SSE}{SST})=1-\qty[\frac{(n-1)}{n-(k+1)}]\qty(1-R^2). \]
- \(R_a^2\leq R^2\).
- For poor-fitting models \(R_a^2\) can be negative.
- The practical interpretation for \(R_a^2\) is the fraction of the variance explained by the model adjusted for the sample size and the number of parameters.
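The formula in Definition 8.7 can likewise be checked against lm(); the model and data are illustrative assumptions:

```r
# Sketch: the adjusted R^2 formula compared with summary();
# the model and data are illustrative assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2                                   # number of predictors
R2  <- summary(fit)$r.squared
R2a <- 1 - ((n - 1) / (n - (k + 1))) * (1 - R2)
c(R2a, summary(fit)$adj.r.squared)       # identical
```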