ANOVA stands for Analysis of Variance. It is a fundamental diagnostic and inferential tool in regression analysis. The basic idea of ANOVA is to use F tests to assess whether a model or components of a model explains a statistically significant amount of variability in the response variable.
In simple linear regression, the ANOVA table is unique and unambiguous, because there is only one predictor and hence only one way to attribute explained variability. In multiple linear regression, however, predictors may be correlated, and their contributions to the model are no longer uniquely defined. As a result, different conventions have been developed to allocate the explained variability among predictors. These conventions lead to different types of ANOVA tables, commonly referred to as Type I, Type II, and Type III ANOVA. The three types differ in how they account for the presence of other predictors and, when applicable, interaction terms in the model.
More generally, ANOVA is a framework for testing whether a model explains a non-trivial amount of variation in a response variable. Conceptually, it asks whether the reduction in unexplained variability achieved by adding terms to a model is large relative to random noise. Operationally, it answers this by comparing mean squares (variance estimates) using an F-statistic.
ANOVA is built on orthogonal projection in the data space. Fitting a model corresponds to projecting the response vector \(y\) onto the model subspace, which yields the fundamental decomposition of total variability into explained and unexplained components:
\[
\underbrace{\lVert y-\bar y\rVert^2}_{\text{SST}}=\underbrace{\lVert \hat y-\bar y\rVert^2}_{\text{SSR}}+\underbrace{\lVert y-\hat y\rVert^2}_{\text{SSE}}
\] This identity is the mathematical foundation of all ANOVA tables.
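The identity can be checked numerically; a minimal R sketch with simulated data (all values here are illustrative):

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)              # simulated response

fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)            # total variability
SSR <- sum((fitted(fit) - mean(y))^2)  # explained by the model
SSE <- sum(resid(fit)^2)               # left unexplained
all.equal(SST, SSR + SSE)              # TRUE: the decomposition holds
```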
9.1 Nested ANOVA
This is a test for comparing nested models.
Definition 9.1 (Nested models) Two models are nested if one model contains all the terms of the other, plus at least one additional term.
The more complex model is called the complete (or full) model.
The simpler model is called the reduced (or restricted) model.
Here the main question is whether the additional terms are really necessary. We use a hypothesis test to answer this question.
\(H_0\): \(\beta_{g+1}=\cdots=\beta_k=0\) (the reduced model is adequate).
\(H_a\): at least one of \(\beta_{g+1},\ldots,\beta_k\) is nonzero.
\(\displaystyle F=\frac{(SSE_R-SSE_C)/(k-g)}{SSE_C/[n-(k+1)]}\), where \(k\) is the number of full model predictors, and \(g\) is the number of reduced model predictors.
Note that the denominator above, \(SSE_C/[n-(k+1)]\), equals the \(MSE\) of the complete model.
The test is usually done with the nested ANOVA table.
Analysis of Variance Table
Model 1: y ~ x1
Model 2: y ~ x1 + x2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 184.01
2 97 90.86 1 93.148 99.443 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this table, each row corresponds to a model: the first row is the reduced model and the second row is the full model.
For each model, Res.Df is the residual degrees of freedom, \(n-(p+1)\), where \(p\) is the number of predictors.
In this example, Res.Df is \(100-(1+1)=98\) and \(100-(2+1)=97\).
The difference in residual degrees of freedom is \(98-97=1\).
The difference in \(SSE\) is \(184.0084-90.8599=93.1484\).
The F-statistic is then \((93.1484/1)/(90.8599/97)=99.4431\).
The p-value is obtained from the \(F(1,97)\) distribution.
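The F-statistic and p-value can be reproduced by hand in R from the table entries (rounded values, so the result only approximates the printed output):

```r
SSE_R <- 184.01   # residual SS of the reduced model (from the table)
SSE_C <- 90.86    # residual SS of the complete model
df_num <- 98 - 97 # difference in residual degrees of freedom, k - g
df_den <- 97      # residual df of the complete model, n - (k + 1)

F_stat <- ((SSE_R - SSE_C) / df_num) / (SSE_C / df_den)
p_val  <- pf(F_stat, df_num, df_den, lower.tail = FALSE)
round(F_stat, 2)  # about 99.44
```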
Note
This nested model is the basis of all ANOVA tables listed below. All ANOVA tables in regression are based on comparing nested models. Every row in an ANOVA table corresponds to testing whether a set of parameters can be removed from a larger model.
9.2 ANOVA table for linear regression
This is the ANOVA table introduced in the previous lectures. Its main purpose is to display the decomposition of variability and to compute the F-statistic together with the corresponding p-value.
\[
F=\frac{MSR}{MSE}=\frac{SSR/p}{SSE/(n-p-1)}.
\]
This is the overall F-test: it compares the variance explained per model degree of freedom with the variance left unexplained per residual degree of freedom, and it tests the whole group of predictors at once.
9.3 Type I ANOVA table
Type I ANOVA Table (Sequential Sum of Squares) decomposes the total variation in the response by adding predictors to the model sequentially, one at a time, in the order they appear in the model formula.
Suppose we fit the linear model \[
y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_px_p+\varepsilon.
\]
The Type I ANOVA table reports \[
SS(x_j\mid x_1,\ldots, x_{j-1})=SSE(\text{model with }x_1,\ldots,x_{j-1})-SSE(\text{model with }x_1,\ldots,x_j).
\]
That is, each sum of squares measures the reduction in unexplained variability obtained by adding \(x_j\) to a model that contains the preceding predictors.
| Source | Degrees of Freedom | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| \(x_1\) | 1 | \(SS(x_1)\) | \(MS(x_1)\) | \(F_1\) |
| \(x_2\) | 1 | \(SS(x_2 \mid x_1)\) | \(MS(x_2 \mid x_1)\) | \(F_2\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Residuals | \(n-p-1\) | SSE | MSE | |
Each F-statistic tests
\[
H_0: \beta_j=0\quad \text{ given that $x_1,\ldots,x_{j-1}$ are already in the model.}
\]
The corresponding F-statistic is computed by \[
F_j=\frac{MS(x_j)}{MSE}.
\]
In summary, the Type I ANOVA table answers the question:
How much additional variation does this variable explain when added at this point in the model?
Tip
Type I ANOVA table depends on the order of variables.
Each row corresponds to a nested-model comparison.
Equivalent to anova(model1, model2) for successive models.
Matches classical ANOVA in balanced designs.
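The equivalence to successive nested-model comparisons can be verified directly in base R; a small sketch with simulated data (names are illustrative):

```r
set.seed(3)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)

seq_tab  <- anova(lm(y ~ x1 + x2))              # Type I table
step_cmp <- anova(lm(y ~ x1), lm(y ~ x1 + x2))  # nested comparison for x2

# the x2 row of the Type I table reproduces the nested-model test
all.equal(seq_tab["x2", "Sum Sq"], step_cmp[2, "Sum of Sq"])
all.equal(seq_tab["x2", "F value"], step_cmp[2, "F"])
```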
Example 9.2 We first generate a dataset with three correlated predictors \(x_1\), \(x_2\), and \(x_3\). The correlation structure is intentional, so that the sequential nature of Type I ANOVA is visible.
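The data-generating code is not shown; a sketch of one possible construction (the seed and coefficients below are assumptions, so the ANOVA output that follows comes from the author's data, not from this sketch):

```r
set.seed(2024)                       # assumed seed; the original is not shown
n  <- 80                             # matches the 76 residual df below
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.6)  # x2 correlated with x1
x3 <- 0.5 * x1 + 0.4 * x2 + rnorm(n, sd = 0.6)  # x3 correlated with both
y  <- 1 + 2 * x1 + 1.5 * x2 + x3 + rnorm(n)
```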
Now fit the model and show the Type I ANOVA table. For the Type I ANOVA table, we can use the built-in R function anova().
model123 <- lm(y ~ x1 + x2 + x3)
anova(model123)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 1336.61 1336.61 1333.29 < 2.2e-16 ***
x2 1 275.19 275.19 274.51 < 2.2e-16 ***
x3 1 115.49 115.49 115.20 < 2.2e-16 ***
Residuals 76 76.19 1.00
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because Type I ANOVA is sequential, each row corresponds to a nested-model F test.
\(x_1\): Tests adding \(x_1\) to the intercept-only model. Since the p-value is small, we conclude that \(x_1\) explains a significant amount of variation in \(y\).
\(x_2\): Tests adding \(x_2\) to a model that already contains \(x_1\). Since the p-value is small, we conclude that \(x_2\) contributes information beyond \(x_1\).
\(x_3\): Tests adding \(x_3\) to the model with \(x_1\) and \(x_2\). Since the p-value is small, we conclude that \(x_3\) explains additional variation not already captured by \(x_1\) and \(x_2\).
Example 9.3 We now illustrate an example in which the order of the variables matters. We generate a dataset with two highly correlated variables \(x_1\) and \(x_2\), where the response \(y\) is constructed to depend only on \(x_1\).
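The generating code and the first fit are not shown; a hedged sketch of what they might look like (the name model12, the seed, and the coefficients are assumptions; the output below is the author's):

```r
set.seed(42)                      # assumed seed; the original is not shown
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)     # x2 highly correlated with x1
y  <- 1 + 2.5 * x1 + rnorm(n)     # y depends only on x1

model12 <- lm(y ~ x1 + x2)        # hypothetical model name
anova(model12)
```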
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 677.37 677.37 748.5025 <2e-16 ***
x2 1 0.05 0.05 0.0579 0.8104
Residuals 97 87.78 0.90
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When \(x_1\) is added first, it explains most of the variation in \(y\). Because \(x_2\) is largely redundant with \(x_1\), adding \(x_2\) afterward does not produce a significant additional reduction in the residual sum of squares. As a result, \(x_1\) is significant, while \(x_2\) is not.
model21 <- lm(y ~ x2 + x1)
anova(model21)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 650.19 650.19 718.462 < 2.2e-16 ***
x1 1 27.24 27.24 30.098 3.272e-07 ***
Residuals 97 87.78 0.90
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When \(x_2\) is added first, it captures much of the variation in \(y\) due to its strong correlation with \(x_1\), and therefore appears significant. However, because \(x_2\) does not fully explain \(x_1\), there remains variation in \(y\) that is uniquely attributable to \(x_1\). Consequently, \(x_1\) is still significant even after \(x_2\) has been included in the model.
9.4 Type II ANOVA table
A Type II ANOVA table (marginal sum of squares) tests each term after adjusting for all other main effects, but not for interaction terms that contain that effect. In practice, the sums of squares for main effects are computed from the additive model (without interactions), while the interaction terms are tested by comparing the additive model with the model that includes the interaction.
Suppose the full model is
\[
y \sim x_1 + x_2 + x_1:x_2 .
\]
9.4.1 Testing a main effect
To test a main effect \(x_j\), the Type II sum of squares is obtained from the additive model \(y \sim x_1 + x_2\):
\[
SS^{(II)}(x_j)=SSE(\text{additive model without }x_j)-SSE(\text{additive model}).
\]
That is, it measures the unique contribution of \(x_j\) after accounting for all other main effects.
9.4.2 Testing the interaction
The interaction is tested by comparing the additive model with the model that includes the interaction:
\[
SS^{(II)}(x_1{:}x_2)=SSE(y \sim x_1+x_2)-SSE(y \sim x_1+x_2+x_1{:}x_2).
\]
9.4.3 F-test
| Source | Degrees of Freedom | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| \(x_1\) | \(1\) | \(SS^{(II)}(x_1)\) | \(MS^{(II)}(x_1)\) | \(F_1\) |
| \(x_2\) | \(1\) | \(SS^{(II)}(x_2)\) | \(MS^{(II)}(x_2)\) | \(F_2\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Residuals | \(n - p - 1\) | \(\mathrm{SSE}\) | \(\mathrm{MSE}\) | |
Each F-statistic tests \[
H_0: \beta_j=0\quad\text{ adjusting for all other main effect predictors.}
\]
The corresponding F-statistic is \[
F_j^{(II)}=\frac{MS^{(II)}(x_j)}{MSE}=\frac{SS^{(II)}(x_j)/df_j}{SSE/(n-p-1)}.
\]
Type II ANOVA table asks the following question:
Does this term improve the model after adjusting for the other main effects (when ignoring the interaction)?
Tip
Reordering predictors does not change the table.
Each effect is tested conditional on all other main effects.
If an interaction involving a main effect is present, the Type II test for that main effect (computed from the additive model) can be difficult to interpret; in practice you either (i) interpret via simple effects, or (ii) use Type III with a clear coding convention.
Each row corresponds to a nested-model comparison.
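These nested-model comparisons can be written out explicitly; a minimal base-R sketch for the additive case (simulated data, illustrative names):

```r
set.seed(7)
n  <- 60
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)      # correlated predictors
y  <- 1 + x1 + x2 + rnorm(n)

sse <- function(fit) sum(resid(fit)^2)
# Type II SS for x1: drop x1 from the additive model
SS2_x1 <- sse(lm(y ~ x2)) - sse(lm(y ~ x1 + x2))
# the same quantity appears as the last row of a Type I table with x1 entered last
anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]
```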
Example 9.4
We first generate a dataset with two correlated predictors \(x_1\) and \(x_2\).
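The generating code is omitted; one possible sketch (the seed and coefficients are assumptions, so the Anova output below comes from the author's data, not from this sketch):

```r
set.seed(11)                          # assumed seed; the original is not shown
n  <- 80                              # matches the 77 residual df below
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n, sd = 0.8)   # x2 correlated with x1
y  <- 1 + x1 + 0.8 * x2 + rnorm(n)
```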
For the Type II ANOVA table, we can use the function Anova() from the car package.
library(car)
Loading required package: carData
model <- lm(y ~ x1 + x2)
Anova(model, type = 2)
Anova Table (Type II tests)
Response: y
Sum Sq Df F value Pr(>F)
x1 71.255 1 77.566 2.823e-13 ***
x2 37.255 1 40.554 1.271e-08 ***
Residuals 70.735 77
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(x_1\): Compares the full model \(y \sim x_1 + x_2\) to the reduced model \(y \sim x_2\) (i.e., the model with \(x_1\) removed). A small p-value implies that \(x_1\) explains a significant amount of variation in \(y\) beyond what is explained by \(x_2\).
\(x_2\): Compares the full model \(y \sim x_1 + x_2\) to the reduced model \(y \sim x_1\) (i.e., the model with \(x_2\) removed). A small p-value implies that \(x_2\) explains a significant amount of variation in \(y\) beyond what is explained by \(x_1\).
Unlike Type I ANOVA, Type II ANOVA is order-invariant—reordering \(x_1\) and \(x_2\) does not change the results.
9.5 Type III ANOVA Table
A Type III ANOVA table (Fully Adjusted / Coefficient-Based Tests) tests each model term in the presence of all other terms. Equivalently, for a given term, it compares the full model to a reduced model obtained by setting the coefficients associated with that term to zero while keeping all other terms in the model. Unlike Type I ANOVA, the allocation of variability does not depend on the order of predictors in the model.
Suppose we fit the linear model \[
y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_px_p+\varepsilon.
\]
The Type III ANOVA table reports, for each predictor \(x_j\),
\[
SS^{(III)}(x_j)=SSE(\text{full model with }\beta_j=0) - SSE(\text{full model}).
\] That is, each sum of squares measures the increase in unexplained variability that results from constraining the coefficient \(\beta_j\) to be zero while keeping all other predictors in the model.
| Source | Degrees of Freedom | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| \(x_1\) | 1 | \(SS^{(III)}(x_1)\) | \(MS^{(III)}(x_1)\) | \(F_1\) |
| \(x_2\) | 1 | \(SS^{(III)}(x_2)\) | \(MS^{(III)}(x_2)\) | \(F_2\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| Residuals | \(n-p-1\) | SSE | MSE | |
Each F-statistic tests \[
H_0: \beta_j=0 \quad \text{ given that all other predictors are in the model}.
\]
The corresponding F-statistic is computed by \[
F_j=\frac{MS^{(III)}(x_j)}{MSE}.
\] In the case of a single degree of freedom per predictor, this test is equivalent to the square of the \(t\)-test for the coefficient \(\beta_j\).
In summary, the Type III ANOVA table answers the question:
Is the coefficient of this main effect nonzero in the full model, when all other effects (including interaction terms) are present in the model?
Note
Type III vs t-tests
For a 1-degree-of-freedom term (e.g., a single continuous predictor), the Type III F-test is equivalent to the squared t-test for the corresponding coefficient: \[
F^{(III)}=t^2.
\] For multi-df terms (e.g., a factor with multiple levels), Type III tests the whole set of coefficients for that term simultaneously.
Type II tests main effects after adjusting for the other main effects, and (when interactions exist) it typically tests main effects in the additive model to preserve hierarchy. Interactions are tested by comparing the additive model to the interaction model.
Type III tests each term within the full model, even if that term participates in interactions. This can yield hypotheses that are sensitive to coding/centering and may be harder to interpret scientifically.
When there are no interactions, and predictors are coded in the usual way, Type II and Type III coincide (they reduce to the standard partial F-tests).
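The relation \(F^{(III)}=t^2\) is easy to check in base R: for an additive model with 1-df terms, drop1(fit, test = "F") produces the corresponding partial F-tests (simulated data, illustrative values):

```r
set.seed(5)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

t_x1 <- summary(fit)$coefficients["x1", "t value"]  # t-test for beta_1
F_x1 <- drop1(fit, test = "F")["x1", "F value"]     # partial F for x1
all.equal(F_x1, t_x1^2)    # TRUE: the partial F equals the squared t
```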