\[ \require{physics} \require{braket} \]
\[ \newcommand{\dl}[1]{{\hspace{#1mu}\mathrm d}} \newcommand{\me}{{\mathrm e}} \newcommand{\argmin}{\operatorname{argmin}} \]
\[ \newcommand{\pdfbinom}{{\tt binom}} \newcommand{\pdfbeta}{{\tt beta}} \newcommand{\pdfpois}{{\tt poisson}} \newcommand{\pdfgamma}{{\tt gamma}} \newcommand{\pdfnormal}{{\tt norm}} \newcommand{\pdfexp}{{\tt expon}} \]
\[ \newcommand{\distbinom}{\operatorname{B}} \newcommand{\distbeta}{\operatorname{Beta}} \newcommand{\distgamma}{\operatorname{Gamma}} \newcommand{\distexp}{\operatorname{Exp}} \newcommand{\distpois}{\operatorname{Poisson}} \newcommand{\distnormal}{\operatorname{\mathcal N}} \newcommand{\distunif}{\operatorname{Unif}} \]
When there are qualitative variables, the whole regression framework stays almost the same. The basic idea is still
\[ y=\text{systematic part}+\varepsilon. \]
The main difference is that a categorical predictor cannot be inserted directly as a number. Instead, we represent group membership by indicator variables (also called dummy variables), and then fit the model using the same least squares machinery as before.
Suppose a predictor has \(g\) categories. After choosing one category as the baseline, we create \(g-1\) dummy variables:
\[ x_j= \begin{cases} 1,& \text{if the observation is in category }j,\\ 0,& \text{otherwise.} \end{cases} \]
Then the categorical regression model has the form
\[ y=\beta_0+\beta_1x_1+\cdots+\beta_{g-1}x_{g-1}+\varepsilon. \]
Here:

- \(\beta_0\) is the mean response in the baseline category;
- \(\beta_j\), for \(j=1,\dots,g-1\), is the difference in mean response between category \(j\) and the baseline.

Therefore it is just ordinary linear regression with specially coded predictors.
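As a quick illustration (a minimal sketch; the factor `g` and its values below are made up), `model.matrix` in R shows exactly this coding:

```r
# Dummy coding of a three-level factor; the first level is the baseline
g <- factor(c("A", "B", "C", "A", "B"))
model.matrix(~ g)
# Columns: (Intercept), gB, gC -- gB and gC are the indicator variables,
# and level A (the baseline) is absorbed into the intercept
```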
Suppose a factor has three groups: A, B, and C. Take A as the baseline, and define
\[ x_1=I(\text{group B}),\quad x_2=I(\text{group C}). \]
Then the model is
\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\varepsilon. \]
The fitted means are
\[ \begin{aligned} E(y\mid A)&=\beta_0,\\ E(y\mid B)&=\beta_0+\beta_1,\\ E(y\mid C)&=\beta_0+\beta_2. \end{aligned} \]
Therefore:

- \(\beta_0\) is the mean of group A (the baseline);
- \(\beta_1\) is the difference in means between group B and group A;
- \(\beta_2\) is the difference in means between group C and group A.
This is the simplest form of categorical regression.
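This correspondence is easy to check numerically (a sketch with made-up data; the group sizes, means, and seed are arbitrary):

```r
# With treatment coding, lm() recovers the baseline mean and mean differences
set.seed(42)  # arbitrary seed, for reproducibility
y <- c(rnorm(4, 10), rnorm(4, 12), rnorm(4, 15))
grp <- factor(rep(c("A", "B", "C"), each = 4))
fit <- lm(y ~ grp)
coef(fit)["(Intercept)"]  # equals mean(y[grp == "A"])
coef(fit)["grpB"]         # equals mean(y[grp == "B"]) - mean(y[grp == "A"])
coef(fit)["grpC"]         # equals mean(y[grp == "C"]) - mean(y[grp == "A"])
```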
One-way ANOVA (analysis of variance) is a classical method for comparing the means of several populations. Suppose we observe a quantitative response variable across \(k\) independent groups (or treatments). The primary question is whether the population means are equal:
\[ H_0:\ \mu_1=\mu_2=\cdots=\mu_k \qquad\text{vs}\qquad H_a:\ \text{not all }\mu_i\text{ are equal}. \]
The traditional one-way ANOVA model is
\[ y_{ij}=\mu+\tau_i+\varepsilon_{ij},\qquad i=1,\ldots,k,\ j=1,\ldots,n_i, \]
where

- \(\mu\) is the overall (grand) mean;
- \(\tau_i\) is the effect of the \(i\)-th treatment (group);
- \(\varepsilon_{ij}\) are independent random errors, typically assumed \(\varepsilon_{ij}\sim\distnormal(0,\sigma^2)\).
ANOVA partitions the total variability into between-group and within-group components and uses an \(F\)-test to determine whether the group means differ significantly.
Suppose we observe
\[ y_{ij},\quad i=1,\dots,k,\ j=1,\dots,n_i, \]
where:

- \(y_{ij}\) is the \(j\)-th observation in group \(i\);
- \(n_i\) is the size of group \(i\), and \(N=\sum_{i=1}^k n_i\) is the total sample size.

Define:

\[ \begin{aligned} \bar y_i&=\frac{1}{n_i}\sum_{j=1}^{n_i}y_{ij} &&\text{(group means)},\\ \bar y&=\frac{1}{N}\sum_{i=1}^k\sum_{j=1}^{n_i}y_{ij} &&\text{(grand mean)},\\ SSB&=\sum_{i=1}^k n_i(\bar y_i-\bar y)^2 &&\text{(between-group sum of squares)},\\ SSW&=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\bar y_i)^2 &&\text{(within-group sum of squares)},\\ SST&=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\bar y)^2 &&\text{(total sum of squares)}. \end{aligned} \]
All values mentioned above are summarized in the following standard one-way ANOVA table.
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between (Groups) | \(SSB\) | \(k-1\) | \(\frac{SSB}{k-1}\) | \(\frac{MSB}{MSW}\) |
| Within (Error) | \(SSW\) | \(N-k\) | \(\frac{SSW}{N-k}\) | |
| Total | \(SST\) | \(N-1\) | | |
Similar to all the ANOVA tables we mentioned before, we have the decomposition of variance \(SST=SSB+SSW\).
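The decomposition is easy to verify numerically (a sketch; the simulated data and group structure below are arbitrary):

```r
# Verifying SST = SSB + SSW on simulated data
set.seed(1)                  # arbitrary seed
y <- rnorm(12)
g <- rep(1:3, each = 4)      # 3 groups of size 4
ybar  <- mean(y)             # grand mean
means <- tapply(y, g, mean)  # group means
SSB <- sum(4 * (means - ybar)^2)
SSW <- sum((y - means[g])^2)
SST <- sum((y - ybar)^2)
all.equal(SSB + SSW, SST)    # TRUE, up to floating-point error
```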
Example 10.1 Scores:
Group means:
\[ \bar y_A=81.25,\quad \bar y_B=87.5,\quad \bar y_C=72.75,\quad \bar y=80.5. \]
\[ \begin{split} SSB&=4(81.25-80.5)^2+4(87.5-80.5)^2+4(72.75-80.5)^2=434,\\ MSB&=\frac{434}{2}=217,\\ SSW&=26.75+13+14.75=54.5,\\ MSW&=\frac{54.5}{9}=6.06. \end{split} \] Then \[ F=\frac{MSB}{MSW}=\frac{217}{6.06}\approx 35.81. \] Under the null hypothesis \(H_0:\mu_A=\mu_B=\mu_C\), we have
\[ F\sim F_{2,9}. \]
So the p-value is
\[ p=P(F_{2,9}\ge 35.81)\approx 5.4\times 10^{-5}. \]
Since the p-value is much smaller than \(0.05\), we reject \(H_0\) and conclude that not all group means are equal.
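The arithmetic above can be reproduced directly in R (a sketch; only the summary quantities already given in the example are used):

```r
# F statistic and p-value for Example 10.1
SSB <- 434; SSW <- 54.5
k <- 3; N <- 12                     # 3 groups of 4 observations
MSB <- SSB / (k - 1)                # 217
MSW <- SSW / (N - k)                # about 6.06
F_stat <- MSB / MSW                 # about 35.8
pf(F_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)  # about 5e-05
```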
One-way ANOVA can be viewed as a special case of multiple linear regression in which the predictor is categorical. Suppose the explanatory variable has \(k\) groups. Then we have the regression model
\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_{k-1}x_{k-1}+\varepsilon, \]
where each \(x_j\) is a dummy variable indicating membership in one of the groups, and one group is treated as the reference group.
In this formulation:

- \(\beta_0\) is the mean of the reference group;
- \(\beta_j\) is the difference between the mean of group \(j\) and the mean of the reference group.
Thus, testing whether all group means are equal is equivalent to testing \[ H_0:\beta_1=\beta_2=\cdots=\beta_{k-1}=0. \]
So the usual one-way ANOVA \(F\)-test is exactly the same as the overall \(F\)-test in the corresponding regression model with a categorical predictor.
In this sense, one-way ANOVA and one-way categorical regression are not different methods; they are simply two equivalent ways to describe the same model. ANOVA emphasizes partitioning variation into between-group and within-group components, while regression emphasizes coefficients and predictor effects.
Analysis of variance (ANOVA) and regression share a common origin in early 20th-century statistics. Regression was developed earlier by Francis Galton and later formalized by Karl Pearson in studying relationships between quantitative variables. ANOVA was introduced in the 1920s by Ronald A. Fisher to analyze experimental data by decomposing variability into systematic and random components. By the mid-20th century, especially through the work of C. R. Rao and others, ANOVA was recognized as a special case of the general linear model, showing that both methods are unified within the same mathematical framework.
Since the independent variables are dummy variables indicating the group index, the regression model simply fits each observation by its group mean. In other words, for an observation in the \(i\)-th group, the fitted value is the group mean \(\hat y_{ij}=\bar y_i\).
Then \[ \begin{split} SSR&=\sum_{i=1}^k\sum_{j=1}^{n_i}(\hat y_{ij}-\bar y)^2=\sum_{i=1}^k\sum_{j=1}^{n_i}(\bar y_i-\bar y)^2=\sum_{i=1}^kn_i(\bar y_i-\bar y)^2=SSB,\\ SSE&=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\hat y_{ij})^2=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i})^2=SSW. \end{split} \]
Therefore the One-Way ANOVA table is exactly the ANOVA table for this particular multiple linear regression model.
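This equivalence can be checked in R (a sketch with simulated data; the seed, group means, and sizes are arbitrary): the overall F-statistic reported by `summary(lm(...))` matches the F-statistic for the factor in the ANOVA table.

```r
# The regression overall F-test equals the one-way ANOVA F-test
set.seed(2)                           # arbitrary seed
y <- rnorm(15) + rep(0:2, each = 5)   # 3 groups with shifted means
g <- factor(rep(c("A", "B", "C"), each = 5))
fit <- lm(y ~ g)
F_reg <- unname(summary(fit)$fstatistic["value"])  # overall regression F
F_aov <- anova(fit)["g", "F value"]                # ANOVA F for the factor
all.equal(F_reg, F_aov)                            # TRUE
```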
The predictor is also called a factor, and the number of groups is the number of levels of the factor. As mentioned previously, the factor can be modeled by dummy variables. However, since there is only one factor, we treat it as a single factor variable with \(k-1\) degrees of freedom.
Therefore, although the regression form seems to have \(k-1\) variables, it is essentially a regression model with one variable carrying \(k-1\) degrees of freedom.
The summary table reports individual t-tests for each coefficient, while the Type III ANOVA table reports a nested-model F-test for each model term. They answer the same hypothesis only when the term has one degree of freedom. In that case, \[ F=t^2, \]
so the Type III test and the corresponding row in the summary table give identical results.
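A one-line check of this identity (a sketch with simulated data; the seed and coefficients are arbitrary):

```r
# For a single-degree-of-freedom term, the ANOVA F equals the squared t
set.seed(3)                  # arbitrary seed
x <- rnorm(20)
y <- 1 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)
t_val <- summary(fit)$coefficients["x", "t value"]
F_val <- anova(fit)["x", "F value"]
all.equal(t_val^2, F_val)    # TRUE
```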
However, for qualitative variables with multiple levels, the term typically has more than one degree of freedom. In that case:

- each row of the summary table tests a single dummy coefficient, i.e., the difference between one level and the reference level;
- the Type III ANOVA row tests all of the term's coefficients jointly.
So they generally answer different questions and may give different conclusions.
In R, we use factor to deal with qualitative variables.
x <- factor(c(1,2,3,1,2,3,1,1,2))
x
[1] 1 2 3 1 2 3 1 1 2
Levels: 1 2 3
When fitting a linear regression model with lm, any factor variable is automatically converted into dummy variables.
Example 10.2 We can generate data for 3 groups.
n <- 5
group <- factor(rep(c("A", "B", "C"), each = n))
y <- c(
rnorm(n, mean = 0, sd = 1),
rnorm(n, mean = 1, sd = 1),
rnorm(n, mean = 2, sd = 1)
)
dat <- data.frame(group, y)
dat
   group           y
1 A -0.03115931
2 A -0.67955491
3 A -0.11465412
4 A -1.45579668
5 A 0.27213029
6 B 1.09206072
7 B 0.50478897
8 B 0.25566921
9 B 2.27738793
10 B 1.51993402
11 C 4.26617629
12 C -0.37491983
13 C 3.27276584
14 C 1.56154066
15 C 2.36966052
# This is a function to generate arbitrary groups of data with customized means and standard deviations.
gen_groups <- function(mu, sd, n = 10, seed = NULL) {
if (!is.null(seed)) set.seed(seed)
Lmu <- length(mu)
Lsd <- length(sd)
if (Lmu == Lsd) {
k <- Lmu
} else if (Lmu > Lsd && Lmu %% Lsd == 0) {
sd <- rep(sd, length.out = Lmu)
k <- Lmu
} else if (Lsd > Lmu && Lsd %% Lmu == 0) {
mu <- rep(mu, length.out = Lsd)
k <- Lsd
} else {
stop("Lengths of mu and sd must match or be multiples (broadcastable).")
}
if (length(n) == 1) n <- rep(n, k)
if (length(n) != k) {
stop("n must be length 1 or same as number of groups")
}
y <- unlist(Map(
function(m, s, ni) {
rnorm(ni, mean = m, sd = s)
},
mu, sd, n
))
group <- factor(rep(seq_len(k), times = n))
data.frame(group, y)
}
dat <- gen_groups(c(1, 2, 3), c(2, 1, 1), n = 10)
fit <- lm(y ~ group, data = dat)
summary(fit)
Call:
lm(formula = y ~ group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-2.5940 -0.6413 0.1506 0.5320 2.0471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4018 0.5319 -0.755 0.46453
groupB 1.5318 0.7522 2.037 0.06439 .
groupC 2.6209 0.7522 3.484 0.00451 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.189 on 12 degrees of freedom
Multiple R-squared: 0.5053, Adjusted R-squared: 0.4228
F-statistic: 6.128 on 2 and 12 DF, p-value: 0.01466
R automatically converts the factor variable into dummy variables. The first level of the factor (which is group A in this example) is used as the base group.
We can relevel the factor to change the base group.
dat$group <- relevel(dat$group, ref = "B")
fit <- lm(y ~ group, data = dat)
summary(fit)
Call:
lm(formula = y ~ group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-2.5940 -0.6413 0.1506 0.5320 2.0471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1300 0.5319 2.125 0.0551 .
groupA -1.5318 0.7522 -2.037 0.0644 .
groupC 1.0891 0.7522 1.448 0.1733
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.189 on 12 degrees of freedom
Multiple R-squared: 0.5053, Adjusted R-squared: 0.4228
F-statistic: 6.128 on 2 and 12 DF, p-value: 0.01466
In this case, all three types of ANOVA tables as well as the ANOVA table for the linear regression are the same, because the model contains only one factor variable. They are different from the summary table, because the summary table tests the individual dummy-variable coefficients, while the ANOVA tables test the factor as a whole.
anova(fit)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group 2 17.335 8.6677 6.1285 0.01466 *
Residuals 12 16.972 1.4143
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car)
Loading required package: carData
Anova(fit, type = 2)
Anova Table (Type II tests)
Response: y
Sum Sq Df F value Pr(>F)
group 17.335 2 6.1285 0.01466 *
Residuals 16.972 12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit, type = 3)
Anova Table (Type III tests)
Response: y
Sum Sq Df F value Pr(>F)
(Intercept) 6.3841 1 4.5139 0.05508 .
group 17.3355 2 6.1285 0.01466 *
Residuals 16.9721 12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_alt(fit)
Analysis of Variance Table
Df SS MS F P
Source 2 17.335 8.6677 6.1285 0.014658
Error 12 16.972 1.4143
Total 14 34.308
Categorical variables can be combined with quantitative variables in the same linear model. For example,
\[ y=\beta_0+\beta_1x+\beta_2z_1+\beta_3z_2+\varepsilon, \]
where \(x\) is quantitative and \(z_1,z_2\) are dummy variables for a three-level factor.
In this model:

- all three groups share a common slope \(\beta_1\) for \(x\);
- each group has its own intercept: \(\beta_0\) for the reference group, and \(\beta_0+\beta_2\), \(\beta_0+\beta_3\) for the other two groups, so the fitted lines are parallel.
If we believe the slope should also vary by group, then we add interaction terms:
\[ y=\beta_0+\beta_1x+\beta_2z_1+\beta_3z_2+\beta_4xz_1+\beta_5xz_2+\varepsilon. \]
Now each group may have a different regression line.
Example 10.3
set.seed(1)
n <- 40
group <- factor(rep(c("A", "B", "C"), each = n))
x <- rnorm(3 * n)
z1 <- as.numeric(group == "B")
z2 <- as.numeric(group == "C")
y <- 2 + 1.5 * x + 2 * z1 - 1 * z2 + 1 * x * z1 - 0.5 * x * z2 + rnorm(3 * n)
dat <- data.frame(y, x, group)
fit <- lm(y ~ x * group, data = dat)
summary(fit)
Call:
lm(formula = y ~ x * group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-2.63835 -0.61723 0.05876 0.53719 2.46821
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6766 0.1587 10.567 < 2e-16 ***
x 1.6415 0.1802 9.108 3.31e-15 ***
groupB 2.5294 0.2247 11.255 < 2e-16 ***
groupC -0.7645 0.2248 -3.400 0.000929 ***
x:groupB 0.4677 0.2495 1.874 0.063463 .
x:groupC -0.2891 0.2600 -1.112 0.268548
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.998 on 114 degrees of freedom
Multiple R-squared: 0.8253, Adjusted R-squared: 0.8177
F-statistic: 107.7 on 5 and 114 DF, p-value: < 2.2e-16
anova_alt(fit)
Analysis of Variance Table
Df SS MS F P
Source 5 536.40 107.280 107.72 1.6232e-41
Error 114 113.53 0.996
Total 119 649.93
In the summary table, the interaction terms (x:groupB and x:groupC) have p-values greater than 0.05, so the individual t-tests give insufficient evidence that either slope differs from that of group A; the main differences among groups are primarily in their intercepts. On the other hand, the x:group term is significant when tested jointly, indicating that the slopes are not exactly the same across all groups.