10  Regression with qualitative variables

When some predictors are qualitative, the regression framework stays almost the same. The basic idea is still

\[ y=\text{systematic part}+\varepsilon. \]

The main difference is that a categorical predictor cannot be inserted directly as a number. Instead, we represent group membership by indicator variables (also called dummy variables), and then fit the model using the same least squares machinery as before.

Suppose a predictor has \(g\) categories. After choosing one category as the baseline, we create \(g-1\) dummy variables:

\[ x_j= \begin{cases} 1,& \text{if the observation is in category }j,\\ 0,& \text{otherwise.} \end{cases} \]

Then the categorical regression model has the form

\[ y=\beta_0+\beta_1x_1+\cdots+\beta_{g-1}x_{g-1}+\varepsilon. \]

Here \(\beta_0\) is the mean response in the baseline category, and each \(\beta_j\) is the difference between the mean of category \(j\) and the baseline mean. Therefore it is just ordinary linear regression with specially coded predictors.

10.1 A one-way regression model with qualitative variables

Suppose a factor has three groups: A, B, and C. Take A as the baseline, and define

\[ x_1=I(\text{group B}),\quad x_2=I(\text{group C}). \]

Then the model is

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\varepsilon. \]

The fitted means are

\[ \begin{aligned} E(y\mid A)&=\beta_0,\\ E(y\mid B)&=\beta_0+\beta_1,\\ E(y\mid C)&=\beta_0+\beta_2. \end{aligned} \]

Therefore:

  • \(\beta_0\) is the mean of group A;
  • \(\beta_1\) is the difference between the mean of group B and the mean of group A;
  • \(\beta_2\) is the difference between the mean of group C and the mean of group A.

This is the simplest form of categorical regression.
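These interpretations can be verified with a quick sketch in R (toy data, made up for illustration):

```r
# Toy data: group means are 2 (A), 5 (B), and 9 (C)
y <- c(1, 2, 3,    # group A
       4, 6,       # group B
       7, 9, 11)   # group C
g <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))

fit <- lm(y ~ g)
coef(fit)
# (Intercept) = 2 : mean of the baseline group A
# gB          = 3 : mean(B) - mean(A) = 5 - 2
# gC          = 7 : mean(C) - mean(A) = 9 - 2
```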

10.2 One-Way ANOVA

One-way ANOVA (analysis of variance) is a classical method for comparing the means of several populations. Suppose we observe a quantitative response variable across \(k\) independent groups (or treatments). The primary question is whether the population means are equal:

\[ H_0:\ \mu_1=\mu_2=\cdots=\mu_k \qquad\text{vs}\qquad H_a:\ \text{not all }\mu_i\text{ are equal}. \]

The traditional one-way ANOVA model is

\[ y_{ij}=\mu+\tau_i+\varepsilon_{ij},\qquad i=1,\ldots,k,\ j=1,\ldots,n_i, \]

where

  • \(\mu\) is the overall mean,
  • \(\tau_i\) is the effect of group \(i\) (with \(\sum_{i=1}^k \tau_i=0\)), and hence the mean of group \(i\) is \(\mu_i=\mu+\tau_i\),
  • \(\varepsilon_{ij}\overset{iid}{\sim}N(0,\sigma^2)\).

ANOVA partitions the total variability into between-group and within-group components and uses an \(F\)-test to determine whether the group means differ significantly.

Suppose we observe

\[ y_{ij},\quad i=1,\dots,k,\ j=1,\dots,n_i, \]

where:

  • \(k\) = number of groups
  • \(n_i\) = sample size in group \(i\)
  • \(N=\sum n_i\) = total sample size

Define:

  • group means
    \[ \bar y_i=\frac{1}{n_i}\sum_{j=1}^{n_i}y_{ij} \]
  • overall mean
    \[ \bar y=\frac{1}{N}\sum_{i=1}^k\sum_{j=1}^{n_i}y_{ij}. \]
  • Between-group sum of squares: this measures how far group means are from the overall mean. \[ SSB=\sum_{i=1}^k n_i(\bar y_i-\bar y)^2. \]
  • Between-group mean square: the between-group variance in ANOVA. \[ MSB=\frac{SSB}{k-1}. \]
  • Within-group sum of squares: this measures variation within each group. \[ SSW=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\bar y_i)^2. \]
  • Within-group mean square: the within-group variance, which serves as the pooled estimate of \(\sigma^2\). \[ MSW=\frac{SSW}{N-k}. \]
  • The one-way ANOVA test statistic is \[ F=\frac{MSB}{MSW}. \]

All the quantities defined above are summarized in the following standard One-Way ANOVA table:

Source SS df MS F
Between (Groups) \(SSB\) \(k-1\) \(\frac{SSB}{k-1}\) \(\frac{MSB}{MSW}\)
Within (Error) \(SSW\) \(N-k\) \(\frac{SSW}{N-k}\)
Total \(SST\) \(N-1\)

As with the ANOVA tables discussed earlier, we have the variance decomposition \(SST=SSB+SSW\).

Example 10.1 Scores:

  • A: 78, 82, 85, 80
  • B: 88, 90, 85, 87
  • C: 72, 75, 70, 74

Group means:

\[ \bar y_A=81.25,\quad \bar y_B=87.5,\quad \bar y_C=72.75,\quad \bar y=80.5. \]

\[ \begin{split} SSB&=4(81.25-80.5)^2+4(87.5-80.5)^2+4(72.75-80.5)^2=438.5,\\ MSB&=\frac{438.5}{2}=219.25,\\ SSW&=26.75+13+14.75=54.5,\\ MSW&=\frac{54.5}{9}\approx 6.056. \end{split} \] Then \[ F=\frac{MSB}{MSW}=\frac{219.25}{6.056}\approx 36.21. \] Under the null hypothesis \(H_0:\mu_A=\mu_B=\mu_C\), we have

\[ F\sim F_{2,9}. \]

So the p-value is

\[ p=P(F_{2,9}\ge 36.21)\approx 5.0\times 10^{-5}. \]

Since the p-value is much smaller than \(0.05\), we reject \(H_0\) and conclude that not all group means are equal.
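The hand computation is easy to check in R; here is a minimal sketch using the scores directly:

```r
scores <- list(A = c(78, 82, 85, 80),
               B = c(88, 90, 85, 87),
               C = c(72, 75, 70, 74))
y <- unlist(scores)
g <- factor(rep(names(scores), each = 4))

n_i  <- sapply(scores, length)          # 4 4 4
ybar <- mean(y)                         # 80.5

# Between- and within-group sums of squares
SSB <- sum(n_i * (sapply(scores, mean) - ybar)^2)             # 438.5
SSW <- sum(sapply(scores, function(v) sum((v - mean(v))^2)))  # 54.5

MSB   <- SSB / (3 - 1)
MSW   <- SSW / (12 - 3)
Fstat <- MSB / MSW                      # about 36.21
pval  <- pf(Fstat, df1 = 2, df2 = 9, lower.tail = FALSE)

# anova(lm(y ~ g)) reproduces the same F statistic and p-value
```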

10.3 Relation between the two models

One-way ANOVA can be viewed as a special case of multiple linear regression in which the predictor is categorical. Suppose the explanatory variable has \(k\) groups. Then we have the regression model

\[ y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_{k-1}x_{k-1}+\varepsilon, \]

where each \(x_j\) is a dummy variable indicating membership in one of the groups, and one group is treated as the reference group.

In this formulation:

  • \(\beta_0\) is the mean of the reference group,
  • each \(\beta_j\) measures the difference between the mean of group \(j\) and the mean of the reference group.

Thus, testing whether all group means are equal is equivalent to testing \[ H_0:\beta_1=\beta_2=\cdots=\beta_{k-1}=0. \]

So the usual one-way ANOVA \(F\)-test is exactly the same as the overall \(F\)-test in the corresponding regression model with a categorical predictor.

In this sense, one-way ANOVA and one-way categorical regression are not different methods; they are simply two equivalent ways to describe the same model. ANOVA emphasizes partitioning variation into between-group and within-group components, while regression emphasizes coefficients and predictor effects.
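The equivalence can be confirmed numerically; a small sketch with toy data (oneway.test with var.equal = TRUE performs the classical one-way ANOVA F-test):

```r
y <- c(1, 2, 3, 4, 6, 7, 9, 11)
g <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))

# Overall F-test from the regression with a categorical predictor
F_reg <- summary(lm(y ~ g))$fstatistic["value"]

# Classical one-way ANOVA F-test, assuming equal variances
F_aov <- oneway.test(y ~ g, var.equal = TRUE)$statistic

unname(F_reg) - unname(F_aov)   # essentially zero: the two tests agree
```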

Tip: History of ANOVA and regression

Analysis of variance (ANOVA) and regression share a common origin in early 20th-century statistics. Regression was developed earlier by Francis Galton and later formalized by Karl Pearson in studying relationships between quantitative variables. ANOVA was introduced in the 1920s by Ronald A. Fisher to analyze experimental data by decomposing variability into systematic and random components. By the mid-20th century, especially through the work of C. R. Rao and others, ANOVA was recognized as a special case of the general linear model, showing that both methods are unified within the same mathematical framework.

Note: Relation between ANOVA tables

Since the independent variables are dummy variables indicating group membership, the regression model simply models the response by its group mean. In other words, for an observation in the \(i\)-th group, the fitted value is the group mean \(\hat y_{ij}=\bar y_i\).

Then \[ \begin{split} SSR&=\sum_{i=1}^k\sum_{j=1}^{n_i}(\hat y_{ij}-\bar y)^2=\sum_{i=1}^k\sum_{j=1}^{n_i}(\bar y_i-\bar y)^2=\sum_{i=1}^kn_i(\bar y_i-\bar y)^2=SSB,\\ SSE&=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\hat y_{ij})^2=\sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i})^2=SSW. \end{split} \]

Therefore the One-Way ANOVA table is exactly the ANOVA table for this particular multiple linear regression model.
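The identity \(\hat y_{ij}=\bar y_i\), and hence \(SSE=SSW\), can be checked directly in R (ave(y, g) returns each observation's group mean):

```r
y <- c(1, 2, 3, 4, 6, 7, 9, 11)
g <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))

fit <- lm(y ~ g)

# Fitted values are exactly the group means
all.equal(unname(fitted(fit)), ave(y, g))   # TRUE

# Consequently SSE from the regression equals SSW from ANOVA
SSE <- sum(residuals(fit)^2)
SSW <- sum((y - ave(y, g))^2)
```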

Caution: Degrees of freedom of predictors

The predictor is also called a factor, and the number of groups is the number of levels of the factor. As mentioned previously, the factor can be modeled by dummy variables. However, since there is only one factor, we treat it as a single factor variable with \(k-1\) degrees of freedom.

Therefore, although the regression form appears to contain \(k-1\) variables, it is essentially a regression model with a single predictor carrying \(k-1\) degrees of freedom.

Note: Type III ANOVA table and the summary table

The summary table reports individual t-tests for each coefficient, while the Type III ANOVA table reports a nested-model F-test for each model term. They answer the same hypothesis only when the term has one degree of freedom. In that case, \[ F=t^2, \]

so the Type III test and the corresponding row in the summary table give identical results.

However, for qualitative variables with multiple levels, the term typically has more than one degree of freedom. In that case:

  • the summary table tests coefficients individually (e.g., each group vs the reference),
  • the Type III ANOVA table tests the term jointly.

So they generally answer different questions and may give different conclusions.
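For a term with one degree of freedom, the \(F=t^2\) identity can be verified directly; a minimal sketch with a single quantitative predictor (using base-R anova, which coincides with the Type III test when the model has only this one term):

```r
set.seed(42)  # arbitrary seed, just for reproducibility
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)

fit <- lm(y ~ x)

t_val <- summary(fit)$coefficients["x", "t value"]
F_val <- anova(fit)["x", "F value"]

c(t_squared = t_val^2, F = F_val)   # the two values agree
```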

10.4 R code

In R, we use factor to deal with qualitative variables.

x <- factor(c(1,2,3,1,2,3,1,1,2))
x
[1] 1 2 3 1 2 3 1 1 2
Levels: 1 2 3

When fitting a linear regression model with lm, any variable that is a factor is automatically expanded into dummy variables.
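To see exactly which dummy variables are created, model.matrix shows the design matrix that lm will use; a short sketch with the factor from above:

```r
x <- factor(c(1, 2, 3, 1, 2, 3, 1, 1, 2))

# Columns: intercept plus one dummy for each non-baseline level
model.matrix(~ x)
# x2 = 1 when x == 2, x3 = 1 when x == 3; level 1 is the baseline
```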

Example 10.2 We generate data with 3 groups.

n <- 5 
group <- factor(rep(c("A", "B", "C"), each = n))

y <- c(
    rnorm(n, mean = 0, sd = 1),
    rnorm(n, mean = 1, sd = 1),
    rnorm(n, mean = 2, sd = 1)
)

dat <- data.frame(group, y)
dat
   group           y
1      A -0.03115931
2      A -0.67955491
3      A -0.11465412
4      A -1.45579668
5      A  0.27213029
6      B  1.09206072
7      B  0.50478897
8      B  0.25566921
9      B  2.27738793
10     B  1.51993402
11     C  4.26617629
12     C -0.37491983
13     C  3.27276584
14     C  1.56154066
15     C  2.36966052
# This is a function to generate arbitrary groups of data with customized mean and standard deviations.

gen_groups <- function(mu, sd, n = 10, seed = NULL) {
    if (!is.null(seed)) set.seed(seed)

    Lmu <- length(mu)
    Lsd <- length(sd)

    if (Lmu == Lsd) {
        k <- Lmu
    } else if (Lmu > Lsd && Lmu %% Lsd == 0) {
        sd <- rep(sd, length.out = Lmu)
        k <- Lmu
    } else if (Lsd > Lmu && Lsd %% Lmu == 0) {
        mu <- rep(mu, length.out = Lsd)
        k <- Lsd
    } else {
        stop("Lengths of mu and sd must match or be multiples (broadcastable).")
    }

    if (length(n) == 1) n <- rep(n, k)

    if (length(n) != k) {
        stop("n must be length 1 or same as number of groups")
    }

    y <- unlist(Map(
        function(m, s, ni) {
            rnorm(ni, mean = m, sd = s)
        },
        mu, sd, n
    ))

    group <- factor(rep(LETTERS[seq_len(k)], times = n))

    data.frame(group, y)
}

dat <- gen_groups(c(1, 2, 3), c(2, 1, 1), n = 5)
fit <- lm(y ~ group, data=dat)
summary(fit)

Call:
lm(formula = y ~ group, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5940 -0.6413  0.1506  0.5320  2.0471 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.4018     0.5319  -0.755  0.46453   
groupB        1.5318     0.7522   2.037  0.06439 . 
groupC        2.6209     0.7522   3.484  0.00451 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.189 on 12 degrees of freedom
Multiple R-squared:  0.5053,    Adjusted R-squared:  0.4228 
F-statistic: 6.128 on 2 and 12 DF,  p-value: 0.01466

R automatically converts the factor variable into dummy variables. The first level of the factor (which is group A in this example) is used as the base group.

We can relevel the factor to change the base group.

dat$group <- relevel(dat$group, ref = "B")
fit <- lm(y ~ group, data = dat)
summary(fit)

Call:
lm(formula = y ~ group, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5940 -0.6413  0.1506  0.5320  2.0471 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.1300     0.5319   2.125   0.0551 .
groupA       -1.5318     0.7522  -2.037   0.0644 .
groupC        1.0891     0.7522   1.448   0.1733  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.189 on 12 degrees of freedom
Multiple R-squared:  0.5053,    Adjusted R-squared:  0.4228 
F-statistic: 6.128 on 2 and 12 DF,  p-value: 0.01466

In this case, all three types of ANOVA tables as well as the ANOVA table for the linear regression are the same, because the model contains only one factor variable. They are different from the summary table, because the summary table tests the individual dummy-variable coefficients, while the ANOVA tables test the factor as a whole.

anova(fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)  
group      2 17.335  8.6677  6.1285 0.01466 *
Residuals 12 16.972  1.4143                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car)
Loading required package: carData
Anova(fit, type = 2)
Anova Table (Type II tests)

Response: y
          Sum Sq Df F value  Pr(>F)  
group     17.335  2  6.1285 0.01466 *
Residuals 16.972 12                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit, type = 3)
Anova Table (Type III tests)

Response: y
             Sum Sq Df F value  Pr(>F)  
(Intercept)  6.3841  1  4.5139 0.05508 .
group       17.3355  2  6.1285 0.01466 *
Residuals   16.9721 12                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_alt(fit)
Analysis of Variance Table

       Df     SS     MS      F        P
Source  2 17.335 8.6677 6.1285 0.014658
Error  12 16.972 1.4143                
Total  14 34.308                       

10.5 Categorical predictors together with quantitative predictors

Categorical variables can be combined with quantitative variables in the same linear model. For example,

\[ y=\beta_0+\beta_1x+\beta_2z_1+\beta_3z_2+\varepsilon, \]

where \(x\) is quantitative and \(z_1,z_2\) are dummy variables for a three-level factor.

In this model:

  • the factor shifts the intercept across groups;
  • the slope of \(x\) is the same for all groups.

If we believe the slope should also vary by group, then we add interaction terms:

\[ y=\beta_0+\beta_1x+\beta_2z_1+\beta_3z_2+\beta_4xz_1+\beta_5xz_2+\varepsilon. \]

Now each group may have a different regression line.

Example 10.3  

set.seed(1)

n <- 40
group <- factor(rep(c("A", "B", "C"), each = n))

x <- rnorm(3 * n)

z1 <- as.numeric(group == "B")
z2 <- as.numeric(group == "C")

y <- 2 + 1.5 * x + 2 * z1 - 1 * z2 + 1 * x * z1 - 0.5 * x * z2 + rnorm(3 * n)

dat <- data.frame(y, x, group)

fit <- lm(y ~ x * group, data = dat)
summary(fit)

Call:
lm(formula = y ~ x * group, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.63835 -0.61723  0.05876  0.53719  2.46821 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.6766     0.1587  10.567  < 2e-16 ***
x             1.6415     0.1802   9.108 3.31e-15 ***
groupB        2.5294     0.2247  11.255  < 2e-16 ***
groupC       -0.7645     0.2248  -3.400 0.000929 ***
x:groupB      0.4677     0.2495   1.874 0.063463 .  
x:groupC     -0.2891     0.2600  -1.112 0.268548    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.998 on 114 degrees of freedom
Multiple R-squared:  0.8253,    Adjusted R-squared:  0.8177 
F-statistic: 107.7 on 5 and 114 DF,  p-value: < 2.2e-16
anova_alt(fit)
Analysis of Variance Table

        Df     SS      MS      F          P
Source   5 536.40 107.280 107.72 1.6232e-41
Error  114 113.53   0.996                  
Total  119 649.93                          
  • Group A is the base group, with fitted line \(\operatorname{E}(y)=1.6766+1.6415x\).
  • Compared with A, group B's intercept is larger by 2.5294 and its slope is larger by 0.4677, so the fitted line for group B is \(\operatorname{E}(y)=4.2060+2.1092x\).
  • Compared with A, group C's intercept is smaller by 0.7645 and its slope is smaller by 0.2891, so the fitted line for group C is \(\operatorname{E}(y)=0.9121+1.3524x\).
  • From the summary table, both interaction terms (x:groupB and x:groupC) have p-values greater than 0.05, so no single slope difference is individually significant.
  • From the ANOVA table, however, the interaction term x:group is significant when tested jointly. Thus, although no individual slope difference is strong enough to be detected on its own, the slopes are not identical across groups.