Appendix C — Types of Errors

C.1 Hypothesis tests

When we analyze a dataset, we often formulate a claim and want to determine whether the data provide sufficient evidence to support it. The typical process is:

  1. Formulate a null hypothesis \(H_0\) and an alternative hypothesis \(H_a\), which are mutually exclusive and collectively exhaustive.
  2. Compute a test statistic \(T\) from the data. Typically, values of \(T\) that are unlikely under \(H_0\) provide evidence against \(H_0\).
  3. Choose a cutoff (critical value) \(c\) based on a pre-specified significance level \(\alpha\), and reject \(H_0\) if \(T\) falls in the rejection region (for example, if \(T>c\) in a right-tailed test).
  4. Alternatively, consider the sampling distribution of \(T\) under \(H_0\) with cdf \(F\). Then deciding whether \(T>c\) is equivalent to checking whether \(1-F(T)<\alpha\). The quantity \(1-F(T)\) represents the probability of observing a value of \(T\) at least as extreme as the observed one (for a right-tailed test).

Note that the rejection region depends on the setting. For example, we may reject \(H_0\) when \(T>c\) (right-tailed test), when \(T<c\) (left-tailed test), or when \(|T|>c\) (two-tailed test). However, once the rule is expressed in terms of the p-value, the decision criterion becomes unified: we reject \(H_0\) whenever \(p<\alpha\). This is why statisticians developed the p-value approach.

Definition C.1 (p-value) The probability, assuming \(H_0\) is true, of observing a test statistic at least as extreme as the one actually observed is called the p-value: \[ p=\Pr(\text{data at least as extreme as observed}\mid H_0). \] If the test statistic \(T\) has cdf \(F\) under \(H_0\) and the observed value is \(t_{\text{obs}}\), then for a right-tailed test \[ p=\Pr(T\geq t_{\text{obs}}\mid H_0)=1-F(t_{\text{obs}}). \]
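As a concrete sketch, the right-tailed p-value can be computed in R once the null distribution of \(T\) is known. Here we assume, purely for illustration, a statistic whose null distribution is standard normal; the observed value `t_obs` is hypothetical.

```r
# Hedged sketch: right-tailed p-value, assuming T ~ N(0, 1) under H0.
t_obs <- 2.1              # hypothetical observed test statistic
p <- 1 - pnorm(t_obs)     # p = 1 - F(t_obs)
alpha <- 0.05
reject <- (p < alpha)     # reject H0 when the p-value is below alpha
```

The same pattern works for any null distribution: replace `pnorm` with the appropriate cdf (e.g. `pt` for a t-distribution).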

Theorem C.1 For a test based on a continuous test statistic, if we reject \(H_0\) whenever \(p \le \alpha\), then

\[ P(p \le \alpha \mid H_0)=\alpha. \]

Hence the test has Type I error probability exactly \(\alpha\).


Proof C.1 (Proof). \[ \begin{split} \Pr(\text{Type I error})&=\Pr(\text{reject } H_0\mid H_0\text{ is true})=\Pr(p\leq \alpha\mid H_0)\\ &=\Pr(1-F(T)\leq \alpha\mid H_0)=\Pr(F(T)\geq 1-\alpha\mid H_0)\\ &=1-\Pr(F(T)< 1-\alpha\mid H_0)\\ &=1-(1-\alpha)=\alpha. \end{split} \]

Here we use the Probability Integral Transform theorem: since \(T\) is continuous, \(F(T)\sim \mathrm{Unif}(0,1)\) under \(H_0\), so \(\Pr(F(T)< 1-\alpha\mid H_0)=1-\alpha\).

Theorem C.2 (Probability Integral Transform Theorem) If \(X\) is continuous with cdf \(F_X\), then \[ U=F_X(X)\sim \mathrm{Unif}(0,1). \]
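A quick simulation sketch of this theorem in R: draw from an arbitrary continuous distribution (the mean and standard deviation below are illustrative choices), push the draws through their own cdf, and check that the result looks uniform on \((0,1)\).

```r
# Sketch: Probability Integral Transform, with X ~ N(3, 2^2) chosen arbitrarily.
set.seed(1)
x <- rnorm(10000, mean = 3, sd = 2)
u <- pnorm(x, mean = 3, sd = 2)   # U = F_X(X)
mean(u)                            # should be close to 1/2 if U ~ Unif(0, 1)
# ks.test(u, "punif")              # a more formal check of uniformity
```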

C.2 Errors

We already mentioned Type I error a few times in the previous section. Here we give a more systematic discussion.

Suppose we want to detect a signal to determine whether a certain phenomenon occurs. There are four possible outcomes:

  • It happens and we detect the signal: True positive
  • It happens and we fail to detect the signal: False negative
  • It doesn’t happen and we do not detect the signal: True negative
  • It doesn’t happen and we detect a signal: False positive
|                   | Detect signal       | Do not detect signal |
|-------------------|---------------------|----------------------|
| It happens        | True positive (TP)  | False negative (FN)  |
| It doesn’t happen | False positive (FP) | True negative (TN)   |

Both false positives and false negatives are errors.

  • A false positive is called a Type I error
  • A false negative is called a Type II error

In hypothesis testing, the null hypothesis \(H_0\) typically represents the default position, and we look for evidence (a “signal”) against it. As in any detection problem, we may make mistakes. Rewriting the table in terms of hypothesis testing:

|                   | Reject \(H_0\)     | Fail to reject \(H_0\) |
|-------------------|--------------------|------------------------|
| \(H_0\) is false  | True positive (TP) | Type II error          |
| \(H_0\) is true   | Type I error       | True negative (TN)     |

To evaluate the effectiveness of a hypothesis test, we study two cases:

  • If \(H_0\) is true, we either make a correct decision or commit a Type I error. The probability of a Type I error is exactly \(\alpha\).
  • If \(H_0\) is false, we either make a correct decision or commit a Type II error. The probability of a Type II error is denoted by \(\beta\). Note that \(\beta\) depends on the specific parameter value under \(H_a\).

Definition C.2 (Power) Suppose we conduct a hypothesis test concerning a parameter \(\mu\). The power function of the test is \[ \operatorname{power}(\mu)=\Pr(\text{reject }H_0\mid \mu). \]

If \(\mu\) corresponds to a value under which \(H_0\) is false, then \(\operatorname{power}(\mu)\) is the probability of a true positive.

If \(\mu\) corresponds to a value under which \(H_0\) is true, then \(\operatorname{power}(\mu)=\alpha\) which is the Type I error rate.

| Truth            | Decision   | Probability                                   |
|------------------|------------|-----------------------------------------------|
| \(H_0\) true     | reject     | \(\alpha\)                                    |
| \(H_0\) true     | not reject | \(1-\alpha\)                                  |
| \(H_0\) false    | reject     | \(\operatorname{power}(\mu)\)                 |
| \(H_0\) false    | not reject | \(\beta(\mu)=1-\operatorname{power}(\mu)\)    |
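The power function can be evaluated numerically. Anticipating the one-sample t-test of the next section, the following R sketch computes \(\operatorname{power}(\mu)\) for a two-sided test; the sample size, variance, and significance level are illustrative assumptions.

```r
# Sketch: power function of a two-sided one-sample t-test,
# assuming n = 25, sigma = 1, alpha = 0.05 (illustrative values).
n <- 25; sigma <- 1; alpha <- 0.05
df <- n - 1
tcrit <- qt(1 - alpha / 2, df)
power_fn <- function(mu) {
  delta <- mu / (sigma / sqrt(n))  # noncentrality parameter
  1 - pt(tcrit, df, ncp = delta) + pt(-tcrit, df, ncp = delta)
}
power_fn(0)    # equals alpha when H0 is true
power_fn(0.5)  # probability of a true positive when mu = 0.5
```

Evaluating `power_fn` on a grid of \(\mu\) values traces out the whole power curve, which equals \(\alpha\) at \(\mu=0\) and increases toward 1 as \(|\mu|\) grows.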
Tip: Interpretation in Practice

In practice, we do not know whether \(H_0\) is true or false. Therefore, hypothesis testing is evaluated by:

  • its Type I error rate when \(H_0\) is true, and
  • its power when \(H_0\) is false.

Although one could formally combine \(1-\alpha\) and \(\operatorname{power}(\mu)\) into an overall success rate, classical (frequentist) hypothesis testing typically evaluates these two cases separately, since the true value of \(\mu\) is unknown.

C.3 One-Sample t-Test

Suppose we observe independent and identically distributed data \[ x_1,x_2,\ldots, x_n\sim N(\mu, \sigma^2). \] We want to test \[ H_0:\mu=0\quad \text{versus}\quad H_a:\mu\neq 0. \]

The samples are summarized: \[ \bar x=\frac{1}{n}\sum_{i=1}^n x_i,\quad s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2}. \] We use them to compute the t-statistic \[ T=\frac{\bar x}{s/\sqrt{n}}. \]

If \(H_0\) is true, i.e., \(\mu=0\), \[ T\sim t_{n-1}. \] Then for significance level \(\alpha\), we define the critical value \[ c_{\alpha}=t_{1-\alpha/2,n-1} \quad\text{ by }\quad F_t(t_{1-\alpha/2,n-1})=1-\alpha/2 \] where \(F_t\) is the cdf of \(t_{n-1}\). Then we reject \(H_0\) when \[ |T|>c_{\alpha}. \]

The critical value \(c_{\alpha}\) can be computed by the R code

tcrit <- qt(1 - alpha / 2, df)

If \(H_0\) is false, i.e., \(\mu\neq 0\), \[ T\sim t_{n-1}(\delta),\quad\text{where }\delta=\frac{\mu}{\sigma/\sqrt n}. \] Here \(t_{n-1}(\delta)\) is the noncentral t-distribution: a t-distribution shifted away from \(0\). Therefore the probability of falling into the rejection region becomes larger.

Since we still use the same critical value \(c_{\alpha}=t_{1-\alpha/2,n-1}\), we have \[ \begin{split} \operatorname{power}(\mu)&=\Pr(|T|>c_{\alpha}\mid \mu)\\ &=\Pr(T<-c_{\alpha}\mid \mu) + \Pr(T>c_{\alpha}\mid \mu)\\ &=F_{t(\delta)}(-c_{\alpha}) + 1 - F_{t(\delta)}(c_{\alpha}). \end{split} \] Its value can be evaluated by the R code

tcrit <- qt(1 - alpha / 2, df)
power <- 1 - pt(tcrit, df, ncp = delta) + pt(-tcrit, df, ncp = delta)

In either case, the test is performed by computing the t-statistic, computing the p-value from the \(t_{n-1}\) distribution, and comparing it with \(\alpha\).

tstat <- mean(x) / (sd(x) / sqrt(n))      # one-sample t-statistic
p <- 2 * (1 - pt(abs(tstat), df = n - 1)) # two-sided p-value
decision <- (p < alpha)                   # TRUE means reject H0

This process can be automated by the built-in function t.test.

t_test <- t.test(x, mu = 0)
decision <- (t_test$p.value < alpha)

C.4 Example

We generate data from rnorm(n, mu, sd) and conduct a one-sample \(t\)-test for \[ H_0:\mu=0 \qquad\text{versus}\qquad H_a:\mu\ne 0. \]

We consider two cases:

  • \(\mu=0\), where \(H_0\) is true;
  • \(\mu=0.5\), where \(H_0\) is false.

In both cases we set \(\sigma=1\), \(n=25\), and \(\alpha=0.05\).

When \(\mu=0.5\), the null hypothesis \(H_0\) is false, so we can compute the theoretical power of the test.

mu_sim <- 0.5
sigma <- 1
n <- 25
alpha <- 0.05

delta <- mu_sim / (sigma / (sqrt(n)))
df <- n - 1
tcrit <- qt(1 - alpha / 2, df)

theoretical_power <- 1 - pt(tcrit, df, ncp = delta) + pt(-tcrit, df, ncp = delta)
theoretical_power
[1] 0.6697077

So the theoretical power in this case is \(0.6697077\). In other words, when the true mean is \(\mu=0.5\), the probability of correctly rejecting \(H_0\) is \(0.6697077\).

When \(\mu=0\), \(H_0\) is true. In that case, the probability of incorrectly rejecting \(H_0\) is the Type I error rate, which is \(\alpha=0.05\).

We now demonstrate both scenarios by simulation.

set.seed(123)

B <- 200000
n <- 25
alpha <- 0.05

truth <- decision <- numeric(B)

for (i in 1:B) {
    if (runif(1) < 0.5) {
        mu_sim <- 0
        truth[i] <- 0 # H0 is true
    } else {
        mu_sim <- 0.5
        truth[i] <- 1 # Ha is true
    }

    x <- rnorm(n, mean = mu_sim, sd = 1)

    t_test <- t.test(x, mu = 0)
    decision[i] <- (t_test$p.value < alpha)
}

table(truth, decision)
     decision
truth     0     1
    0 94664  5045
    1 33060 67231

We can convert this table into conditional rejection frequencies.

  • Type I error rate (false positive rate) is \[ \frac{5045}{5045+94664}=0.0505972\approx \alpha=0.05. \]
  • The power (true positive rate) is \[ \frac{67231}{67231+33060}=0.6703593\approx \operatorname{power}(0.5)=0.6697077. \]

So the simulation agrees with the theory:

  • when \(H_0\) is true, the rejection rate is about \(\alpha\), and
  • when \(H_0\) is false with \(\mu=0.5\), the rejection rate is about the theoretical power.

C.5 A Little Bayesian

One natural question is: in practice, if a hypothesis test leads us to reject \(H_0\) at significance level \(\alpha\), how confident should we be that \(H_0\) is actually false?

This question corresponds to \[ \Pr(H_0\text{ is false}\mid \text{reject }H_0), \] which is typically not \(\alpha\) or \(\operatorname{power}\). Recall

  • \(\alpha=\Pr(\text{reject }H_0\mid H_0\text{ is true})\)
  • \(\operatorname{power}=\Pr(\text{reject }H_0\mid H_0\text{ is false})\)

Then by Bayes’ Theorem \[ \begin{split} \Pr(H_0\text{ is false}\mid \text{reject }H_0)&=\frac{\Pr(\text{reject }H_0\mid H_0\text{ is false})\Pr(H_0\text{ is false})}{\Pr(\text{reject }H_0\mid H_0\text{ is false})\Pr(H_0\text{ is false})+\Pr(\text{reject }H_0\mid H_0\text{ is true})\Pr(H_0\text{ is true})}\\ &=\frac{\operatorname{power}\cdot\Pr(H_0\text{ is false})}{\operatorname{power}\cdot\Pr(H_0\text{ is false})+\alpha\cdot\Pr(H_0\text{ is true})}. \end{split} \]
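As a numerical sketch of this formula, the following R function plugs in a power, a significance level, and a prior; the power of 0.67 and the prior probability of 0.5 are arbitrary illustrative assumptions.

```r
# Sketch: posterior probability that H0 is false given a rejection,
# via Bayes' theorem; all inputs below are illustrative assumptions.
post_false_given_reject <- function(power, alpha, prior_false) {
  power * prior_false / (power * prior_false + alpha * (1 - prior_false))
}
post <- post_false_given_reject(power = 0.67, alpha = 0.05, prior_false = 0.5)
post  # roughly 0.93 under these assumptions
```

Note how sensitive the answer is to the prior: with `prior_false = 0.1` instead, the same test yields a much smaller posterior probability.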

This can also be understood using the decision table:

|                  | Reject \(H_0\)                              | Fail to reject \(H_0\)                                  |
|------------------|---------------------------------------------|---------------------------------------------------------|
| \(H_0\) is false | True positive rate (\(\operatorname{power}\)) | Type II error rate (\(\beta=1-\operatorname{power}\)) |
| \(H_0\) is true  | Type I error rate (\(\alpha\))              | True negative rate (\(1-\alpha\))                       |

To interpret a rejection, we focus on the first column. Among all rejections, some are true positives and some are false positives. Therefore, after rejecting \(H_0\), our confidence that \(H_0\) is false depends not only on the test level \(\alpha\) and the power, but also on how likely it was beforehand that \(H_0\) was false.

In practice, however, this quantity is difficult to determine. The true power is usually unknown because it depends on the true effect size and variance, and the prior probability that \(H_0\) is false is rarely known in real applications.

C.6 Interpreting a Rejection (Frequentist View)

From a frequentist perspective, we do not assign probabilities to whether \(H_0\) is true or false. Instead, hypothesis testing evaluates the long-run behavior of a testing procedure under repeated sampling.

The performance of a test can be summarized by the decision table:

|                  | Reject \(H_0\)                              | Fail to reject \(H_0\)                                  |
|------------------|---------------------------------------------|---------------------------------------------------------|
| \(H_0\) is false | True positive rate (\(\operatorname{power}\)) | Type II error rate (\(\beta=1-\operatorname{power}\)) |
| \(H_0\) is true  | Type I error rate (\(\alpha\))              | True negative rate (\(1-\alpha\))                       |

In the frequentist framework:

  • \(\alpha=\Pr(\text{reject }H_0\mid H_0\text{ true})\) describes the long-run false positive rate.
  • \(\operatorname{power}=\Pr(\text{reject }H_0\mid H_0\text{ false})\) describes the ability of the test to detect real effects.

These quantities characterize the reliability of the testing procedure across repeated samples. However, they are interpreted separately and are not combined into a single number summarizing the “precision” of a test.

Therefore, after rejecting \(H_0\), the correct frequentist interpretation is:

If \(H_0\) were true, a result this extreme would occur with probability at most \(\alpha\).

In practice, a good test is one with a small Type I error rate and high power, making its decisions reliable across repeated samples.

Note

Frequentist inference evaluates long-run performance through error rates, whereas Bayesian inference evaluates how plausible \(H_0\) is after observing the data.