2  Probability

2.1 Notations

  • \(Y\): a random variable (capital letters)
  • \(y\): a sample of \(Y\)
  • \(\Pr\qty(Y\in A\mid\theta)\): the probability that \(Y\) lies in \(A\), given the parameter \(\theta\)
  • \(p(y\mid\theta)=\Pr\qty(Y=y\mid\theta)\): the probability mass function (discrete case)
  • \(f(y\mid\theta)=\displaystyle\dv{y}\Pr\qty(Y\leq y\mid\theta)\): the probability density function (continuous case)
  • \(\Exp\qty(Y)\): the expectation of \(Y\)
  • \(\Var\qty(Y)\): the variance of \(Y\)

2.2 Random variables

Definition 2.1 (Expectation) \[ \Exp\mqty[u(X)] = \int_{-\infty}^{\infty}u(x)f(x)\dl3x. \]

Definition 2.2  

  1. \(\mu=\Exp(X)\) is called the mean value of \(X\).
  2. \(\sigma^2=\Var(X)=\Exp\mqty[(X-\mu)^2]\) is called the variance of \(X\).
  3. \(M_X(t)=\Exp\mqty[\me^{tX}]\) is called the moment generating function of \(X\).

Proposition 2.1  

  1. \(\Exp\mqty[ag(X)+bh(X)]=a\Exp\mqty[g(X)]+b\Exp\mqty[h(X)]\).
  2. \(\Var\mqty[X]=\Exp\mqty[(X-\mu)^2]=\Exp(X^2)-\mu^2\).
  3. If \(X\) and \(Y\) are independent, \(\Var\mqty[aX+bY]=a^2\Var(X)+b^2\Var(Y)\).
Proof.

\[ \begin{split} \Exp\mqty[ag(X)+bh(X)]&=\int_{-\infty}^{\infty}\mqty[ag(x)+bh(x)]f(x)\dl3x\\ &=a\int_{-\infty}^{\infty}g(x)f(x)\dl3x+b\int_{-\infty}^{\infty}h(x)f(x)\dl3x\\ &=a\Exp\mqty[g(X)]+b\Exp\mqty[h(X)]. \end{split} \]

\[ \begin{split} \Exp\mqty[(X-\mu)^2]&=\Exp\mqty[\qty(X^2-2\mu X+\mu^2)]=\Exp(X^2)-2\mu\Exp(X)+\Exp(\mu^2)\\ &=\Exp(X^2)-2\mu\mu+\mu^2=\Exp(X^2)-\mu^2. \end{split} \]

\[ \begin{split} \Var\mqty[aX]&=\Exp(a^2X^2)-a^2\mu^2=a^2\qty(\Exp(X^2)-\mu^2)=a^2\Var(X),\\ \Var\mqty[X+Y]&=\Exp\qty((X+Y)^2)-\qty(\Exp(X+Y))^2\\ &=\Exp(X^2)+\Exp(Y^2)+2\Exp(XY)-\Exp(X)^2-\Exp(Y)^2-2\Exp(X)\Exp(Y)\\ &=\Var(X)+\Var(Y)+2\qty(\Exp(XY)-\Exp(X)\Exp(Y))\\ &=\Var(X)+\Var(Y), \end{split} \] where the last step uses \(\Exp(XY)=\Exp(X)\Exp(Y)\), which holds because \(X\) and \(Y\) are independent. Combining the two identities gives \(\Var\mqty[aX+bY]=a^2\Var(X)+b^2\Var(Y)\).
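Although not part of the proof, property 3 can also be checked numerically. Below is a quick Monte Carlo sanity check; the constants \(a=2\), \(b=3\), the distributions, and the sample size are arbitrary choices for illustration.

# Monte Carlo check of Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)
# for independent X and Y (a = 2, b = 3 chosen arbitrarily).
set.seed(1)
n <- 1e5
x <- rnorm(n, mean = 1, sd = 2)   # Var(X) = 4
y <- rexp(n, rate = 1)            # Var(Y) = 1
a <- 2; b <- 3
var(a * x + b * y)                # empirical variance of aX + bY
a^2 * var(x) + b^2 * var(y)       # a^2 Var(X) + b^2 Var(Y); both are close to 25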

Note

Assume \(X_1,\ldots, X_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\). Then \(\Var\qty(\frac1n\sum X_i)=\frac1{n^2}\sum\Var(X_i)=\sigma^2/n\) by Proposition 2.1. In other words, the larger the sample, the smaller the variance of the sample mean, which explains why, when possible, we want a large sample size. Note that we don’t specify any concrete distribution in this remark. This is related to estimation, which will be discussed in detail later.
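A small simulation sketch illustrates this (the standard normal and the sample sizes \(n=10\) and \(n=100\) are arbitrary choices):

# Variance of the sample mean shrinks like sigma^2 / n.
# Here X_i ~ N(0, 1), so sigma^2 = 1.
set.seed(1)
sample_mean_var <- function(n, reps = 10000) {
  means <- replicate(reps, mean(rnorm(n)))
  var(means)
}
sample_mean_var(10)    # should be close to 1/10
sample_mean_var(100)   # should be close to 1/100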

2.2.1 R code

R has built-in functions for many common distributions. The naming convention is a prefix (d-, p-, q-, or r-) followed by the name of the distribution.

  • d-: density function of the given distribution;
  • p-: cumulative distribution function of the given distribution;
  • q-: quantile function of the given distribution (which is the inverse of p- function);
  • r-: random sampling from the given distribution.

Example 2.1 (Normal distribution)  

# density of N(2, 0.5^2) evaluated on a grid
x <- seq(-4, 4, length=100)
y <- dnorm(x, mean=2, sd=0.5)
plot(x, y, type="l")

# cumulative distribution function of N(2, 0.5^2)
x <- seq(-4, 4, length=100)
y <- pnorm(x, mean=2, sd=0.5)
plot(x, y, type="l")

qnorm(0.025)
## [1] -1.959964
qnorm(0.5)
## [1] 0
qnorm(0.975)
## [1] 1.959964
rnorm(10)
##  [1]  0.3832064  0.4811536  0.9407225  0.3473203  0.9705847  0.8000386
##  [7]  0.2221451  1.1320445 -1.9443368 -0.3623345

2.3 Some relations

2.3.1 Covariance

Assume we have two random variables \(X\) and \(Y\), with paired realizations \(x_1,\ldots, x_n\) and \(y_1,\ldots, y_n\).

The sums of squares and cross-products are defined as

  • \(SS_{xx}=\sum_{i=1}^n(x_i-\bar x)^2\)
  • \(SS_{yy}=\sum_{i=1}^n(y_i-\bar y)^2\)
  • \(SS_{xy}=\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)\)

where \(\bar x\) and \(\bar y\) are the sample means. Note that if we divide these quantities by \(n-1\), we obtain the usual unbiased sample variances and covariance.

Definition 2.3 (Covariance and sample covariance)  

  • \(\Cov(X,Y)=\Exp[(X-\mu_X)(Y-\mu_Y)]\), where \(\mu_X=\Exp(X)\) and \(\mu_Y=\Exp(Y)\).
  • \(s_{xy}=\frac1{n-1}SS_{xy}=\frac1{n-1}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)\).
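The sample covariance defined above agrees with R's built-in cov(); here is a small check on simulated data (the data-generating choices are arbitrary).

# Sample covariance from the SS_xy formula versus R's cov().
set.seed(1)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)
n <- length(x)
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_xy / (n - 1)   # s_xy from the definition
cov(x, y)         # built-in sample covariance; the two values match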

2.3.2 Pearson correlation coefficient

Definition 2.4 (Pearson correlation coefficient \(r\)) \[ r=\cor(x,y)=\frac{SS_{xy}}{\sqrt{{SS_{xx}SS_{yy}}}}. \]

Theorem 2.1 (Properties of Correlation)  

  1. \(-1\leq r\leq 1\).
    • \(r>0\) means positive correlation.
    • \(r<0\) means negative correlation.
    • \(r=0\) means no positive or negative linear correlation.
  2. \(\cor(x,y)=\cor(y, x)\)
  3. The correlation coefficient \(r\) is unitless (it does not depend on the units of measurement used for the variables).

The Pearson correlation coefficient measures the strength of the linear relation between the two variables. The more closely the points follow a straight line, the stronger the linear relation and the closer \(|r|\) is to \(1\). If \(r=1\), \(X\) and \(Y\) have a perfect positive linear relation. Note that \(r\) is not the same as the slope.
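Properties 2 and 3 are easy to illustrate on simulated data (the data and the rescaling constants below are arbitrary): swapping the variables or changing their units (a positive linear rescaling) leaves \(r\) unchanged.

# Correlation is symmetric and unaffected by changes of units.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
cor(x, y)
cor(y, x)                      # symmetry: same value
cor(2.54 * x, 1000 * y + 5)    # positive rescaling and shifting: same value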

Example 2.2 Let \(x_i=i\) and \(y_i=0.001x_i\). Then the slope is \(0.001\) and \(r=1\). We may add a small noise term, \(y_i=0.001x_i+\varepsilon_i\) where \(\varepsilon_i\sim N(0,0.001)\). Then the slope is still \(0.001\), but \(r\) remains close to \(1\).

x <- seq(1, 100)
y <- 0.001 * x + rnorm(100, 0, 1e-3)   # line with slope 0.001 plus small noise
plot(x, y)
abline(0, 0.001, col='red')            # the true line: intercept 0, slope 0.001

cor(x, y)
## [1] 0.9995427

2.3.3 Independence

Independence is a relation between two random variables.

Definition 2.5 (Independence) Random variables \(X\) and \(Y\) are said to be independent if their joint distribution factorizes into the product of their marginal distributions. That is, for all \(x\), \(y\) \[ F_{X,Y}(x,y)=F_X(x)F_Y(y). \]

If a joint density exists, this is equivalently written as

\[ f_{X,Y}(x,y)=f_X(x)f_Y(y). \]

It is equivalent to say that for all measurable functions \(g\) and \(h\) for which the expectations exist,

\[ \Exp[g(X)h(Y)]=\Exp[g(X)]\Exp[h(Y)]. \]

Independence concerns the entire joint distribution, not just a summary statistic. Specifically, it cannot be derived from \(\Cov\) and \(\cor\).

Theorem 2.2 If \(X\) and \(Y\) are independent, then \(\Cov(X,Y)=0\) and \(\cor(X,Y)=0\). The converse is not true in general.

\(\Cov(X,Y)=0\) (equivalently, \(\cor(X,Y)=0\)) means that \(X\) and \(Y\) have no positive or negative linear relationship. However, this does not imply that \(X\) and \(Y\) are unrelated; they may still exhibit a nonlinear dependence.

Example 2.3  

x <- seq(-10, 10)
y <- abs(x)   # y is a deterministic, nonlinear function of x
cov(x, y)
## [1] 0
cor(x, y)
## [1] 0
plot(x, y)

They are not independent, since \(y\) is defined as a deterministic function of \(x\), yet their covariance and correlation are both \(0\).
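We can also see the failure of independence through the expectation criterion above. Treating the grid as a uniform discrete distribution and taking, for instance, \(g(x)=x^2\) and \(h(y)=y\) (an arbitrary choice), the factorization \(\Exp[g(X)h(Y)]=\Exp[g(X)]\Exp[h(Y)]\) does not hold:

# Same x and y as in Example 2.3, viewed as a uniform discrete distribution.
x <- seq(-10, 10)
y <- abs(x)
mean(x^2 * y)          # E[g(X) h(Y)]
mean(x^2) * mean(y)    # E[g(X)] E[h(Y)]; clearly different, so X and Y are dependent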

2.4 Some important distributions

2.4.1 Normal Distribution

Theorem 2.3 (Normal Sample Mean–Variance) Let \(X_1,\ldots,X_n\sim N(\mu,\sigma^2)\) i.i.d. Define

  • Sample mean: \(\bar X=\frac1n\sum_{i=1}^nX_i\)
  • Sample variance: \(s^2=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X)^2\)

Then

  1. \(\bar X\sim N(\mu,\frac{\sigma^2}n)\)
  2. \(\frac{(n-1)s^2}{\sigma^2}\sim \chi^2_{n-1}\)
  3. \(\bar X\) and \(s^2\) are independent.

The complete proof is lengthy and outside the scope of this course. Please refer to [1] for details.
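We can still make the statement concrete with a simulation sketch (the choices \(\mu=2\), \(\sigma=3\), \(n=10\) are arbitrary): the histogram of the standardized sample variance matches the \(\chi^2_{n-1}\) density.

# Simulate (n-1) s^2 / sigma^2 and compare with the chi-square(n-1) density.
set.seed(1)
mu <- 2; sigma <- 3; n <- 10
stat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (n - 1) * var(x) / sigma^2
})
hist(stat, breaks = 50, freq = FALSE)
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")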

2.4.2 Student’s t-Distribution

Theorem 2.4 (Student’s t-Distribution [1]) Let

  • \(Z\sim N(0,1)\) be a standard normal random variable
  • \(U\sim \chi_{\nu}^2\) be a chi-square random variable with \(\nu\) degrees of freedom.
  • \(Z\) and \(U\) are independent.

Then \[ T=\frac{Z}{\sqrt{U/\nu}} \] has a Student’s t-distribution with \(\nu\) degrees of freedom: \(T\sim t_{\nu}\).

These two theorems are usually used together. Let \(X_1,\ldots,X_n\sim N(\mu,\sigma^2)\) i.i.d. Then

  • \(\bar X\sim N(\mu,\sigma^2/n)\);
  • \(U=\frac{(n-1)s^2}{\sigma^2}\sim \chi^2_{n-1}\).

Therefore

  • \(Z=\frac{\bar X-\mu}{\sigma/\sqrt n}\sim N(0,1)\);
  • \(Z\) and \(U\) are independent.

So \[ T=\frac{Z}{\sqrt{U/\nu}}=\frac{\frac1{\sigma/\sqrt n}(\bar X-\mu)}{\sqrt{\frac{(n-1)s^2}{\sigma^2}/(n-1)}}=\frac{\bar X-\mu}{s/\sqrt n}\sim t_{n-1}. \]

In other words, if we standardize the sample mean using the sample standard deviation, we obtain a statistic that follows a t-distribution. This result is mainly used in the t-test.
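The same kind of simulation sketch (with arbitrary choices \(\mu=1\), \(\sigma=2\), \(n=5\)) shows the standardized sample mean following a \(t_{n-1}\) distribution:

# Simulate T = (xbar - mu) / (s / sqrt(n)) and compare with the t(n-1) density.
set.seed(1)
mu <- 1; sigma <- 2; n <- 5
tstat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
hist(tstat, breaks = 100, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = n - 1), add = TRUE, col = "red")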

Note

The normal distribution and the t-distribution are both bell-shaped and symmetric. When the sample size is large (typically \(n > 30\)), the t-distribution becomes very similar to the standard normal distribution, so the normal approximation is usually acceptable. When the sample size is small, however, the t-distribution has noticeably heavier tails, and it is better to use the t-distribution directly for inference.
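A quick plot makes the comparison visible (the choice of \(3\) degrees of freedom is just for illustration):

# Standard normal density versus t density with 3 degrees of freedom.
x <- seq(-4, 4, length = 200)
plot(x, dnorm(x), type = "l", ylab = "density")
lines(x, dt(x, df = 3), col = "red")   # heavier tails than the normal
legend("topright", legend = c("N(0,1)", "t(3)"), col = c("black", "red"), lty = 1)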

2.5 Miscellaneous

Theorem 2.5 (Empirical Rule (Normal Data)) If a random variable \(X\) is approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then:

  • About \(68\%\) of observations lie within \(1\sigma\) of the mean: \[ \Pr\qty(\mu-\sigma\leq X\leq\mu+\sigma)\approx 0.68. \]

  • About \(95\%\) of observations lie within \(2\sigma\) of the mean: \[ \Pr\qty(\mu-2\sigma\leq X\leq\mu+2\sigma)\approx 0.95. \]

  • About \(99.7\%\) of observations lie within \(3\sigma\) of the mean: \[ \Pr\qty(\mu-3\sigma\leq X\leq\mu+3\sigma)\approx 0.997. \]

This is commonly known as the 68–95–99.7 rule.
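These percentages can be checked directly with pnorm; since the rule is stated in units of \(\sigma\), the standard normal suffices.

# Exact standard normal probabilities behind the 68-95-99.7 rule.
pnorm(1) - pnorm(-1)   # 0.6826895
pnorm(2) - pnorm(-2)   # 0.9544997
pnorm(3) - pnorm(-3)   # 0.9973002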

Remark 2.1 (Precision of the 95% Rule). The statement that “about \(95\%\) of observations lie within \(2\sigma\) of the mean” is an approximation.

If \[ X\sim N(\mu,\sigma^2), \] then the exact probability statement is \[ \Pr\qty(\mu-1.96\sigma\leq X\leq\mu+1.96\sigma)=0.95. \]

Thus, for normally distributed data:

  • The exact 95% cutoff is \(\pm 1.96\sigma\)
  • The value \(\pm 2\sigma\) is a convenient rounding
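The exact cutoff and the coverage of the rounded \(2\sigma\) rule can both be read off from R:

# Exact 95% cutoff and the coverage of the rounded 2-sigma interval.
qnorm(0.975)                   # 1.959964
pnorm(1.96) - pnorm(-1.96)     # 0.9500042
pnorm(2) - pnorm(-2)           # 0.9544997, the coverage of +/- 2 sigma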

Remark 2.2 (Why \(\boldsymbol{2\sigma}\) Is Still Used). The cutoff \(\pm 2\sigma\) is widely used because:

  1. It is easy to remember and communicate.
  2. The difference from \(1.96\sigma\) is small: \[ \Pr\qty(|Z|\leq 2)\approx 0.9545. \]
  3. It provides a useful mental model of variability.

In contrast, the value \(1.96\) arises naturally in formal statistical inference, such as confidence intervals and hypothesis testing.

Remark 2.3 (Rule-of-Thumb vs. Exact Normal Theory).

  • The 68–95–99.7 rule is a descriptive heuristic.
  • The values \(1.645\), \(1.96\), and \(2.576\) come from exact normal quantiles.

Understanding this distinction is essential for:

  • Confidence intervals
  • Hypothesis testing
  • Interpretation of statistical uncertainty