Generating Functions

The Probability Generating function of the non negative random variable, \( X \), where \( X \) has values in \( \mathbb{Z} \) is defined to be
\begin{align}
G(s) = \mathbb{E}[s^X] = \sum_k s^k \mathbb{P}(X=k) = \sum_k s^k f(k)
\end{align}

\( G(0) = \mathbb{P}(X=0) \) and \( G(1) = 1 \). Clearly \( G(s) \) converges for \( |s| < 1 \)

Generating functions are very useful to study the sums of independent random variables and to calculate the moments of a distribution.

Examples.
Constant Variables
If \( \mathbb{P}\left\{X = c\right\} = 1 \) then \( G(s) = \mathbb{E}[s^X] = s^c \)
Bernouilli variables
If \( \mathbb{P}\left\{X=1\right\} = p \) and \( \mathbb{P} \left\{X=0\right\} = 1-p \) then \( G(s) = \mathbb{E}[s^X] = 1-p + ps \)
Binomial variable
\( G(s) = \mathbb{E}[s^X] = (q + ps)^n \)
Poisson distribution
If \( X \) is a Poisson distribution with parameter \( \lambda \) then \begin{align} G(s) = \mathbb{E}(s^X) = \sum_{k=0}^\infty s^k \frac{\lambda^k}{k!}e^{-\lambda} = \sum_{k=0}^\infty \frac{(s\lambda)^k}{k!} e^{-\lambda} = e^{\lambda(s – 1)} \end{align}

The following Result show how Generating functions can be used to calculate the moments of a distribution

\( \frac{dG(1)}{ds} = \mathbb{E}(X) \) and in general \begin{align} \frac{dG^k(1)}{ds^k} = \mathbb{E}[X(X-1)…(X-k+1)] \end{align} .
for \( k > 1 \)
To see this take the kth derivative
\begin{align} \frac{dG^k(s)}{ds^k} & = \sum_{n=k}^{\infty} n(n-1)…(n-k+1)s^{n-k} \mathbb{P}(X=n) \\ &=\mathbb{E}[s^{X-k}X(X-1)…(X-k+1)] \\ \end{align} and evaluate at \( s=1 \).

Technical note: actually you need to let \( s \uparrow 1 \) and then apply Abel’s theorem.

Abel’s Theorem.
Let \( G(s) = \sum_{i=0}^{i=\infty} c_i s^i \) where \( c_i > 0 \) and \(G(s) < \infty \) for \( |s| \lt 1 \) then \( \lim_{s\uparrow 1}G(s) = G(1) \)

A closely related generating function is called the moment generating function.

The Moment Generating function of the non negative random variable, \( X \) is defined to be
\begin{align}
M(t) = \mathbb{E}[e^{tX}] = \sum_k e^{tk} \mathbb{P}(X=k) = \sum_k e^{tk} f(k)
\end{align}

Note that \( M(t) = G(e^t) \)

Using the moment generating function as the name implies is an easier way to get the moments

Taking the nth derivative of the moment generating function of \( X \) and putting \( t= 0 \) gives the nth moment of \( X \) i.e. \( \frac{d M^n(0)}{dt^n} =\mathbb{E}[X^n] \)
Proof.
\begin{align} M(t) &= \sum_{k=0}^\infty e^{tk} \mathbb{P}(X=k) \\ & = \sum_{k=0}^\infty \left(\sum_{n=0}^\infty \frac{(tk)^n}{n!}\right) \mathbb{P}(X = k) \\ & = \sum_{n=0}^\infty \frac{t^n}{n!} \sum_{k=0}^\infty k^n \mathbb{P}(X=k)\\ & = \sum_{n=0}^\infty \frac{t^n}{n!} \mathbb{E}[X^n]\\ \end{align}

Moment generating functions can be defined for more general random variables, not necessarily discrete nor positive.

The Lognormal Distribution

If \( Y \) is a log-normal random variable then \( \ln Y \) is normally distributed.
\( Y = e^{X} \) where \( X \sim N(\mu , \sigma^2) \)
We write \( Y \sim \ln N(\mu, \sigma^2) \)

\begin{align}
P(Y < y) = P(X < \ln y ) = \int_{-\infty}^{\ln y} p(x) dx
\end{align}
Recall
\begin{align}
\frac{d}{dy} \int^{f(y)}_{g(y)} p(x) dx = p(f(y))f^{‘}(y) – p(g(y))g^{‘}(y)
\end{align}
so that
\begin{align}
\frac{d}{dy} P(Y < y) = p(\ln y) \frac{1}{y} = \tilde{p} (y)
\end{align}

where \( \tilde{p} (y) \) is the probability density function for \( Y \)

Mean
\begin{align}
\mathbb{E}[e^X] & =\frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty} e^{x} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\
& = \frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty}
e^{-\frac{\left(x^2 – 2\mu x + \mu^2 – 2\sigma^2 x\right)}{2\sigma^2}} dx \\
&= \frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty}
e^{-\frac{\left( x-(\mu + \sigma^2)\right)^2}{2 \sigma^2}}e^{\frac{\sigma^4+2\mu \sigma^2}{2 \sigma^2}} dx\\
&= e^{\frac{\sigma^4+2\mu \sigma^2}{2 \sigma^2}}\\
&= e^{\frac{\sigma^2+2\mu}{2}}
\end{align}
Variance:
\begin{align}
\mathbb{E}[Y^2] & = \mathbb{E}[e^{2X}] \\
& = \frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty} e^{2x} e^{-\frac{\left(x-\mu\right)^2}{2\sigma^2}} dx \\
& = \frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{\left(x^2-2(\mu+2\sigma^2)x+\mu^2\right)}{2\sigma^2}} dx \\
& = \frac{1}{\sqrt{2\pi \sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{\left(x-(\mu+2\sigma^2)\right)^2}{2\sigma^2}} e^{\frac{4\sigma^4+4\sigma^2\mu}{2\sigma^2}}dx \\
= e^{2\sigma^2+2\mu}
\end{align}

Alternatively we know \( 2X \sim N(2\mu,4\sigma^2) \) so from earlier \(
\mathbb{E}[e^{2X}] = e^{2\mu +2\sigma^2} \)
The variance is therefore
\begin{align}
\mathbb{E}[Y^2]-\mathbb{E}[Y]^2 &= e^{2\sigma^2+2\mu} – e^{\sigma^2+2\mu}\\
&=e^{2\mu}\left(e^{2\sigma^2}-e^{\sigma^2}\right)
\end{align}

The Normal Distribution

The probability density of a normal random variable with mean \( \mu \) and variance \( \sigma^2 \) is \begin{align} p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \end{align}

One way to show that the probability density integrates to one is to consider the following integral and switch to polar co-oridinates.
\begin{align}
&\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\int_{-\infty}^{\infty}e^{-\frac{y^2}{2}} dx dy \\
&\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{x^2 + y^2}{2}} dx dy
\end{align}
switching to polar co-ordinates \( x = r \cos \theta \), \( y = r \sin \theta \)
We calculate the Jacobian to change integration variables.
\begin{align}
&\left|
\frac{\partial (x,y)}{\partial (r,\theta)}
\right | =
\left |
\begin{array}{cc}
\frac{\partial x}{\partial r}& \frac{\partial y}{\partial r}\\
\frac{\partial x}{\partial \theta} & \frac{\partial y}{\partial \theta}\\
\end{array}
\right | \\
&=\left |
\begin{array}{cc}
\cos \theta & \sin \theta\\
-r \sin \theta & r \cos \theta \\
\end{array}
\right | = r \cos^2 \theta + r \sin^2 \theta = r
\end{align}

\begin{align}
&\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{x^2 + y^2}{2}} dx dy \\
= &\int_{0}^{\infty} \int_{0}^{2 \pi} \left|
\frac{\partial (x,y)}{\partial (r,\theta)}
\right | e^{-\frac{r^2}{2}} dr d\theta \\
= &\int_{0}^{\infty} \int_{0}^{2 \pi} r e^{-\frac{r^2}{2}} dr d\theta \\
& = 2 \pi [-e^{-\frac{r^2}{2}}]^{\infty}_0 = 2 \pi
\end{align}
so that
\begin{align}
\int_{-\infty}^{\infty} e^{-\frac{x^2}{2}} = \sqrt{2 \pi}
\end{align}
i.e.
\begin{align}
\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}} dx = 1
\end{align}

Moments of a Normal Distribution.
\begin{align}
\mathbb{E}[X] = & \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}} dx \\
=& \frac{1}{\sqrt{2 \pi}} \left[-e^{-\frac{x^2}{2}}\right]^{\infty}_{-\infty} = 0
\end{align}
This is also obvious by symmetry. \( p(x) \) is a symmetric function i.e. \( p(x) = p(-x) \) so multiplying p(x) by an odd function \( f(x) \) i.e. \( f(x) = -f(-x) \) and calculating \( \int_{-\infty}^{\infty} p(x) f(x) dx = 0 \).
Since \( \int_{-\infty}^{0} p(x) f(x) dx = -\int^{\infty}_{0} p(x) f(x) dx \). Substitute \( y = -x \) to see this.
\begin{align}
\mathbb{E}[X^2] = &\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} x^2 e^{-\frac{x^2}{2}} dx \\
&=\frac{1}{\sqrt{2 \pi}} \left ( [- x e^{-\frac{x^2}{2}}]^{\infty}_{-\infty} + \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}} dx \right) = 1
\end{align}

Joint Density Function

For a univariate random variable we have \begin{align} F(x) = P(X< x) = \int_{-\infty}^{\infty} p(x) dx \end{align} and \( \frac{dF}{dx} = p(x) \) the probability density function.

We can extend this definition to more than one random variable \begin{align} F(x,y) = \int_{-\infty}^y \int_{-\infty}^x p(x,y) dx y \end{align} \( F(x,y) = P(X < x, Y < y) \) \( \frac{\partial F}{\partial x \partial y} = p(x,y) \) the probability density function. If we can write \( p(x,y) = f(x) g(y) \) then we say \( X \) and \( Y \) are independent.

Continuous Random Variables

If the random variable \( X \) can take values in an interval \( [a,b] \) then it is a continuous random variable. To describe the probabilities associated with \( X \) we use its probability density function.

$$ P(X \in \delta x) = p(x) \delta x $$

The probability \( X \in A = \int_{A} p(x) dx \)
\begin{align}
\int_{\mathbb{R}} p(x) dx = 1 \\
\text{i.e.} \int_{-\infty}^{\infty} p(x) dx = 1
\end{align}

$$P(X \in [a,b]) = \int_{a}^{b} p(x) dx $$

$$F(a) = P(X < a) = \int_{-\infty}^{a} p(x) dx $$ \( F(a) \) is the cumulative density function \( \frac{dF}{da} = p(a) \), so differentiating the cumulative distribution function gives the probability density function.

Example 1.
The Uniform Distribution
$$ P(X \in [a,b]) = b-a \; 0 \leq a \leq b\leq 1 $$
\begin{align} p(x) = \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{otherwise}\\ \end{cases} \end{align} \begin{align} F(a) = P(X < a) = \int_0^{a} dx = a \end{align} \begin{align} \mathbb{E}[X] = \int_{-\infty}^{\infty} x p(x) dx = [\frac{x^2}{2}]^1_0 = \frac{1}{2}
\end{align} \begin{align} \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 p(x) dx = [\frac{x^3}{3}]^1_0 = \frac{1}{3} \end{align} \begin{align} \text{Var}[X] = \mathbb{E}[X^2] – \mathbb{E}[X]^2 = \frac{1}{3} – (\frac{1}{2})^2 = \frac{1}{12}
\end{align}

Poisson Process

We can derive the poisson process for which the number of successes in a finite interval is a poisson distribution.

Assumptions.

1. The probability of success happening in a time interval \( \Delta t \), \( p(1,\Delta t) \), is \( \lambda \Delta t \) for small \( \Delta t \) and \( \lambda \in \mathbb{R}^+ \).

2. The probability of \( \gt 1 \) success happening in an interval \( \Delta t \) is negligible so that \( p(0,\Delta t) + p(1, \Delta t) = 1 \)

3. The number of successes in one interval is independent of the number of successes in another.

Derivation.
The probability of no ‘successes’ until time \( t \), denoted by \( P(t) \), is equal to the probability that no ‘successes’ occur until \( t – \Delta t \) followed by no successes in the interval \( \Delta t \)
\begin{align} P(t) &= P(t – \Delta t) p(0, \Delta t) \\ &= P(t – \Delta t) ( 1 – \lambda \Delta t) \end{align} so
\begin{align}
\frac{P(t) – P(t – \Delta t)}{\Delta t} = – \lambda P(t -\Delta t)
\end{align}
Let \( \Delta t \rightarrow 0 \) gives the differential equation
\begin{align}
\frac{d P}{dt} = -\lambda P
\end{align}
The solution to this is
\( P(t) = P(0) e^{-\lambda t} \) and \( P(0) = 1 \) since \( P(t) \) is the probability of no successes until time t and \( P(0) = \lim_{t \rightarrow 0}p(0, \Delta t) = 1 \)
Consider the probability k ‘successes’ occur.
\( P(k,t+\Delta t) = P(k,t)p(0,\Delta t) + P(k-1,t)p(1,\Delta t) \)
\begin{align}
P(k,t + \Delta t) = P(k,t)(1- \lambda \Delta t) + P(k-1,t) \lambda \Delta t
\end{align}
which implies
\begin{align}
\frac{P(k,t+\Delta t) -P(k,t)}{\Delta t} = – \lambda P(k,t) + \lambda P(k-1,t)
\end{align}
Let \( \Delta t \rightarrow 0 \) yields
\begin{align}
\frac{dP(k)}{dt} = – \lambda P(k,t) + \lambda P(k-1,t)
\end{align}
multiplying by the integrating factor \( e^{\lambda t} \) and integrating
\begin{align}
e^{\lambda t}P(k) = \int_0^t \lambda e^{\lambda s} P(k-1) ds + C
\end{align}
\( C = 0\) since \(P(k,0) = 0 \)
from earlier we know \( P(0,s) = e^{-\lambda s} \) so
\begin{align}
e^{\lambda t}P(1) = \int_0^t \lambda ds = \lambda t
\end{align}
and
\begin{align}
e^{\lambda t}P(2) = \int_0^t \lambda^2 s ds = \frac{(\lambda t)^2}{2}
\end{align}
and in general
\begin{align}
e^{\lambda t}P(k) = \int_0^t \lambda^k \frac{s^{k-1}}{(k-1)!} ds = \frac{(\lambda t)^k}{k!}
\end{align}
so that we get the final result
\begin{align}
P(k) = e^{-\lambda t}\frac{(\lambda t)^k}{k!}
\end{align}

Poisson Distribution

\begin{align}
\end{align}

Motivation

Suppose \( X \) follows a binomial distribution \( X \equiv bin(n,p) \) with n large say, 10,000 , and its mean a reasonable value of around 3,4.
\begin{align}
P(X = k) &= \binom{n}{p} p^k (1-p)^{n-k} \\
& = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}
\end{align}
Let \( \frac{\lambda}{n} = p \)

Making the following two approximations, for large n
$$ \binom{n}{k} k! \approx n^k $$ and $$ e^{-\lambda} \approx (1-\frac{\lambda}{n})^n$$

\begin{align}
P(X=k) = \frac{\lambda^k}{k!}e^{-\lambda}
\end{align}

Let’s check the probability sums to 1.
\begin{align}
&\sum_{k=0}^{\infty} P(X=k) = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}e^{-\lambda} \\
&= e^{-\lambda}\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} .e^{\lambda} = 1
\end{align}

A random variable which satisfied \( P(X=k) = \frac{\lambda^k}{k!}e^{-\lambda}\) is known as a Poisson random variable. \begin{align} \mathbb{E}[X] &= \sum_{k=0}^{\infty} k \frac{\lambda^k e^{-\lambda}}{k!} = \sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda}}{(k-1)!}\\ &\lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1} e^{-\lambda}}{(k-1)!} = \lambda \end{align} \begin{align} \mathbb{E}[X^2]& = \sum_{k=0}^{\infty} k^2 \frac{\lambda^k e^{-\lambda}} {k!} \\ &=\sum_{k=1}^{\infty} (k-1)\frac{\lambda^k e^{-\lambda}} {(k-1)!} + \sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda}} {(k-1)!} \\ &= \lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2} e^{-\lambda}} {(k-2)!} + \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1} e^{-\lambda}} {(k-1)!} \\ = \lambda^2 + \lambda \end{align} so \( \text{Var}[X] = \lambda^2 + \lambda -\lambda^2 = \lambda \)

Some Examples of modelling with a Poisson distribution.
1. Number of busses arriving in one hour.
2. Number of defective products produced in a factory in a day.

An experiment which can only have two outcomes is a Bernouilli trial.

Example 1.
X is a random variable which takes the value 1 with probability \( p \) and 0 with probability \( q = 1-p \). Calculate mean and variance. \begin{align} \mathbb{E}[X] = 1. p + 0. (1-p) = p \\ \mathbb{E}[X^2] = 1^2. p + 0. (1-p) = p \\ \end{align} so \( \text{Var}(X) = p -p^2 = p(1-p) = pq \)

If we have a sequence of Bernouilli trials (with the same probability of success \( p \), the number of successes is a Binomial random variable.

Binomial random variable.
\( ( X_1, X_2,…,X_n ) \) are a sequence of independent identically distributed Bernouilli trials then
\( Y = X_1+ X_2+…+X_n \) is a binomial random variable.
We write \( Y = \text{bin}(n,p). Y \) is the number of successs in \( n \) trials. We know the number of ways to choose k successes from n trials is \( \binom{n}{k} \) and the probability of k successes in a particular order is \( p^k (1-p)^{n-k} \) so
$$ P(Y = k) = \binom{n}{p} p^k (1-p)^{n-k} $$

Example 3.
Calculate the mean and variance of Y.
\begin{align}
\mathbb{E}[Y] &= \mathbb{E}[X_1 + X_2 +…+ X_n]] \\
&= n\mathbb{E}[X] = np
\end{align}
\begin{align}
\mathbb{E}[Y^2] &= \mathbb{E}[(X_1+X_2+…+X_n)^2] \\
&=\mathbb{E}[\sum_{i=1}^{i=n} X_i^2 + 2\sum_{i \leq j} X_i X_j] \\ &=\sum_{i=1}^{i=n}\mathbb{E}[X_i^2] + 2 \sum_{i \leq j} \mathbb{E}[X_i X_j] \; \text{using linearity} \\
&= np + 2 \sum_{ i \leq j} p^2 \; \text{since } X_i \text{ and } X_j \; i\neq j \text{ are independent} \\
&= np + n (n-1) p^2 \\
\end{align}
so $$\text{Var}(Y) = np +n(n-1)p^2 -n^2p^2 = np(1-p)=npq $$
Notice this is \(n\text{Var}(X)\) which is what we expect!

Example 4.
Alternative Derivation.
\begin{align}
\mathbb{E}[Y] &= \sum_{k=0}^{k=n} \binom{n}{k} p^k q^{n-k} k \\
&= \sum_{k=0}^{k=n} \binom{n}{k} (\frac{p}{q})^k q^{n} k
\end{align}
Recall binomial theorem
\begin{align}
(1+x)^n = \sum_{k=0}^{k=n} \binom{n}{k} x^k \\
\frac{d }{dx} (1+x)^n = n (1+x)^{n-1} = \sum_{k=0}^{k=n} \binom{n}{k}k x^{k-1}
\end{align}
Let \( x = \frac{p}{q} \) so
\begin{align}
\mathbb{E}[X] &= n q^n\frac{p}{q}(1+\frac{p}{q})^{n-1} \\
&= np (q+p)^{n-1} = np
\end{align}
Similarly
\begin{align}
\frac{d}{dx^2} (1+x)^n = n (n-1) (1+x)^{n-2} = \sum_{k=0}^{k=n} \binom{n}{k}k(k-1)x^{k-2}
\end{align}
so that
\begin{align}
n (n-1) (1+x)^{n-2} x^2 q^n =\sum_{k=0}^{k=n} \binom{n}{k} k^2 x^k q^n -\sum_{k=0}^{k=n} \binom{n}{k} k x^k q^n
\end{align}
and again with \( x = \frac{p}{q} \)
\begin{align}
&\sum_{k=0}^{k=n} \binom{n}{k} k^2 x^k q^n -\sum_{k=0}^{k=n} \binom{n}{k} k x^k q^n \\
=&\sum_{k=0}^{k=n} \binom{n}{k} k^2 (\frac{p}{q})^k q^n -\sum_{k=0}^{k=n} \binom{n}{k} k (\frac{p}{q})^k q^n \\
= &\mathbb{E}[X^2] – \mathbb{E}[X]
\end{align}
so
\begin{align}
n (n-1) (1+\frac{p}{q})^{n-2} (\frac{p}{q})^2 q^n = \mathbb{E}[X^2] – \mathbb{E}[X]
\end{align}
Simplifying
\begin{align}
n(n-1) p^2 = \mathbb{E}[X^2] – \mathbb{E}[X]
\end{align}
\begin{align}
\text{Var}(X) =& \mathbb{E}[X^2] – \mathbb{E}^2[X] \\
&= n(n-1) p^2 + np – n^2p^2 \\
&= np(1-p) = npq
\end{align}

Indicator Functions

The random variable \( \mathbb{1}_A \) takes the value 1 if \( \omega \in A \) and 0 otherwise.
\(\mathbb{E}[\mathbb{1}_A] = P(A)\)

Indicator functions are very useful for computations as the next example will show.

Example 1.
Consider the hat problem we solved using the inclusion exclusion principle. Let \( X \) be the random variable, the number of people who get back their own hat. \( X = 1_{A_1} + 1_{A_2}+…+1_{A_n} \) We can calculate the mean and variance of \(X\) using linearity of the expectation operator. \begin{align} \mathbb{E}[X] = \sum_i^n \mathbb{E}[1_{A_i}] = \sum_{i=1}^n P(A_i) = 1 \end{align} \begin{align} X^2 = \sum_i^n 1_{A_i}^2 + 2\sum_{i < j} 1_{A_i} 1_{A_j} \end{align} so that \begin{align} \mathbb{E}[X^2] &= \sum_i^n P(A_i) + 2\sum_{i < j} P(A_i \cap A_j) \\ &= 1 + 2 \frac{n (n-1)}{2} \frac{1}{n}\frac{1}{n-1} \\ &=2 \end{align} so \(\text{Var}(X) = 1 \)

Inclusion Exclusion Principle

Consider 2 finite sets, \( A \) and \(B\). To calculate the number of items in both sets, we can calculate this by noticing that given that the items in the intersection belong to both A and B then if we add the items in A to the items in B we have double counted the numbers in the intersection so we just need to subtract it.


The diagram shows the (unique) elements in each set represented by blue dots, e.g. there are 7 elements in \( A \) as it contains 7 blue dots. \begin{align}
|A \cup B| = |A| + |B| – |A \cap B|
\end{align}
To see this notice that if an element only belongs to \( A \) then it appears only once in the LHS and RHS, similarly for an element which only belongs to B. If an element belongs to \( A \cap B \) then it appears once in the LHS and for the RHS it appears in \(|A| \), \( |B| \) and subtracted in the \(|A \cap B|\) term (i.e. 1+1-1) times.

A similar result applies to probability and follows from the fundamental properties of probability measure which we look at in more detail in the Stochastic Calculus course

Show that $$P(A \cup B) = P(A)+P(B)-P(A \cap B)$$
Proof.
Firstly we split \( A \) and \( B \) into disjoint sets
\begin{align}
A &= A\backslash B + A \cap B \\
\implies P(A) &= P(A\backslash B) + P(A \cap B) \\
\end{align}
\begin{align}
B &= B\backslash A + A \cap B \\
\implies P(B) &= P(B\backslash A) + P(A \cap B) \\
\end{align}
Also we can split \( A \cup B \) into 3 disjoint sets,
\begin{align}
A \cup B &= A\backslash B + A \cap B + B\backslash A \\
\implies P(A \cup B) &= P(A\backslash B) + P(A \cap B) + P(B\backslash A) \\
\end{align}
From this follows the result.

Now consider adding one more set so we have \( A, B, C \). \begin{align} &|A \cup B \cup C| =\\ &|A| + |B| + |C| – \\ &|A \cap B| – |A \cap C| – |B \cap C| +\\ &|A \cap B \cap C| \end{align} This works as follows. If an item is only in one set, A say, and no other then it is counted exactly once. If an item is in exactly two sets and no other e.g. A and B, then it is counted once in the \( |A| \) term, once in the \( |B| \) and subtracted in the \( |A \cap B |\) term i.e. it is also counted once. Finally if a term is in all three sets then it is counted once in \( |A| \), once in \( |B| \) and once in \( |C| \) ( three times) and then it is subtracted in each of the intersection terms \(|A \cap B|\), \(|A \cap C|\) and \(|B \cap C|\) (three times), so these cancel out, and it is present once in the triple intersection \( |A \cap B \cap C| \)



In general \begin{align*} &P(\cup_{i=1}^{i=n} A_i)= \\ &\sum_{i=1}^{i=n} P(A_i) – \sum_{i_1 < i_2} P(A_{i_1} \cap A_{i_2})\\ &+…+ (-1)^{r+1} \sum_{i_1 <…< i_r} P( A_{i_1} \cap … \cap A_{i_r})\\ &+… + (-1)^{n+1} P( A_{1} \cap … \cap A_{n}) \end{align*}
Proof.
with proceed by using proof by induction. We know the result holds for n = 2 which we showed above. Lets assume that the result holds for all \( n \leq N \) and we will deduce it is true for \( N+1 \). \begin{align} P(\cup_{i=1}^{i=N+1} A_i)&= P(A_{N+1}\backslash \cup_{i=1}^{i=N} A_i) + P(\cup_{i=1}^{i=N} A_i) \text{(disjoint sets)}\\\\ P(A_{N+1}\backslash \cup_{i=1}^{i=N} A_i) &= P(A_{N+1}\backslash \cup_{i=1}^{i=N} A_i \cap A_{N+1}) \\ & = P(A_{N+1}) – P(\cup_{i=1}^{i=N} A_i \cap A_{N+1}) \\\\ \end{align} so that

\( P(\cup_{i=1}^{i=N+1} A_i)= P(A_{N+1}) – P(\cup_{i=1}^{i=N} A_i \cap A_{N+1}) + P(\cup_{i=1}^{i=N} A_i) \)
Expanding the RHS using the expansion (which we assumed true for \( n \leq N \) although a bit tedious the result follows.

Example 1.
This is a classic hat problem which is solved using the inclusion exclusion principle.
n people toss their hats into a bin. You randomly return one hat to each person. Find the probability that nobody gets their own hat back.
Let \( A_i \) be the event that ith person gets his own hat back so we are trying to calculate \( 1 – P(\cup A_i) \)
\begin{align} P(\cup A_i) &= \sum P(A_i) – \sum_{i < j} P(A_i \cup A_j)+…+P(A_1 \cup …\cup A_n) \\ & = n.\frac{1}{n} – \frac{1}{n}\frac{1}{n-1}\binom{n}{2}+\frac{1}{n}\frac{1}{n-1}\frac{1}{n-3}\binom{n}{3}\\ &+…-(-1)^{n}\frac{1}{n!}\\ &=1-\frac{1}{2!}+\frac{1}{3!}-…-(-1)^{n}\frac{1}{n!} \\ & \approx \exp(-1) \text{(taylor expansion of } e^x \text{ about } x=0) \end{align} so \( 1 – P(\cup A_i) \approx 1 – \exp(-1) \)