Bayes Theorem - The Basic Math | Machine/Deep Learning Blog

Introduction

This article is meant to setup the mathematical foundation of Bayesian statistics. Step-wise modeling will be explained in another article.

Set Notation

For discrete sets $A$ and $B$

\[P(A|B) = \frac{P(A \cap B)}{P(B)} \tag{1}\]

For binary A this simplifies to

\[P(A|B) = P(A \cap B)/(P(B|A)*P(A) + P(B|A)*P(A))\] \[P(B|A) = \frac{P(A \cap B)}{P(A)} \implies P(A \cap B) = P(B|A) * P(A) \tag{2}\]

Substituting $2$ in $1$ we get

\[P(A|B) = \frac{P(B|A) * P(A)}{P(B|A) * P(A) + P(B|A') * P(A')}\]

Generalizing this to categorial $A$ with more than two levels:

\[P(A_j|B) = \frac{P(B|A_j) * P(A_j)}{\sum_{\forall i}P(B|A_i) * P(A_i)}\]

Continuous $A$

Continuous form of Bayes theorem in the form of densities is given by:

\[P(A|B) = \frac{P(B|a) * f(a)}{\int P(B|a) * f(a) da}\]

General Note

In commonly used statistical modeling methods such as GLM, we stop with $P(B|A)$ as given by the data — this is the likelihood function. Bayesian modeling allows us to introduce prior beliefs about A into the system either through probability mass or through probability density function.

Posterior Predictive Distribution

Definition

A posterior predictive distribution is the distribution of unobserved values conditioned on observed values.

Mathematical Form

Unobserved Parameter

\[\pi(\theta|y) = \frac{f_{Y|\theta}(y|\theta) \pi(\theta)}{\int_{\Theta}f_{Y|\theta}(y|\theta) \pi(\theta)} \propto f_{Y|\theta}(y|\theta)\pi(\theta)\]

Unobserved Random Variable

\[f(y_2|\theta,y_1) = \int f(y_2|\theta,y_1)f(\theta|y_1)d\theta\]

If $y_2$ and $y_1$ are independent, this simplifies to $f(y_2|\theta,y_1) = \int f(y_2|\theta)f(\theta|y_1)d\theta$

Closing Notes

It is not always possible to obtain an analytical solution for the posterior predictive distribution.
In most practical cases where an analytical solution exists for the posterior predictive distribution, the denominator term is either equal to 1 or does not play a role in determining the type of posterior predictive distribution (this is not always true). Hence only the numerator is retained for further analysis.

Introduction to Bayesian Statistics - Part 1

Introduction to Bayesian Statistics - Part 2