Bayes Cheatsheet
Express the joint probability of a set of random variables in terms of conditional probabilities
Kumar Shantanu | 2024-07-23

Marginalisation

Marginalization in conditional probability involves summing over the possible values of some variables to obtain the probability distribution of a subset of the variables. This is useful when we are interested in the probability distribution of a subset of variables, without considering the others.

Definition and Formula

Given a joint probability distribution $P(X, Y)$, marginalization can be used to find the marginal probability distribution of $X$ by summing over all possible values of $Y$:

$$P(X) = \sum_{y} P(X, Y = y)$$

In the context of conditional probability, marginalization can help us find the marginal conditional probability. For instance, given the joint distribution $P(X, Y, Z)$, the conditional probability $P(X \mid Z)$ can be obtained by marginalizing over $Y$:

$$P(X \mid Z) = \sum_{y} P(X, Y = y \mid Z)$$

Alternatively, if you have the conditional probabilities $P(X \mid Y, Z)$ and $P(Y \mid Z)$, you can combine them and marginalize over $Y$ to find $P(X \mid Z)$:

$$P(X \mid Z) = \sum_{y} P(X \mid Y = y, Z) \, P(Y = y \mid Z)$$
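As a minimal numerical sketch (assuming NumPy, with a hypothetical joint stored as a 2-D array whose rows index $X$ and columns index $Y$), marginalization is just a sum along the $Y$ axis:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1, 2}.
# Rows index X, columns index Y; all entries together sum to 1.
P_XY = np.array([
    [0.10, 0.15, 0.05],
    [0.30, 0.25, 0.15],
])

# Marginalization: P(X) = sum_y P(X, Y = y), i.e. a sum over the Y axis.
P_X = P_XY.sum(axis=1)

print(P_X)  # marginal distribution of X: [0.3, 0.7]
```

The same idea extends to any number of variables: summing out an axis of the joint array removes that variable from the distribution.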

Example

Consider three variables: $A$, $B$, and $C$. Suppose we have the joint probability distribution $P(A, B, C)$, and we want to find the marginal conditional probability $P(A \mid C)$.

  1. Joint Distribution: $P(A, B, C)$

  2. Marginalize over $B$ to find $P(A, C)$:

$$P(A, C) = \sum_{b} P(A, B = b, C)$$

  3. Normalize to get $P(A \mid C)$:

$$P(A \mid C) = \frac{P(A, C)}{P(C)}$$

Alternatively, using conditional probabilities directly, we can marginalize over $B$:

$$P(A \mid C) = \sum_{b} P(A \mid B = b, C) \, P(B = b \mid C)$$
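Both routes can be checked numerically. The sketch below (assuming NumPy and a hypothetical random joint over three binary variables) computes $P(A \mid C)$ by marginalizing the joint and by combining conditional factors, and confirms the two agree:

```python
import numpy as np

# Hypothetical joint P(A, B, C) over three binary variables (axes: A, B, C).
rng = np.random.default_rng(0)
P_ABC = rng.random((2, 2, 2))
P_ABC /= P_ABC.sum()  # normalize so it is a valid joint distribution

# Route 1: marginalize over B, then normalize by P(C).
P_AC = P_ABC.sum(axis=1)   # P(A, C) = sum_b P(A, B=b, C)
P_C = P_AC.sum(axis=0)     # P(C)
P_A_given_C = P_AC / P_C   # P(A | C); each column sums to 1

# Route 2: P(A | C) = sum_b P(A | B=b, C) P(B=b | C).
P_BC = P_ABC.sum(axis=0)               # P(B, C)
P_A_given_BC = P_ABC / P_BC            # P(A | B, C)
P_B_given_C = P_BC / P_BC.sum(axis=0)  # P(B | C)
P_A_given_C_v2 = np.einsum('abc,bc->ac', P_A_given_BC, P_B_given_C)

print(np.allclose(P_A_given_C, P_A_given_C_v2))  # True: both routes agree
```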

Chain Rule for Joint Distributions

The chain rule of probability, also known as the chain rule for joint distributions, is a method to express the joint probability of a set of random variables in terms of conditional probabilities.

For a set of $n$ random variables $X_1, X_2, \ldots, X_n$, the chain rule of probability states that the joint probability $P(X_1, X_2, \ldots, X_n)$ can be factored as:

$$P(X_1, X_2, \ldots, X_n) = P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, X_2, \ldots, X_{n-1})$$

In general form, this can be written as:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, \ldots, X_{i-1})$$

This formula allows you to break down the joint probability into a product of conditional probabilities, which can be easier to handle, especially when dealing with complex distributions or large numbers of variables.

Example

For three random variables $X, Y, Z$, the chain rule can be written as:

$$P(X, Y, Z) = P(X) \cdot P(Y \mid X) \cdot P(Z \mid X, Y)$$

Each term in this product represents a conditional probability, showing the dependency of each variable on the preceding ones.
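This factorization can be verified numerically. A minimal sketch (assuming NumPy and a hypothetical joint over three binary variables): recover each conditional factor from the joint, multiply the factors back together, and compare with the original joint:

```python
import numpy as np

# Hypothetical joint P(X, Y, Z) over three binary variables (axes: X, Y, Z).
rng = np.random.default_rng(1)
P_XYZ = rng.random((2, 2, 2))
P_XYZ /= P_XYZ.sum()

# Factors of the chain rule P(X, Y, Z) = P(X) P(Y | X) P(Z | X, Y):
P_X = P_XYZ.sum(axis=(1, 2))             # P(X)
P_XY = P_XYZ.sum(axis=2)                 # P(X, Y)
P_Y_given_X = P_XY / P_X[:, None]        # P(Y | X)
P_Z_given_XY = P_XYZ / P_XY[:, :, None]  # P(Z | X, Y)

# Multiplying the factors back together recovers the joint.
reconstructed = P_X[:, None, None] * P_Y_given_X[:, :, None] * P_Z_given_XY
print(np.allclose(reconstructed, P_XYZ))  # True
```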

Use in Bayesian Networks

In Bayesian networks, the chain rule is often used to represent the joint probability distribution over the network's variables by considering the network's structure (i.e., the dependencies among the variables).
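For illustration, consider a hypothetical chain-structured network $A \to B \to C$: there $C$ is conditionally independent of $A$ given $B$, so $P(C \mid A, B) = P(C \mid B)$ and the chain rule shortens to $P(A, B, C) = P(A) \, P(B \mid A) \, P(C \mid B)$. A minimal sketch (assuming NumPy, with made-up local factors) builds the joint from those factors alone:

```python
import numpy as np

# Hypothetical local factors of a chain-structured network A -> B -> C.
P_A = np.array([0.6, 0.4])                        # P(A)
P_B_given_A = np.array([[0.7, 0.3], [0.2, 0.8]])  # P(B | A); rows: a, cols: b
P_C_given_B = np.array([[0.9, 0.1], [0.5, 0.5]])  # P(C | B); rows: b, cols: c

# Joint built from the network's local factors:
# P(A, B, C) = P(A) * P(B | A) * P(C | B).
P_ABC = P_A[:, None, None] * P_B_given_A[:, :, None] * P_C_given_B[None, :, :]

print(P_ABC.sum())  # sums to 1: a valid joint distribution
```

Because each conditional only mentions a variable's parents, the network needs far fewer parameters than the full chain-rule factorization would.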

Factor Conditioning

Factor conditioning refers to the process of conditioning a probability distribution on a subset of its variables, resulting in a new distribution over the remaining variables.

Example with Joint Probability Distribution

Let's consider a simple example with three random variables: $A$, $B$, and $C$. Suppose we have the joint probability distribution $P(A, B, C)$.

Step-by-Step Factor Conditioning

  1. Original Joint Distribution:

    The joint distribution $P(A, B, C)$ can be factored into conditional probabilities:

$$P(A, B, C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A, B)$$
  2. Conditioning on a Variable: Suppose we want to condition on $A = a$. This means we are interested in the distribution of $B$ and $C$ given $A = a$.

  3. Conditional Distribution: The new distribution after conditioning on $A = a$ is:

    $$P(B, C \mid A = a)$$

    This can be computed using the original factors:

$$P(B, C \mid A = a) = \frac{P(A = a, B, C)}{P(A = a)}$$

Using the factorization of the joint distribution, we get:

$$P(B, C \mid A = a) = \frac{P(A = a) \cdot P(B \mid A = a) \cdot P(C \mid A = a, B)}{P(A = a)}$$

Simplifying, we get:

$$P(B, C \mid A = a) = P(B \mid A = a) \cdot P(C \mid A = a, B)$$
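This simplification can also be checked numerically. A minimal sketch (assuming NumPy and a hypothetical random joint over binary variables) conditions on $A = a$ via the direct definition and via the factored form, and confirms they match:

```python
import numpy as np

# Hypothetical joint P(A, B, C) over three binary variables (axes: A, B, C).
rng = np.random.default_rng(2)
P_ABC = rng.random((2, 2, 2))
P_ABC /= P_ABC.sum()

a = 1  # condition on A = a

# Direct definition: P(B, C | A=a) = P(A=a, B, C) / P(A=a).
P_a = P_ABC[a].sum()
P_BC_given_a = P_ABC[a] / P_a

# Via the factors: P(B | A=a) * P(C | A=a, B).
P_AB = P_ABC.sum(axis=2)                    # P(A, B)
P_B_given_a = P_AB[a] / P_AB[a].sum()       # P(B | A=a)
P_C_given_aB = P_ABC[a] / P_AB[a][:, None]  # P(C | A=a, B)
via_factors = P_B_given_a[:, None] * P_C_given_aB

print(np.allclose(P_BC_given_a, via_factors))  # True
```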