Chain rule (probability)
In probability theory, the chain rule (also called the general product rule) describes how to calculate the probability of the intersection of events that are not necessarily independent, or, respectively, the joint distribution of random variables, using conditional probabilities. This rule allows one to express a joint probability in terms of only conditional probabilities. The rule is notably used in the context of discrete stochastic processes and in applications, e.g. the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
Chain rule for events
Two events
For two events $A$ and $B$, the chain rule states that
$$\mathbb{P}(A \cap B) = \mathbb{P}(B \mid A)\,\mathbb{P}(A),$$
where $\mathbb{P}(B \mid A)$ denotes the conditional probability of $B$ given $A$.
Example
Urn A contains 1 black ball and 2 white balls, and urn B contains 1 black ball and 3 white balls. Suppose we pick an urn at random and then select a ball from that urn. Let event $A$ be choosing the first urn, i.e. $\mathbb{P}(A) = \mathbb{P}(\overline{A}) = 1/2$, where $\overline{A}$ is the complementary event of $A$. Let event $B$ be the event that a white ball is chosen. The probability of choosing a white ball, given that we have chosen the first urn, is
$$\mathbb{P}(B \mid A) = 2/3.$$
The intersection $A \cap B$ then describes choosing the first urn and a white ball from it. The probability can be calculated by the chain rule as follows:
$$\mathbb{P}(A \cap B) = \mathbb{P}(B \mid A)\,\mathbb{P}(A) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}.$$
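As a sanity check, here is a minimal Python Monte Carlo sketch of this urn experiment (the urn contents mirror the text; the function name, seed, and trial count are illustrative):

```python
import random

def estimate_joint(trials=100_000, seed=1):
    """Estimate P(A and B): pick an urn fairly, then draw one ball.

    Urn A holds 1 black and 2 white balls; urn B holds 1 black and
    3 white balls.  We count runs where urn A was chosen AND the
    drawn ball is white.
    """
    rng = random.Random(seed)
    urns = {"A": ["black", "white", "white"],
            "B": ["black", "white", "white", "white"]}
    hits = 0
    for _ in range(trials):
        urn = rng.choice(["A", "B"])   # P(A) = 1/2
        ball = rng.choice(urns[urn])   # P(white | A) = 2/3
        if urn == "A" and ball == "white":
            hits += 1
    return hits / trials

print(estimate_joint())  # close to 1/3 = P(B | A) * P(A)
```

The estimate converges to $1/3$, matching the chain-rule computation above.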
Finitely many events
For events $A_1, \ldots, A_n$ whose intersection does not have probability zero, the chain rule states
$$\begin{aligned}
\mathbb{P}\left(A_1 \cap A_2 \cap \ldots \cap A_n\right) &= \mathbb{P}\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right)\mathbb{P}\left(A_1 \cap \ldots \cap A_{n-1}\right) \\
&= \mathbb{P}\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right)\mathbb{P}\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right)\mathbb{P}\left(A_1 \cap \ldots \cap A_{n-2}\right) \\
&= \mathbb{P}\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right)\mathbb{P}\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right) \cdot \ldots \cdot \mathbb{P}(A_3 \mid A_1 \cap A_2)\,\mathbb{P}(A_2 \mid A_1)\,\mathbb{P}(A_1) \\
&= \mathbb{P}(A_1)\,\mathbb{P}(A_2 \mid A_1)\,\mathbb{P}(A_3 \mid A_1 \cap A_2) \cdot \ldots \cdot \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \\
&= \prod_{k=1}^{n} \mathbb{P}(A_k \mid A_1 \cap \dots \cap A_{k-1}) \\
&= \prod_{k=1}^{n} \mathbb{P}\left(A_k \,\Bigg|\, \bigcap_{j=1}^{k-1} A_j\right).
\end{aligned}$$
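This telescoping can be checked numerically. The following Python sketch (function names and the fair-die events are illustrative, not from the source) evaluates the product of conditional probabilities step by step on a finite sample space and compares it with the probability of the intersection computed directly:

```python
from fractions import Fraction

def prob(p, event):
    """P(event) under a pmf p mapping outcomes to Fractions."""
    return sum((p[w] for w in event), Fraction(0))

def chain_rule_product(p, events):
    """Compute prod_k P(A_k | A_1 ∩ ... ∩ A_{k-1}) by telescoping."""
    running = set(p)                  # empty conditioning set = whole space
    product = Fraction(1)
    for A in events:
        product *= prob(p, running & A) / prob(p, running)
        running &= A                  # extend the conditioning event
    return product

# Fair six-sided die and three (non-independent) events.
p = {w: Fraction(1, 6) for w in range(1, 7)}
A1, A2, A3 = {1, 2, 3, 4}, {2, 3, 4}, {3, 4}
assert chain_rule_product(p, [A1, A2, A3]) == prob(p, A1 & A2 & A3)  # 1/3
```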
Example 1
For $n = 4$, i.e. four events, the chain rule reads
$$\begin{aligned}
\mathbb{P}(A_1 \cap A_2 \cap A_3 \cap A_4) &= \mathbb{P}(A_4 \mid A_3 \cap A_2 \cap A_1)\,\mathbb{P}(A_3 \cap A_2 \cap A_1) \\
&= \mathbb{P}(A_4 \mid A_3 \cap A_2 \cap A_1)\,\mathbb{P}(A_3 \mid A_2 \cap A_1)\,\mathbb{P}(A_2 \cap A_1) \\
&= \mathbb{P}(A_4 \mid A_3 \cap A_2 \cap A_1)\,\mathbb{P}(A_3 \mid A_2 \cap A_1)\,\mathbb{P}(A_2 \mid A_1)\,\mathbb{P}(A_1).
\end{aligned}$$
Example 2
We randomly draw 4 cards (one at a time) without replacement from a deck of 52 cards. What is the probability that we have picked 4 aces?
First, we set $A_n := \{\text{draw an ace in the } n\text{th try}\}$. We obtain the following probabilities:
$$\mathbb{P}(A_1) = \frac{4}{52}, \qquad \mathbb{P}(A_2 \mid A_1) = \frac{3}{51}, \qquad \mathbb{P}(A_3 \mid A_1 \cap A_2) = \frac{2}{50}, \qquad \mathbb{P}(A_4 \mid A_1 \cap A_2 \cap A_3) = \frac{1}{49}.$$
Applying the chain rule,
$$\mathbb{P}(A_1 \cap A_2 \cap A_3 \cap A_4) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49} = \frac{24}{6497400} = \frac{1}{270725}.$$
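The same computation in exact arithmetic, as a small Python sketch using the standard `fractions` module (the factors mirror those in the text):

```python
from fractions import Fraction

# One factor per draw: both the number of remaining aces and the
# deck size shrink by one after each ace is drawn.
factors = [Fraction(4, 52), Fraction(3, 51), Fraction(2, 50), Fraction(1, 49)]

p = Fraction(1)
for f in factors:
    p *= f

print(p)  # 1/270725, i.e. 24/6497400 in lowest terms
```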
Statement of the theorem and proof
Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Recall that the conditional probability of an $A \in \mathcal{A}$ given $B \in \mathcal{A}$ is defined as
$$\mathbb{P}(A \mid B) := \begin{cases} \dfrac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}, & \mathbb{P}(B) > 0, \\ 0, & \mathbb{P}(B) = 0. \end{cases}$$
Then we have the following theorem: for events $A_1, \ldots, A_n \in \mathcal{A}$ with $\mathbb{P}(A_1 \cap \ldots \cap A_{n-1}) > 0$,
$$\mathbb{P}(A_1 \cap \ldots \cap A_n) = \prod_{k=1}^{n} \mathbb{P}(A_k \mid A_1 \cap \ldots \cap A_{k-1}).$$
The proof is by induction on $n$. The case $n = 1$ is trivial. For the inductive step, $\mathbb{P}(A_1 \cap \ldots \cap A_{n-1}) > 0$, so the definition of conditional probability gives $\mathbb{P}(A_1 \cap \ldots \cap A_n) = \mathbb{P}(A_n \mid A_1 \cap \ldots \cap A_{n-1})\,\mathbb{P}(A_1 \cap \ldots \cap A_{n-1})$, and applying the induction hypothesis to the second factor yields the product formula.
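A direct translation of this definition into Python, including the convention that the conditional probability is $0$ when $\mathbb{P}(B) = 0$ (the function name and die example are illustrative):

```python
from fractions import Fraction

def cond_prob(p, A, B):
    """P(A | B) under pmf p, with P(A | B) := 0 whenever P(B) = 0."""
    p_B = sum((p[w] for w in B), Fraction(0))
    if p_B == 0:
        return Fraction(0)            # the convention in the definition
    p_AB = sum((p[w] for w in A & B), Fraction(0))
    return p_AB / p_B

# Fair die: P(even | greater than 3) = P({4, 6}) / P({4, 5, 6}) = 2/3.
p = {w: Fraction(1, 6) for w in range(1, 7)}
print(cond_prob(p, {2, 4, 6}, {4, 5, 6}))  # 2/3
```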
Chain rule for discrete random variables
Two random variables
For two discrete random variables $X, Y$, we use the events $A := \{X = x\}$ and $B := \{Y = y\}$ in the definition above, and find the joint distribution as
$$\mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x \mid Y = y)\,\mathbb{P}(Y = y),$$
or
$$\mathbb{P}_{(X,Y)}(x, y) = \mathbb{P}_{X \mid Y}(x \mid y)\,\mathbb{P}_Y(y),$$
where $\mathbb{P}_X(x) := \mathbb{P}(X = x)$ is the probability distribution of $X$ and $\mathbb{P}_{X \mid Y}(x \mid y)$ is the conditional probability distribution of $X$ given $Y$.
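For a concrete check, the identity can be verified cell by cell on a small joint table. The joint pmf below is a made-up example, not from the source:

```python
from fractions import Fraction

# Hypothetical joint pmf P_(X,Y) on {0, 1} x {0, 1}.
joint = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

def marginal_Y(y):
    """P_Y(y), obtained by summing the joint pmf over x."""
    return sum(pr for (x, yy), pr in joint.items() if yy == y)

def cond_X_given_Y(x, y):
    """P_{X|Y}(x | y) = P_(X,Y)(x, y) / P_Y(y)."""
    return joint[(x, y)] / marginal_Y(y)

# Chain rule: P(X=x, Y=y) = P(X=x | Y=y) * P(Y=y) for every cell.
assert all(joint[(x, y)] == cond_X_given_Y(x, y) * marginal_Y(y)
           for (x, y) in joint)
```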
Finitely many random variables
Let $X_1, \ldots, X_n$ be random variables and $x_1, \dots, x_n \in \mathbb{R}$. By the definition of conditional probability,
$$\mathbb{P}\left(X_n = x_n, \ldots, X_1 = x_1\right) = \mathbb{P}\left(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right)\mathbb{P}\left(X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right),$$
and using the chain rule, where we set $A_k := \{X_k = x_k\}$, we can find the joint distribution as
$$\begin{aligned}
\mathbb{P}\left(X_1 = x_1, \ldots, X_n = x_n\right) &= \mathbb{P}\left(X_1 = x_1 \mid X_2 = x_2, \ldots, X_n = x_n\right)\mathbb{P}\left(X_2 = x_2, \ldots, X_n = x_n\right) \\
&= \mathbb{P}(X_1 = x_1)\,\mathbb{P}(X_2 = x_2 \mid X_1 = x_1)\,\mathbb{P}(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) \cdot \ldots \\
&\qquad \cdot \mathbb{P}(X_n = x_n \mid X_1 = x_1, \dots, X_{n-1} = x_{n-1}).
\end{aligned}$$
Example
For $n = 3$, i.e. three random variables, the chain rule reads
$$\begin{aligned}
\mathbb{P}_{(X_1,X_2,X_3)}(x_1, x_2, x_3) &= \mathbb{P}(X_1 = x_1, X_2 = x_2, X_3 = x_3) \\
&= \mathbb{P}(X_3 = x_3 \mid X_2 = x_2, X_1 = x_1)\,\mathbb{P}(X_2 = x_2, X_1 = x_1) \\
&= \mathbb{P}(X_3 = x_3 \mid X_2 = x_2, X_1 = x_1)\,\mathbb{P}(X_2 = x_2 \mid X_1 = x_1)\,\mathbb{P}(X_1 = x_1) \\
&= \mathbb{P}_{X_3 \mid X_2, X_1}(x_3 \mid x_2, x_1)\,\mathbb{P}_{X_2 \mid X_1}(x_2 \mid x_1)\,\mathbb{P}_{X_1}(x_1).
\end{aligned}$$
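The three-factor decomposition can likewise be verified in code. The sketch below factors a made-up joint pmf on $\{0,1\}^3$ into $\mathbb{P}(X_1)\,\mathbb{P}(X_2 \mid X_1)\,\mathbb{P}(X_3 \mid X_1, X_2)$ and checks that the product reproduces every joint probability:

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint pmf on {0,1}^3; the eight weights sum to 1.
weights = [Fraction(k, 36) for k in (1, 2, 3, 4, 5, 6, 7, 8)]
joint = dict(zip(product((0, 1), repeat=3), weights))

def prefix_marginal(fixed):
    """P(X_1 = x_1, ..., X_k = x_k) for a prefix assignment `fixed`."""
    k = len(fixed)
    return sum(pr for xs, pr in joint.items() if xs[:k] == fixed)

# Chain rule: joint = P(X1) * P(X2 | X1) * P(X3 | X1, X2).
for xs in joint:
    factored = (prefix_marginal(xs[:1])
                * prefix_marginal(xs[:2]) / prefix_marginal(xs[:1])
                * prefix_marginal(xs[:3]) / prefix_marginal(xs[:2]))
    assert factored == joint[xs]
```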
Bibliography
Schilling, René L. (2021), Measure, Integral, Probability & Processes: Probab(ilistical)ly the Theoretical Minimum (1st ed.), Dresden: Technische Universität Dresden, ISBN 979-8-5991-0488-9.
Feller, William (1968), An Introduction to Probability Theory and Its Applications, vol. I (3rd ed.), New York / London / Sydney: Wiley, ISBN 978-0-471-25708-0.
Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, ISBN 0-13-790395-2, p. 496.