- Source: Bayes classifier
In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.
Definition
Suppose a pair $(X,Y)$ takes values in $\mathbb{R}^{d}\times\{1,2,\dots,K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by
$$(X\mid Y=r)\sim P_{r}\quad\text{for }r=1,2,\dots,K$$
where "$\sim$" means "is distributed as", and where $P_{r}$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X=x$ a guess or estimate of what the unobserved label $Y=r$ actually was. In theoretical terms, a classifier is a measurable function
$$C:\mathbb{R}^{d}\to\{1,2,\dots,K\},$$
with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as
$$\mathcal{R}(C)=\operatorname{P}\{C(X)\neq Y\}.$$
The Bayes classifier is
$$C^{\text{Bayes}}(x)=\underset{r\in\{1,2,\dots,K\}}{\operatorname{argmax}}\ \operatorname{P}(Y=r\mid X=x).$$
In practice, as in most of statistics, the difficulties and subtleties lie in modeling the probability distributions effectively, in this case the posterior probabilities $\operatorname{P}(Y=r\mid X=x)$. The Bayes classifier is a useful benchmark in statistical classification.
The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as
$$\mathcal{R}(C)-\mathcal{R}(C^{\text{Bayes}}).$$
This non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if its excess risk converges to zero as the size of the training data set tends to infinity.
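As an illustration (continuing the hypothetical two-Gaussian setup sketched above, with all parameters assumed for illustration), the risk of a fixed classifier and of the Bayes classifier can be estimated by Monte Carlo, and their difference approximates the excess risk:

```python
# Minimal sketch: Monte Carlo estimate of R(C) = P{C(X) != Y} for a crude rule
# and for the Bayes rule, under the assumed two-Gaussian model used above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
priors = np.array([0.6, 0.4])
dists = [norm(-1.0, 1.0), norm(1.5, 0.5)]

n = 200_000
y = rng.choice([1, 2], size=n, p=priors)                # sample labels Y
x = np.where(y == 1, rng.normal(-1.0, 1.0, size=n),     # sample X given Y
             rng.normal(1.5, 0.5, size=n))

def bayes_rule(x):
    return np.where(priors[0] * dists[0].pdf(x) >= priors[1] * dists[1].pdf(x), 1, 2)

def crude_rule(x):
    return np.where(x < 0.0, 1, 2)   # a deliberately suboptimal threshold

risk_bayes = np.mean(bayes_rule(x) != y)
risk_crude = np.mean(crude_rule(x) != y)
print(f"estimated Bayes risk:  {risk_bayes:.3f}")
print(f"estimated excess risk: {risk_crude - risk_bayes:.3f}")  # non-negative up to noise
```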
Assuming the components $x_{i}$ of $x$ to be conditionally independent given the class, we get the naive Bayes classifier, where
$$C^{\text{Bayes}}(x)=\underset{r\in\{1,2,\dots,K\}}{\operatorname{argmax}}\ \operatorname{P}(Y=r)\prod_{i=1}^{d}P_{r}(x_{i}).$$
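For instance, here is a minimal sketch (with made-up numbers) of the naive Bayes rule for $d=3$ binary features, where each per-feature class-conditional probability $P_{r}(x_{i})$ is assumed known:

```python
# Minimal sketch: naive Bayes with three binary features and two classes,
# all probabilities assumed for illustration.
import numpy as np

priors = {1: 0.5, 2: 0.5}                      # assumed P(Y = r)
# assumed per-feature probabilities P_r(x_i = 1) for each class r
p_feature_given_class = {1: np.array([0.8, 0.1, 0.5]),
                         2: np.array([0.3, 0.7, 0.5])}

def naive_bayes(x):
    """Return argmax_r P(Y = r) * prod_i P_r(x_i) for a binary feature vector x."""
    x = np.asarray(x)
    best_class, best_score = None, -np.inf
    for r, p in p_feature_given_class.items():
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))   # prod_i P_r(x_i)
        score = priors[r] * likelihood
        if score > best_score:
            best_class, best_score = r, score
    return best_class

print(naive_bayes([1, 0, 1]))   # features typical of class 1
print(naive_bayes([0, 1, 0]))   # features typical of class 2
```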
Properties
A proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.
Define the variables: the risk $R(h)$, the Bayes risk $R^{*}$, and the set of all possible classes to which a point can be classified, $Y=\{0,1\}$. Let the posterior probability of a point belonging to class 1 be $\eta(x)=\Pr(Y=1\mid X=x)$. Define the classifier $h^{*}$ as
$$h^{*}(x)={\begin{cases}1&{\text{if }}\eta(x)\geqslant 0.5,\\0&{\text{otherwise.}}\end{cases}}$$
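A one-line sketch of this thresholding rule, assuming some function eta(x) is available that returns $\Pr(Y=1\mid X=x)$ (the logistic posterior below is purely hypothetical):

```python
import math

def h_star(x, eta):
    # classify to class 1 exactly when the posterior eta(x) is at least 0.5
    return 1 if eta(x) >= 0.5 else 0

logistic_posterior = lambda x: 1.0 / (1.0 + math.exp(-x))   # assumed eta, for illustration
print(h_star(0.2, logistic_posterior))    # eta(0.2) ~ 0.55, so class 1
print(h_star(-1.0, logistic_posterior))   # eta(-1.0) ~ 0.27, so class 0
```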
Then we have the following results:
(a) $R(h^{*})=R^{*}$, i.e. $h^{*}$ is a Bayes classifier,
(b) for any classifier $h$, the excess risk satisfies $R(h)-R^{*}=2\,\mathbb{E}_{X}\!\left[|\eta(X)-0.5|\,\mathbb{I}_{\{h(X)\neq h^{*}(X)\}}\right]$,
(c) $R^{*}=\mathbb{E}_{X}\left[\min(\eta(X),1-\eta(X))\right]$,
(d) $R^{*}={\tfrac{1}{2}}-{\tfrac{1}{2}}\mathbb{E}\left[|2\eta(X)-1|\right]$.
Proof of (a): For any classifier $h$, we have
$$\begin{aligned}R(h)&=\mathbb{E}_{XY}\left[\mathbb{I}_{\{h(X)\neq Y\}}\right]\\&=\mathbb{E}_{X}\mathbb{E}_{Y|X}\left[\mathbb{I}_{\{h(X)\neq Y\}}\right]\\&=\mathbb{E}_{X}\left[\eta(X)\mathbb{I}_{\{h(X)=0\}}+(1-\eta(X))\mathbb{I}_{\{h(X)=1\}}\right]\end{aligned}$$
where the second line follows from Fubini's theorem (the tower property of expectations).
Notice that $R(h)$ is minimised by taking, for every $x\in X$,
$$h(x)={\begin{cases}1&{\text{if }}\eta(x)\geqslant 1-\eta(x),\\0&{\text{otherwise.}}\end{cases}}$$
Therefore the minimum possible risk is the Bayes risk, $R^{*}=R(h^{*})$.
Proof of (b):
$$\begin{aligned}R(h)-R^{*}&=R(h)-R(h^{*})\\&=\mathbb{E}_{X}\left[\eta(X)\mathbb{I}_{\{h(X)=0\}}+(1-\eta(X))\mathbb{I}_{\{h(X)=1\}}-\eta(X)\mathbb{I}_{\{h^{*}(X)=0\}}-(1-\eta(X))\mathbb{I}_{\{h^{*}(X)=1\}}\right]\\&=\mathbb{E}_{X}\left[|2\eta(X)-1|\,\mathbb{I}_{\{h(X)\neq h^{*}(X)\}}\right]\\&=2\,\mathbb{E}_{X}\left[|\eta(X)-0.5|\,\mathbb{I}_{\{h(X)\neq h^{*}(X)\}}\right]\end{aligned}$$
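Identity (b) can be sanity-checked numerically; the following sketch assumes $X\sim\mathrm{Uniform}(0,1)$ with posterior $\eta(x)=x$, takes $h^{*}(x)=\mathbb{I}\{x\geqslant 0.5\}$, and compares it with the suboptimal rule $h(x)=\mathbb{I}\{x\geqslant 0.7\}$; both sides come out to about 0.04.

```python
# Numerical check of identity (b) under an assumed toy model:
# X ~ Uniform(0, 1) and eta(x) = x.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=1_000_000)
eta = x                                    # assumed posterior Pr(Y = 1 | X = x)

h_star = (x >= 0.5).astype(int)            # Bayes rule for this eta
h      = (x >= 0.7).astype(int)            # a suboptimal threshold rule

def risk(pred):
    # R(h) = E[ eta * I{h = 0} + (1 - eta) * I{h = 1} ]
    return np.mean(np.where(pred == 0, eta, 1.0 - eta))

lhs = risk(h) - risk(h_star)                                  # R(h) - R*
rhs = 2.0 * np.mean(np.abs(eta - 0.5) * (h != h_star))        # right-hand side of (b)
print(f"{lhs:.4f}  {rhs:.4f}")   # both approximately 0.04
```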
Proof of (c):
$$\begin{aligned}R(h^{*})&=\mathbb{E}_{X}\left[\eta(X)\mathbb{I}_{\{h^{*}(X)=0\}}+(1-\eta(X))\mathbb{I}_{\{h^{*}(X)=1\}}\right]\\&=\mathbb{E}_{X}\left[\min(\eta(X),1-\eta(X))\right]\end{aligned}$$
Proof of (d):
$$\begin{aligned}R(h^{*})&=\mathbb{E}_{X}\left[\min(\eta(X),1-\eta(X))\right]\\&={\frac{1}{2}}-\mathbb{E}_{X}\left[\max(\eta(X)-1/2,\,1/2-\eta(X))\right]\\&={\frac{1}{2}}-{\frac{1}{2}}\mathbb{E}\left[|2\eta(X)-1|\right]\end{aligned}$$
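As a quick arithmetic check of (c) and (d) on a hypothetical two-point example: suppose $X$ is uniform on $\{x_{1},x_{2}\}$ with $\eta(x_{1})=0.9$ and $\eta(x_{2})=0.3$. Then
$$R(h^{*})={\tfrac{1}{2}}\min(0.9,\,0.1)+{\tfrac{1}{2}}\min(0.3,\,0.7)=0.05+0.15=0.2,$$
while
$${\tfrac{1}{2}}-{\tfrac{1}{2}}\mathbb{E}\left[|2\eta(X)-1|\right]={\tfrac{1}{2}}-{\tfrac{1}{2}}\cdot{\tfrac{0.8+0.4}{2}}=0.2,$$
so the two expressions agree.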
General case
The proof that the Bayes classifier minimises the classification error when each element can belong to any of $n$ categories proceeds by the tower property of expectations, as follows.
$$\begin{aligned}\mathbb{E}_{Y}\left(\mathbb{I}_{\{y\neq{\hat{y}}\}}\right)&=\mathbb{E}_{X}\mathbb{E}_{Y|X}\left(\mathbb{I}_{\{y\neq{\hat{y}}\}}\mid X=x\right)\\&=\mathbb{E}\left[\Pr(Y=1\mid X=x)\,\mathbb{I}_{\{{\hat{y}}\in\{2,3,\dots,n\}\}}+\Pr(Y=2\mid X=x)\,\mathbb{I}_{\{{\hat{y}}\in\{1,3,\dots,n\}\}}+\dots+\Pr(Y=n\mid X=x)\,\mathbb{I}_{\{{\hat{y}}\in\{1,2,3,\dots,n-1\}\}}\right]\end{aligned}$$
This is minimised by simultaneously minimising every term of the expectation, which is achieved by the classifier
$$h(x)=\underset{k}{\operatorname{arg\,max}}\ \Pr(Y=k\mid X=x)$$
for each observation $x$.
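A sketch of this rule, assuming the posterior probabilities $\Pr(Y=k\mid X=x)$ are available as a function (the three-class posterior below is made up for illustration):

```python
import numpy as np

def posterior(x):
    # assumed posteriors over n = 3 classes, for illustration only
    scores = np.exp([-(x - c) ** 2 for c in (0.0, 1.0, 2.0)])
    return scores / scores.sum()

def h(x):
    # classify x to the class k maximising Pr(Y = k | X = x); classes are 1..n
    return int(np.argmax(posterior(x))) + 1

print(h(0.1), h(1.2), h(5.0))   # -> 1 2 3
```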
See also
Naive Bayes classifier