Bayes classifier

In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.


      Definition


Suppose a pair $(X, Y)$ takes values in $\mathbb{R}^{d} \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by

$$(X \mid Y = r) \sim P_{r} \quad \text{for} \quad r = 1, 2, \dots, K,$$

where "$\sim$" means "is distributed as", and where $P_{r}$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : \mathbb{R}^{d} \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as

$$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$

The Bayes classifier is

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).$$
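As a concrete illustration of the definition (not part of the original article), the following Python sketch computes the Bayes classifier for a hypothetical two-class problem in which the priors $\operatorname{P}(Y = r)$ and the class-conditional densities $P_{r}$ are known one-dimensional Gaussians; every numerical value and function name below is an assumption chosen for the example.

    import numpy as np

    # Hypothetical two-class problem with known priors and known 1-D Gaussian
    # class-conditional densities P_r; every number is an illustrative assumption.
    priors = np.array([0.6, 0.4])   # P(Y = r) for r = 1, 2
    means = np.array([0.0, 2.0])    # mean of P_r
    stds = np.array([1.0, 1.5])     # standard deviation of P_r

    def gaussian_pdf(x, mean, std):
        """Normal density, evaluated elementwise."""
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

    def bayes_classifier(x):
        """Return argmax_r P(Y = r | X = x) for a scalar observation x."""
        # Unnormalised posterior P(Y = r) * p_r(x); the evidence P(X = x) is the
        # same for every r, so it cancels in the argmax and is never computed.
        unnormalised_posterior = priors * gaussian_pdf(x, means, stds)
        return int(np.argmax(unnormalised_posterior)) + 1   # classes labelled 1..K

    print(bayes_classifier(0.3))   # -> 1
    print(bayes_classifier(2.5))   # -> 2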


In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case $\operatorname{P}(Y = r \mid X = x)$. The Bayes classifier is a useful benchmark in statistical classification.
The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as

$$\mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}}).$$


      Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.
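The following sketch (again an illustration, not taken from the source) estimates the excess risk of a simple plug-in classifier in a setting where the Bayes rule is known exactly. The distributional assumptions, equal priors with unit-variance Gaussian class conditionals, make the Bayes rule a threshold at the midpoint of the two class means, so the risk of any threshold rule can be evaluated in closed form; as the training-set size grows, the excess risk of the plug-in rule shrinks toward zero, which is what consistency asserts.

    import math
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical binary problem (an assumption for illustration):
    # P(Y = 1) = 0.5, X | Y = 0 ~ N(0, 1), X | Y = 1 ~ N(1, 1).
    # Any threshold classifier h_t(x) = 1{x >= t} has risk
    #   R(h_t) = 0.5 * P(N(0, 1) >= t) + 0.5 * P(N(1, 1) < t),
    # and the Bayes rule is the threshold t = 0.5 (midpoint of the class means).

    def std_normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def risk_of_threshold(t):
        return 0.5 * (1.0 - std_normal_cdf(t)) + 0.5 * std_normal_cdf(t - 1.0)

    bayes_risk = risk_of_threshold(0.5)   # R(C^Bayes)

    def sample(n):
        y = rng.integers(0, 2, size=n)
        x = rng.normal(loc=y.astype(float), scale=1.0)
        return x, y

    for n in (50, 500, 50_000):
        x_tr, y_tr = sample(n)
        # Plug-in classifier: threshold at the midpoint of the estimated class means.
        t_hat = 0.5 * (x_tr[y_tr == 0].mean() + x_tr[y_tr == 1].mean())
        excess = risk_of_threshold(t_hat) - bayes_risk   # non-negative by construction
        print(f"n = {n:>6}: excess risk = {excess:.6f}")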
Treating the components $x_{i}$ of $x$ as independent within each class, we get the naive Bayes classifier, where

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_{r}(x_{i}).$$
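A minimal sketch of the resulting naive Bayes rule, assuming a hypothetical three-class problem whose two features are modelled as independent Gaussians within each class; all parameter values are illustrative assumptions, not taken from the article.

    import numpy as np

    # Hypothetical 3-class, 2-feature problem; within each class the features are
    # modelled as independent Gaussians. Every number is an illustrative assumption.
    priors = np.array([0.5, 0.3, 0.2])   # P(Y = r) for r = 1, 2, 3
    means = np.array([[0.0, 0.0],        # per-class, per-feature means
                      [2.0, 1.0],
                      [0.0, 3.0]])
    stds = np.array([[1.0, 1.0],         # per-class, per-feature standard deviations
                     [1.0, 0.5],
                     [2.0, 1.0]])

    def gaussian_pdf(x, mean, std):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

    def naive_bayes(x):
        """Return argmax_r P(Y = r) * prod_i P_r(x_i) for a feature vector x."""
        x = np.asarray(x)
        # Product over the d per-feature densities, one row per class.
        likelihoods = gaussian_pdf(x, means, stds).prod(axis=1)
        return int(np.argmax(priors * likelihoods)) + 1   # classes labelled 1..K

    print(naive_bayes([0.1, -0.2]))   # -> 1
    print(naive_bayes([1.8, 1.1]))    # -> 2
    print(naive_bayes([-0.5, 3.2]))   # -> 3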



      Properties


The proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.
Define the variables: the risk $R(h)$ of a classifier $h$, the Bayes risk $R^{*}$, and the set $\{0, 1\}$ of possible classes to which a point can be assigned. Let the posterior probability of a point belonging to class 1 be $\eta(x) = \Pr(Y = 1 \mid X = x)$. Define the classifier $h^{*}$ as

$$h^{*}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases}$$


Then we have the following results:

(a) $h^{*}$ is the Bayes classifier, i.e. $R(h^{*}) \leq R(h)$ for every classifier $h$, so that $R^{*} = R(h^{*})$;
(b) for any classifier $h$, the excess risk satisfies $R(h) - R^{*} = 2\,\mathbb{E}_{X}\left[\,\lvert \eta(X) - 0.5 \rvert\, \mathbb{I}_{\{h(X) \neq h^{*}(X)\}}\right]$;
(c) $R^{*} = \mathbb{E}_{X}\left[\min(\eta(X),\, 1 - \eta(X))\right]$;
(d) $R^{*} = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\left[\,\lvert 2\eta(X) - 1 \rvert\,\right]$.

Proof of (a): For any classifier $h$, we have

$$\begin{aligned}
R(h) &= \mathbb{E}_{XY}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\
&= \mathbb{E}_{X}\mathbb{E}_{Y|X}\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] \\
&= \mathbb{E}_{X}\left[\eta(X)\,\mathbb{I}_{\{h(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X) = 1\}}\right]
\end{aligned}$$

where the second line was derived through Fubini's theorem, and the third by evaluating the conditional expectation of the indicator given $X$.
Notice that $R(h)$ is minimised pointwise by taking, for every $x \in \mathbb{R}^{d}$,

$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \eta(x), \\ 0 & \text{otherwise,} \end{cases}$$

which is exactly $h^{*}$, since $\eta(x) \geq 1 - \eta(x)$ is equivalent to $\eta(x) \geq 0.5$. Therefore the minimum possible risk is the Bayes risk, $R^{*} = R(h^{*})$.
Proof of (b):

$$\begin{aligned}
R(h) - R^{*} &= R(h) - R(h^{*}) \\
&= \mathbb{E}_{X}\left[\eta(X)\,\mathbb{I}_{\{h(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X) = 1\}} - \eta(X)\,\mathbb{I}_{\{h^{*}(X) = 0\}} - (1 - \eta(X))\,\mathbb{I}_{\{h^{*}(X) = 1\}}\right] \\
&= \mathbb{E}_{X}\left[\,\lvert 2\eta(X) - 1 \rvert\,\mathbb{I}_{\{h(X) \neq h^{*}(X)\}}\right] \\
&= 2\,\mathbb{E}_{X}\left[\,\lvert \eta(X) - 0.5 \rvert\,\mathbb{I}_{\{h(X) \neq h^{*}(X)\}}\right]
\end{aligned}$$

where the third equality uses the fact that the terms cancel unless $h(X) \neq h^{*}(X)$, in which case their difference equals $\lvert 2\eta(X) - 1 \rvert$ by the definition of $h^{*}$.


Proof of (c):

$$\begin{aligned}
R(h^{*}) &= \mathbb{E}_{X}\left[\eta(X)\,\mathbb{I}_{\{h^{*}(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h^{*}(X) = 1\}}\right] \\
&= \mathbb{E}_{X}\left[\min(\eta(X),\, 1 - \eta(X))\right]
\end{aligned}$$


Proof of (d):

$$\begin{aligned}
R(h^{*}) &= \mathbb{E}_{X}\left[\min(\eta(X),\, 1 - \eta(X))\right] \\
&= \frac{1}{2} - \mathbb{E}_{X}\left[\max\left(\eta(X) - \tfrac{1}{2},\, \tfrac{1}{2} - \eta(X)\right)\right] \\
&= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}\left[\,\lvert 2\eta(X) - 1 \rvert\,\right]
\end{aligned}$$
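The identities in (c) and (d) are easy to check numerically. The sketch below (an illustration, not from the source) assumes $X \sim \mathrm{Uniform}(0, 1)$ and $\eta(x) = x$, simulates labels accordingly, and compares a direct Monte Carlo estimate of $R(h^{*})$ with the two closed-form expressions; all three come out close to $0.25$.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative assumption: X ~ Uniform(0, 1) and eta(x) = P(Y = 1 | X = x) = x.
    n = 1_000_000
    x = rng.uniform(0.0, 1.0, size=n)
    eta = x
    y = (rng.uniform(size=n) < eta).astype(int)   # Y | X = x ~ Bernoulli(eta(x))

    h_star = (eta >= 0.5).astype(int)             # the Bayes rule h* from above

    empirical_risk = np.mean(h_star != y)                    # direct estimate of R(h*)
    result_c = np.mean(np.minimum(eta, 1.0 - eta))           # (c): E[min(eta, 1 - eta)]
    result_d = 0.5 - 0.5 * np.mean(np.abs(2.0 * eta - 1.0))  # (d): 1/2 - 1/2 E|2 eta - 1|

    print(empirical_risk, result_c, result_d)     # all approximately 0.25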



General case

The proof that the Bayes classifier minimises classification error in the general case, where each element can belong to any of $n$ categories, proceeds by towering expectations as follows:

$$\begin{aligned}
\mathbb{E}_{Y}\left(\mathbb{I}_{\{y \neq \hat{y}\}}\right) &= \mathbb{E}_{X}\mathbb{E}_{Y|X}\left(\mathbb{I}_{\{y \neq \hat{y}\}} \mid X = x\right) \\
&= \mathbb{E}\left[\Pr(Y = 1 \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{2, 3, \dots, n\}\}} + \Pr(Y = 2 \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{1, 3, \dots, n\}\}} + \dots + \Pr(Y = n \mid X = x)\,\mathbb{I}_{\{\hat{y} \in \{1, 2, 3, \dots, n - 1\}\}}\right]
\end{aligned}$$


This is minimised by simultaneously minimising every term of the expectation, which is achieved by the classifier

$$h(x) = \underset{k}{\operatorname{argmax}}\; \Pr(Y = k \mid X = x)$$

for each observation $x$.
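To see the minimisation step concretely, the sketch below evaluates, for a single hypothetical posterior vector at some observation $x$, the conditional probability of error incurred by each possible prediction; the posterior values are assumptions chosen for illustration. Choosing the label with the largest posterior minimises this conditional error, and since that holds at every $x$, it minimises the overall risk.

    import numpy as np

    # Illustrative posterior Pr(Y = k | X = x), k = 1..4, at a single observation x.
    posterior = np.array([0.10, 0.50, 0.15, 0.25])

    # Predicting label k at this x incurs conditional error
    #   E_{Y|X=x}[ 1{Y != k} ] = 1 - Pr(Y = k | X = x),
    # i.e. the posterior mass of all the other classes.
    conditional_error = 1.0 - posterior

    for k, err in enumerate(conditional_error, start=1):
        print(f"predict {k}: conditional error {err:.2f}")

    # The argmax-posterior prediction minimises the conditional error at this x.
    print("Bayes choice:", int(np.argmax(posterior)) + 1)   # -> 2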


      See also


      Naive Bayes classifier


