Naive Bayes classifier

December 25, 2014 · 1 minute read

First, thanks to Mr. Zhang Yang for his article "算法杂货铺: 分类算法之朴素贝叶斯分类 (Naive Bayesian classification)", which is very clearly written. This post is a summary built on top of it.

The Bayes classifier is mentioned in scattered places throughout ISL, and it kept bothering me that I had never written it up properly.

1. Bayes classifier

The first thing to note is that the Naive Bayes classifier is only one kind of Bayes classifier. The definition of a Bayes classifier is actually very simple:

$$C^{Bayes}(x^{(i)}) = \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \Pr(Y = y_{k} \mid X = x^{(i)})$$

Within this general framework, many variants of the Bayes classifier have been derived, for example:

- Naive Bayes classifier
- Tree Augmented Naive Bayes classifier (TAN)
- Bayesian network Augmented Naive Bayes classifier (BAN)
- General Bayesian Network (GBN)

Here we only discuss the Naive Bayes classifier.

2. Naive Bayes classifier

Following Mr. Zhang Yang's article, the definition of a Bayes classifier can be written as follows:

Let $x^{(i)} = (x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{n}^{(i)})$ be a test point, where $x_{j}^{(i)}$ is the value of the $j^{\text{th}}$ feature of $x^{(i)}$.
The set of classes (labels) is $C = \{y_{1}, y_{2}, \dots, y_{K}\}$.
We abbreviate $\Pr(Y = y_{k} \mid X = x^{(i)})$ as $\Pr(y_{k} \mid x^{(i)})$.
If $k' = \underset{k = 1, 2, \dots, K}{\operatorname{argmax}} \Pr(y_{k} \mid x^{(i)})$, then $x^{(i)}$ is assigned to the class $y_{k'}$.

By Bayes' rule, we have:

$$\Pr(y_{k} \mid x^{(i)}) = \frac{\Pr(x^{(i)} \mid y_{k}) \Pr(y_{k})}{\Pr(x^{(i)})}$$

For a given $x^{(i)}$, $\Pr(x^{(i)})$ is the same for every class $y_{k}$, so the problem reduces to finding $\underset{k = 1, 2, \dots, K}{\operatorname{argmax}} \Pr(x^{(i)} \mid y_{k}) \Pr(y_{k})$.
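A small numeric check can make this step concrete. The sketch below uses made-up numbers (the priors and likelihoods are hypothetical, not taken from any dataset) to show that dividing by the evidence $\Pr(x^{(i)})$ rescales every score by the same factor and therefore leaves the argmax unchanged.

```python
# Hypothetical numbers for a single test point x with K = 3 classes.
priors = [0.5, 0.3, 0.2]          # Pr(y_k)
likelihoods = [0.02, 0.10, 0.05]  # Pr(x | y_k)

# Unnormalized scores Pr(x | y_k) * Pr(y_k).
joint = [lik * prior for lik, prior in zip(likelihoods, priors)]

# Evidence Pr(x) = sum_k Pr(x | y_k) * Pr(y_k); it does not depend on k.
evidence = sum(joint)
posteriors = [score / evidence for score in joint]


def argmax(values):
    """Return the index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])


best_by_joint = argmax(joint)
best_by_posterior = argmax(posteriors)
print(best_by_joint, best_by_posterior)
```

Both indices agree, which is why the denominator can be dropped when only the predicted class is needed.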

Assuming the features are independent of one another given the class $y_{k}$ (this is the "naive" assumption), we have:

$$\Pr(x^{(i)} \mid y_{k}) = \Pr(x_{1}^{(i)} \mid y_{k}) \Pr(x_{2}^{(i)} \mid y_{k}) \dots \Pr(x_{n}^{(i)} \mid y_{k}) = \prod_{j = 1}^{n} \Pr(x_{j}^{(i)} \mid y_{k})$$

The problem then becomes finding $\underset{k = 1, 2, \dots, K}{\operatorname{argmax}} \Pr(y_{k}) \prod_{j = 1}^{n} \Pr(x_{j}^{(i)} \mid y_{k})$.

The Naive Bayes classifier can therefore be defined as

$$C^{Naive Bayes}(x^{(i)}) = \underset{k \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \Pr(y_{k}) \prod_{j = 1}^{n} \Pr(x_{j}^{(i)} \mid y_{k})$$
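The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `naive_bayes_predict`, the dictionary layout of `priors` and `cond_probs`, and the toy "spam"/"ham" numbers are all hypothetical. The sketch also sums logarithms instead of multiplying probabilities, a common substitution that gives the same argmax while avoiding underflow when $n$ is large; it requires every probability involved to be strictly positive.

```python
import math


def naive_bayes_predict(priors, cond_probs, x):
    """Return the label maximizing Pr(y_k) * prod_j Pr(x_j | y_k).

    priors:     dict mapping label -> Pr(y_k)
    cond_probs: dict mapping label -> list (one entry per feature j) of
                dicts mapping a feature value -> Pr(x_j = value | y_k)
    x:          sequence of feature values for one test point

    Scores are accumulated in log space, so all probabilities must be > 0.
    """
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(x):
            score += math.log(cond_probs[label][j][value])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy model with two binary features (all numbers are made up).
priors = {"spam": 0.4, "ham": 0.6}
cond_probs = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.3, 0: 0.7}],
    "ham":  [{1: 0.1, 0: 0.9}, {1: 0.5, 0: 0.5}],
}

print(naive_bayes_predict(priors, cond_probs, (1, 0)))
print(naive_bayes_predict(priors, cond_probs, (0, 1)))
```

For the point `(1, 0)` the unnormalized scores are 0.4 × 0.8 × 0.7 = 0.224 for "spam" and 0.6 × 0.1 × 0.5 = 0.03 for "ham", so "spam" wins; for `(0, 1)` they are 0.024 and 0.27, so "ham" wins.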

3. Parameter Estimation and Event Models

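$\Pr(Y = y_{k})$ is fairly easy to estimate: just compute it from the sample counts, i.e. $\Pr(Y = y_{k}) = \frac{\text{\# of samples labeled } y_{k}}{\text{total \# of samples}}$. Since the Naive Bayes classifier is typically applied to a large number of samples, this simple frequency estimate works fine.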
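As a minimal sketch of this frequency estimate (the helper name `estimate_priors` and the toy label list are hypothetical):

```python
from collections import Counter


def estimate_priors(labels):
    """Estimate Pr(Y = y_k) as (# samples labeled y_k) / (total # samples)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy training labels (made up for illustration).
labels = ["spam", "ham", "ham", "spam", "ham"]
priors = estimate_priors(labels)
print(priors)
```

With two "spam" and three "ham" labels out of five, the estimates are 2/5 and 3/5.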

$\Pr(x_{j}^{(i)} \mid y_{k})$ is a bit more troublesome. According to Naive Bayes classifier - wikipedia:

… one must assume a distribution for the features from the training set. The assumptions on distributions of features are called the event model of the Naive Bayes classifier.

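for continuous features:
- Gaussian event model
- Collect the values of feature $j$ over all samples labeled $y_{k}$ and compute their mean $μ_{j k}$ and variance $σ_{j k}^{2}$, which gives a Gaussian distribution. Plugging the value $x_{j}^{(i)}$ into its density then gives the value used for $\Pr(x_{j}^{(i)} \mid y_{k})$ (strictly speaking a density value rather than a probability).
for discrete features:
- multinomial event model
- Bernoulli event model
- These are two very commonly used event models; see the wiki for details. For a deeper study, the wiki references papers that specifically discuss the pros and cons of these two event models in document classification.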
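The Gaussian event model described above can be sketched as follows. The helper names `fit_gaussian` and `gaussian_pdf` and the toy feature values are hypothetical, and the variance here divides by $n$ (the maximum-likelihood estimate), which is an assumption; dividing by $n - 1$ is an equally common choice.

```python
import math


def fit_gaussian(values):
    """Return (mean, variance) of feature j over the samples labeled y_k.

    The variance divides by n (maximum-likelihood estimate); this is an
    assumption of the sketch, not something fixed by the text.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def gaussian_pdf(x, mean, variance):
    """Gaussian density at x; assumes variance > 0."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * variance))


# Toy values of one continuous feature among samples of a single class.
values = [5.0, 6.0, 7.0]
mean, variance = fit_gaussian(values)

# Density at a test value; this plays the role of Pr(x_j | y_k).
likelihood = gaussian_pdf(6.5, mean, variance)
print(mean, variance, likelihood)
```

A feature that is constant within a class would give a variance of zero and break `gaussian_pdf`, so a real implementation would need to guard against that case.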

4. Sample correction

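Another issue to discuss is what to do when $\Pr(x_{j}^{(i)} \mid y_{k}) = 0$. For continuous features this rarely happens, but for discrete features it occurs whenever some value of some feature never appears among the samples of some class. A single zero factor makes the whole product zero for that class, which hurts the classifier's performance. To solve this we introduce Laplace smoothing (add-one smoothing). The idea is very simple: add 1 to the count of every feature value. When the number of samples is sufficiently large this has almost no effect on the result, and it removes the awkward zero-probability case described above.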
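A minimal sketch of add-one smoothing for a single discrete feature within a single class is given below. The helper name `smoothed_conditional`, the `alpha` parameter, and the toy color values are hypothetical. To keep the estimates summing to 1, the denominator grows by the number of possible values of the feature, which is the usual form of the add-one estimate.

```python
from collections import Counter


def smoothed_conditional(feature_values, possible_values, alpha=1):
    """Estimate Pr(x_j = v | y_k) with add-alpha (Laplace) smoothing.

    feature_values:  observed values of feature j among samples labeled y_k
    possible_values: every value feature j can take
    alpha:           pseudo-count added to each value (1 = add-one smoothing)
    """
    counts = Counter(feature_values)
    total = len(feature_values) + alpha * len(possible_values)
    return {v: (counts[v] + alpha) / total for v in possible_values}


# "blue" never appears in this class, so its raw frequency would be 0.
observed = ["red", "red", "green"]
probs = smoothed_conditional(observed, ["red", "green", "blue"])
print(probs)
```

Here the smoothed estimates are 3/6, 2/6 and 1/6 for "red", "green" and "blue", so the unseen value gets a small but nonzero probability.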

5. Example

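Both the original article and the wiki include worked examples; when something is hard to follow, going through an example makes it clear.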
