Naive Bayes Classifier

April 11, 2019 5 minute read

1. Generative vs Discriminative

See Mihaela van der Schaar: Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models:

Generative Model $\Rightarrow$ tries to learn $P_{Y, X} (y, x)$
- Either factorization can be used:
  - $P_{Y, X} (y, x) = P_{Y | X} (y | x) P_{X} (x)$
  - $P_{Y, X} (y, x) = P_{X | Y} (x | y) P_{Y} (y)$
- Explicitly models the distribution of both the features and the corresponding labels (classes)
- Aims to explain the generation of all data
- Example techniques:
  - Naive Bayes Classifier
  - Hidden Markov Models (HMM)
  - Gaussian Mixture Models (GMM)
  - Multinomial Mixture Models
Discriminative Model $\Rightarrow$ tries to learn $P_{Y | X} (y | x)$
- Aims to predict relevant data
- Example techniques:
  - $K$ nearest neighbors
  - logistic regression
  - linear regression
    - Yes, linear regression is in fact a discriminative model
    - I think this is exactly where the name "discriminative" falls short
  - Conditional Random Fields (CRFs)
    - Logistic Regression is the simplest CRF
  - SVMs
  - perceptrons
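The two factorizations of the joint distribution listed above can be checked on a toy example. The following is a minimal sketch, assuming a made-up discrete joint table (the numbers and the helper `predict` are purely illustrative and not from the referenced lecture):

```python
import numpy as np

# Hypothetical joint distribution P(Y, X): 2 classes (rows) x 3 feature values (columns).
joint = np.array([
    [0.10, 0.20, 0.10],   # y = 0
    [0.30, 0.05, 0.25],   # y = 1
])

p_y = joint.sum(axis=1)              # marginal P(Y), shape (2,)
p_x = joint.sum(axis=0)              # marginal P(X), shape (3,)
p_x_given_y = joint / p_y[:, None]   # P(X | Y): each row sums to 1
p_y_given_x = joint / p_x[None, :]   # P(Y | X): each column sums to 1

# Both factorizations recover the same joint distribution.
joint_via_y = p_x_given_y * p_y[:, None]   # P(X | Y) P(Y)
joint_via_x = p_y_given_x * p_x[None, :]   # P(Y | X) P(X)

def predict(x_index):
    """Pick the most probable class for a given feature value.

    A discriminative model only needs p_y_given_x for this step;
    a generative model keeps the whole joint table.
    """
    return int(np.argmax(p_y_given_x[:, x_index]))
```

Only `p_y_given_x` is used for prediction, which is the part a discriminative model learns directly.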

2. Frequentist vs Bayesian

Section 5.6, Bayesian Statistics, of Deep Learning says:

As discussed in section 5.4.1, the frequentist perspective is that the true parameter value $θ^{*}$ is fixed but unknown, while the point estimate ${\hat{Θ}}_{m}$ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. ~~The dataset is directly observed and so is not random~~. On the other hand, the true parameter $θ^{*}$ is unknown or uncertain and thus is represented as a random variable.

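However, judging from other material I have found, and from the Example: Bayesian Linear Regression subsection later in Deep Learning itself, I do not see Bayesian machine learning treating the (training) dataset as observed when building the model. So I think the biggest difference between Frequentist and Bayesian machine learning is: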

Frequentist $\Rightarrow$ the true, unknown parameter $θ^{*}$ is a value
- So Frequentist machine learning $\Rightarrow P_{D} (d; θ)$
  - $θ$ is not modeled probabilistically
Bayesian $\Rightarrow$ the true, unknown parameter $θ^{*}$ is a random variable
- So Bayesian machine learning $\Rightarrow P_{D, Θ} (d, θ)$
  - $θ$ is modeled probabilistically
- This $θ$ can be a latent variable

In practice, the usual procedure is:

Frequentist $\Rightarrow$
- Write out the expression $P_{D} (d; θ) = \prod_{i = 1}^{m} P_{D_{i}} (d_{i}; θ)$ (assuming the data points are independent)
- Compute a point estimate $\hat{θ}$ that approximates $θ^{*}$, so that $P_{D} (d; \hat{θ})$ approximates the true distribution $P_{D}^{*}$
- Predict on test data: ${\hat{d}}_{m + 1} = E [D_{m + 1}]$, where $D_{m + 1} \sim P_{D_{m + 1}} (d_{m + 1}; \hat{θ})$
Bayesian $\Rightarrow$
- Rewrite with Bayes' rule: $P_{Θ | D} (θ | D) = \frac{P_{D | Θ} (D | θ) P_{Θ} (θ)}{P_{D} (D)}$
  - Or use $P_{Θ | D} (θ | D) \propto P_{D | Θ} (D | θ) P_{Θ} (θ)$ to do MAP
  - Note: doing MAP can make this look very Frequentist, but the defining feature of the Bayesian approach is really this rewriting
- Predict on test data: $P (D_{m + 1} | D) = \int P (D_{m + 1} | θ) P (θ | D) \, d θ$
  - This yields the distribution of $D_{m + 1} | D$
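To make the two procedures above concrete, here is a minimal sketch on a coin-flip dataset. The data, the Bernoulli model, and the Beta(2, 2) prior are all assumptions chosen for illustration; with a Beta prior the posterior and the predictive integral have closed forms, so no numerical integration is needed.

```python
# Hypothetical data: 10 coin flips (1 = heads), assumed independent Bernoulli(theta).
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
m = len(data)
heads = sum(data)
tails = m - heads

# Frequentist: theta is a fixed unknown value; estimate it with the MLE.
theta_mle = heads / m
pred_freq = theta_mle   # E[D_{m+1}] under P(d; theta_hat)

# Bayesian: theta is a random variable with a prior. Beta(a, b) is an assumed choice.
a, b = 2.0, 2.0
post_a, post_b = a + heads, b + tails   # posterior is Beta(post_a, post_b) by conjugacy

# MAP estimate: the mode of the Beta posterior.
theta_map = (post_a - 1) / (post_a + post_b - 2)

# Posterior predictive P(D_{m+1} = 1 | D) = integral of theta * P(theta | D),
# which equals the posterior mean for a Bernoulli likelihood.
pred_bayes = post_a / (post_a + post_b)
```

With this data the MLE is 0.8, while the prior pulls both the MAP estimate (0.75) and the posterior predictive probability (10/14) toward 0.5.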

Note that here $D = \{D_{1}, \dots, D_{m}\}$ denotes the (training) dataset, viewed as one sample made up of $m$ random variables. Realizing $D_{i}$ gives a concrete data point $d_{i}$. A very important point:

This $D$ can stand for $D = Y, X$ or for $D = Y | X$, depending entirely on your needs
In other words: in both Frequentist and Bayesian machine learning, the form of $D$ determines whether you have a Generative or a Discriminative model

3. Generative vs Discriminative, Frequentist vs Bayesian

So these two classifications do not conflict, and we can lay them out as a $2 \times 2$ table (see Generative vs. Discriminative; Bayesian vs. Frequentist):

| | Frequentist | Bayesian |
|---|---|---|
| Discriminative | $P_{Y \| X} (y \| x; θ)$ | $P_{(Y \| X), Θ} ((y \| x), θ)$ |
| Generative | $P_{Y, X} (y, x; θ)$ | $P_{Y, X, Θ} (y, x, θ)$ |

Items to the right of the semicolon (;) are not modeled probabilistically
Note the notation:
- $P_{Y | X, Θ}$ denotes “distribution of $Y$, conditioned on $X$ and $Θ$”
- $P_{(Y | X), Θ}$ denotes “joint distribution of $Y | X$ and $Θ$”

4. Unsupervised vs Supervised

Clearly, both generative and discriminative models fall under supervised learning, because both involve $Y$.

Can we then simply understand unsupervised learning as learning $P_{X} (x)$? Not necessarily.

First, density estimation of $P_{X} (x)$ does count as unsupervised learning, but much of unsupervised learning actually takes the form $P_{K | X} (k | x)$, for example:

clustering can be viewed as learning $P_{C | X} (c | x)$
PCA can be viewed as learning $P_{X^{'} | X} (x^{'} | x)$
embedding can be viewed as learning $P_{E | X} (e | x)$
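As a rough illustration of the clustering case, the sketch below computes $P_{C | X} (c | x)$ for a one-dimensional mixture of two Gaussians. The mixture weights, means, and standard deviations are assumed rather than learned, and the function names are hypothetical; a real clustering algorithm would estimate these parameters from unlabeled data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Assumed (not learned) parameters for two clusters.
weights = np.array([0.5, 0.5])   # P(C = c)
means = np.array([-2.0, 2.0])
stds = np.array([1.0, 1.0])

def cluster_posterior(x):
    """Return P(C = c | X = x) for each cluster c via Bayes' rule."""
    unnormalized = weights * gaussian_pdf(x, means, stds)
    return unnormalized / unnormalized.sum()
```

A point midway between the two means is assigned to each cluster with probability 0.5, while a point at one of the means is assigned almost entirely to that cluster.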

Returning to the Frequentist vs Bayesian discussion: we can also let $D = X$ or $D = K | X$ (even though $K$ is unknown at the start), so there can be Frequentist unsupervised learning and Bayesian unsupervised learning as well.

5. Frequentist Discriminative Example: Linear Regression

Earlier, in Terminology Recap: Sampling / Sample / Sample Space / Experiment / Statistical Model / Statistic / Estimator / Empirical Distribution / Likelihood / Estimation and Machine Learning, we said:

MLE is equivalent to minimizing the KL divergence $D_{K L} ({\hat{P}}_{data} ‖ P_{model})$
MLE is equivalent to minimizing the cross-entropy $H ({\hat{P}}_{data}, P_{model})$
- When $P_{model}$ is Gaussian, this is equivalent to minimizing $MSE$
- That is, $MSE$ is the cross-entropy between the empirical distribution and a Gaussian model.
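The equivalences above can be verified numerically for a small categorical model. This is a minimal sketch with a made-up sample and an arbitrary candidate model; the function names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical sample over three categories.
sample = ["a", "a", "b", "a", "c", "b", "a", "a"]
m = len(sample)
p_data = {k: v / m for k, v in Counter(sample).items()}   # empirical distribution

def avg_nll(p_model):
    """Average negative log-likelihood of the sample under p_model."""
    return -sum(math.log(p_model[d]) for d in sample) / m

def cross_entropy(p, q):
    """H(p, q) = -sum_k p(k) log q(k)."""
    return -sum(p[k] * math.log(q[k]) for k in p)

def kl(p, q):
    """D_KL(p || q) = sum_k p(k) log(p(k) / q(k))."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

p_model = {"a": 0.5, "b": 0.3, "c": 0.2}   # an arbitrary candidate model

nll = avg_nll(p_model)
h = cross_entropy(p_data, p_model)
entropy = cross_entropy(p_data, p_data)     # H(p_data), independent of the model
```

Here `nll` equals `h`, and `h - entropy` equals `kl(p_data, p_model)`. Because `entropy` does not depend on the model, maximizing the likelihood, minimizing the cross-entropy, and minimizing the KL divergence all select the same model.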

Linear regression does not by itself require any Gaussian assumption, yet it is usually fit by minimizing $MSE$. Given the earlier statement that “$MSE$ is the cross-entropy between the empirical distribution and a Gaussian model”, where exactly does a Gaussian model appear in linear regression?

We can imagine that with an infinitely large training set, we might see several training examples with the same input value $x$ but different values of $y$. Suppose all outputs corresponding to a single input $x_{1}$ form a sample $Y_{1}$, and the training value $y_{1}$ is just one value of $Y_{1}$; we assume $Y_{1} | x_{1} \sim N (?, ?)$ (this is our $P_{model}$ at $x_{1}$).

By the linear regression assumption, $Y_{1} = w x_{1} + ϵ$, where $ϵ$ is the Bayes error; assuming $ϵ \sim N (0, σ^{2})$, we have:

$$
\begin{aligned}
E [Y_{1}] & = E [w x_{1}] + E [ϵ] = w x_{1} + 0 = w x_{1} \\
Var (Y_{1}) & = Var (w x_{1}) + Var (ϵ) = 0 + σ^{2} = σ^{2}
\end{aligned}
$$

From this we get $Y_{1} | x_{1} \sim N (w x_{1}, σ^{2})$, which is our $P_{model}$ at $x_{1}$. All the training data corresponding to $x_{1}$ then form ${\hat{P}}_{data}$, but here we do not need to know what the empirical distribution looks like; we only care about the cross-entropy:

You can also collect all the $Y_{i}$ together, in which case $P_{model}$ is a multivariate Gaussian: $Y | X \sim N (w X, σ^{2} I)$

$$
\begin{aligned}
w_{ML} & = \underset{w}{\arg\max} \prod_{i = 1}^{m} P_{Y_{i} | x_{i}} (y_{i}) \\
& = \underset{w}{\arg\max} \sum_{i = 1}^{m} \ln P_{Y_{i} | x_{i}} (y_{i}) \\
& = \underset{w}{\arg\max} \sum_{i = 1}^{m} \ln \left( \frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{(y_{i} - w x_{i})^{2}}{2 σ^{2}}} \right) \\
& = \underset{w}{\arg\max} \left( - \frac{m}{2} \ln 2 π - \frac{m}{2} \ln σ^{2} - \frac{\sum_{i = 1}^{m} (y_{i} - w x_{i})^{2}}{2 σ^{2}} \right) \\
& = \underset{w}{\arg\min} \sum_{i = 1}^{m} (y_{i} - w x_{i})^{2} \\
& = \underset{w}{\arg\min} \, MSE (Y | X)
\end{aligned}
$$

Reference: Maximum Likelihood Estimation For Regression
Does this example show that any algorithm that minimizes $MSE$ admits a corresponding Gaussian-model interpretation?
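The derivation above can be checked numerically. The sketch below simulates data from the no-intercept model $Y = w x + ϵ$ with Gaussian noise (the values of `w_true` and `sigma`, the sample size, and the grid are assumptions for the simulation) and confirms that the maximizer of the Gaussian log-likelihood coincides with the minimizer of the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, sigma = 2.0, 0.5   # assumed ground truth for the simulation
x = rng.uniform(-1.0, 1.0, size=200)
y = w_true * x + rng.normal(0.0, sigma, size=200)

def log_likelihood(w):
    """Gaussian log-likelihood of the data under Y_i | x_i ~ N(w x_i, sigma^2)."""
    m = len(y)
    residual = y - w * x
    return (-0.5 * m * np.log(2 * np.pi) - 0.5 * m * np.log(sigma**2)
            - np.sum(residual**2) / (2 * sigma**2))

def mse(w):
    """Mean squared error of the linear prediction w * x."""
    return np.mean((y - w * x) ** 2)

# Closed-form least-squares solution for the no-intercept model y = w x.
w_ls = np.sum(x * y) / np.sum(x * x)

# Grid search: the maximizer of the log-likelihood is the minimizer of the MSE.
grid = np.linspace(0.0, 4.0, 4001)
w_best_ll = grid[np.argmax([log_likelihood(w) for w in grid])]
w_best_mse = grid[np.argmin([mse(w) for w in grid])]

# Prediction for a new input is the mean of the fitted Gaussian.
x_new = 0.3
y_hat = w_ls * x_new
```

The terms involving $2 π$ and $σ^{2}$ shift the log-likelihood by a constant in $w$, which is why the grid search lands on the same point for both criteria.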

Finally, a word on prediction:

For a new test data point $x_{m + 1}$, prediction is simple: ${\hat{y}}_{m + 1} = w_{ML} x_{m + 1}$. But where does this formula come from?
Since our assumption is $Y_{m + 1} = w x_{m + 1} + ϵ$, I think the prediction should be understood as ${\hat{y}}_{m + 1} = E [Y_{m + 1}] = w_{ML} x_{m + 1}$

6. Bayesian Generative Example: Naive Bayes Classifier

References:

In short, $D = Y, X$; for any data point $(y^{(i)}, x^{(i)})$, treat $x^{(i)}$ as a feature vector: $x^{(i)} = (x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{n}^{(i)})$; then assume the features are conditionally independent given $Y$, which gives:

$$
\begin{aligned}
P (Y | X) & \propto P (X | Y) P (Y) \\
& \propto \left( \prod_{i = 1}^{n} P (x_{i} | Y) \right) P (Y)
\end{aligned}
$$

Now treat $Y$ as $θ$; $P (θ)$ can be replaced by the empirical distribution of $Y$, which gives:

$$
\begin{aligned}
P (θ | X) & \propto \left( \prod_{i = 1}^{n} P (x_{i} | θ) \right) P (θ) \\
\log P (θ | X) & = \sum_{i = 1}^{n} \log P (x_{i} | θ) + \log P (θ) + \text{const}
\end{aligned}
$$

Then simply apply MAP.
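The MAP step above can be sketched as follows for binary features. The tiny dataset, the Bernoulli form of $P (x_{i} | θ)$, and the Laplace smoothing constant `alpha` are assumptions added for illustration and are not taken from the referenced material.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (label, binary feature vector).
train = [
    ("spam", (1, 1, 0)),
    ("spam", (1, 0, 0)),
    ("spam", (1, 1, 1)),
    ("ham",  (0, 0, 1)),
    ("ham",  (0, 1, 1)),
]

labels = Counter(y for y, _ in train)   # class counts, used for the empirical prior
n_features = len(train[0][1])

# counts[y][j] = number of class-y examples whose feature j equals 1
counts = defaultdict(lambda: [0] * n_features)
for y, x in train:
    for j, v in enumerate(x):
        counts[y][j] += v

def log_posterior(y, x, alpha=1.0):
    """Unnormalized log P(y | x) = log P(y) + sum_j log P(x_j | y)."""
    score = math.log(labels[y] / len(train))   # empirical prior P(y)
    for j, v in enumerate(x):
        # Laplace smoothing (alpha) is an added assumption to avoid log(0).
        p1 = (counts[y][j] + alpha) / (labels[y] + 2 * alpha)
        score += math.log(p1 if v == 1 else 1.0 - p1)
    return score

def predict(x):
    """MAP prediction: the label with the highest unnormalized log-posterior."""
    return max(labels, key=lambda y: log_posterior(y, x))
```

For example, `predict((1, 1, 0))` compares the two log-posteriors and returns the label with the larger one. Working in log space turns the product over features into a sum, which avoids numerical underflow when there are many features.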

Note that Michael Collins: The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm arrives at MLE after a series of transformations; I do not think such a detour is necessary

For the assumptions on and handling of $P (x_{i} | θ)$, see Yao’s Blog: Naive Bayes classifier
