
Summarized from Bayesian Interpretation for Ridge Regression and the Lasso, Section 6.2.2 The Lasso, An Introduction to Statistical Learning.

The water runs a bit deep here, so this topic gets its own post.


1. What Is Bayes

1.1. Dictionary

  • Bayes: [ˈbeɪz]
  • a priori: [ˌɑpriˈɔri], from Latin a priori (“former”), literally “from the former”.
    • (logic) Based on hypothesis rather than experiment
      • The Chinese rendering “先验的” presumably captures this “before experiment” sense
      • A priori knowledge or justification is independent of experience (for example “All bachelors are unmarried”). It carries a feeling of “self-evident, no proof needed”.
    • Presumed without analysis
      • One assumes, a priori, that a parent would be better at dealing with problems, i.e. takes it for granted
  • a posteriori: [ˌɑpɒsteriˈɔ:rɪ] or [ˌeɪpɒsteriˈɔ:raɪ], from Latin a posteriori (“latter”), literally “from the latter”.
    • Relating to or derived by reasoning from observed facts; Empirical
      • A posteriori knowledge or justification is dependent on experience or empirical evidence (for example “Some bachelors I have met are very happy”).
      • What Locke calls “knowledge” they have called “a priori knowledge”; what he calls “opinion” or “belief” they have called “a posteriori” or “empirical knowledge”.
  • prior: [ˈpraɪə(r)]
  • posterior: [pɒˈstɪəriə(r)]

1.2. An Excellent Popular-Science Article

The article 数学之美番外篇:平凡而又神奇的贝叶斯方法 could not be written any better. Here is an excerpt on the history of the Bayesian method:

The so-called Bayesian method originates from an article Bayes wrote during his lifetime to solve an “inverse probability” problem; it was published only after his death, by his friend Richard Price. Before Bayes wrote that article, people already knew how to compute “forward probabilities”, e.g. “suppose a bag contains N white balls and M black balls; if you reach in and draw one, what is the probability that it is black?” A natural question runs the other way: “if we do not know in advance the ratio of black to white balls in the bag, but draw one (or several) balls with our eyes closed, what can we infer about the ratio of black to white balls in the bag after observing the colors of the drawn balls?” This is the so-called inverse probability problem.

1.3. Variations on Bayes’ Theorem

I can’t help grumbling: back in Conditional Probability, did you really not notice that $P(B|A) = \frac{P(BA)}{P(A)}$ can be plugged straight into Bayes’ formula...

Here we transform only the denominator of Bayes’ formula:

$$P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)} = \frac{P(A|B)P(B)}{P(AB) + P(AB^c)} = \frac{P(A|B)P(B)}{P(A)} \tag{1.1}$$

Again, take the ball-drawing example:

  • B: the ratio of black to white balls in the bag is blah blah blah
  • A: without knowing the ratio of black to white balls in the bag, we drew xxx balls: yyy white and zzz black

Then, following Understand Bayes Theorem (prior/likelihood/posterior/evidence), we have:

  • p(B|A) is the posterior (probability) distribution
    • the probability of B posterior to (after) the observation of A
    • Note that we do not call p(B|A) the posterior probability here. Strictly speaking, p(B|A) is a distribution law, a probability function; by definition it is a distribution, not a single probability value. Of course, reading it as a probability value is also defensible. The same goes for the prior.
  • p(A|B) is the likelihood
    • reversely: given that B has happened, how likely is A to happen?
    • Judging from Section 1.4, it seems this cannot simply be called the likelihood; to be investigated
  • p(B) is the prior (probability) distribution
    • prior to (before) any observation, what is the chance of B?
  • p(A) is the probability of the evidence
    • A has already happened; it is a fact, the evidence from which we infer B

If we ignore the evidence (it is a constant that does not depend on B), we get:

$$\text{posterior} \propto \text{prior} \times \text{likelihood} \tag{1.2}$$

$\propto$ is read as “is proportional to” or “varies as”.

$y \propto x$ simply means that $y = kx$ for some constant $k$. (Symbol explanation taken from List of mathematical symbols)
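
To make Equation (1.2) concrete, here is a minimal Python sketch of the ball-drawing example above; the two candidate bag compositions, the 50/50 prior, and the draw counts are all made-up numbers for illustration:

```python
# A minimal sketch of posterior ∝ prior × likelihood for the ball-drawing
# example; the candidate ratios, the 50/50 prior, and the draw counts are
# all made-up numbers for illustration.
from math import comb

def binom_pmf(k, n, p):
    """p(A|B): probability of k black balls in n draws (with replacement)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

hypotheses = {"30% black": 0.3, "70% black": 0.7}   # B: candidate ratios
prior = {h: 0.5 for h in hypotheses}                # p(B)
n, k = 10, 7                                        # A: 10 draws, 7 black

likelihood = {h: binom_pmf(k, n, p) for h, p in hypotheses.items()}  # p(A|B)
evidence = sum(prior[h] * likelihood[h] for h in hypotheses)         # p(A)
posterior = {h: prior[h] * likelihood[h] / evidence for h in hypotheses}

print(posterior)  # mass shifts sharply toward the "70% black" hypothesis
```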

1.4. Application in Regression

According to the lecture notes Likelihood Function Confusions, a common formulation is:

  • Y: the observed data
  • θ: the parameters
  • P(Y|θ): the joint distribution of the sample, which is proportional to the likelihood function
  • P(θ): the prior distribution of the parameters

2. Bayesian Interpretation for Ridge Regression and the Lasso

The book’s formulation is a bit odd. According to the lecture notes Bayesian Interpretations of Regularization:

  • p(Y|X,β) is the joint distribution over outputs Y given inputs X and the parameters β.
  • The likelihood of any fixed parameter vector β is L(β|X)=p(Y|X,β)

For the rest, just read page 226 of the book. The Bayesian Interpretations of Regularization notes walk through some of the derivations and are very helpful.
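
As a small sanity check on this notation, here is a sketch that evaluates $L(\beta|X) = p(Y|X,\beta)$ on the log scale for a linear model with Gaussian noise; the simulated data and every setting below are my own illustrative choices, not from the book:

```python
# A sketch of the likelihood L(beta|X) = p(Y|X, beta) for a linear model
# with i.i.d. N(0, sigma^2) noise, evaluated on the log scale; the data
# below are simulated and every setting here is an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

def log_likelihood(beta):
    """log p(Y|X, beta): a sum of n independent Gaussian log densities."""
    resid = Y - X @ beta
    return -len(Y) / 2 * np.log(2 * np.pi * sigma**2) \
           - np.sum(resid**2) / (2 * sigma**2)

print(log_likelihood(beta_true))    # near the maximum
print(log_likelihood(np.zeros(p)))  # much lower
```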

3. Exercise 7

We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.

(a) Question

Suppose that $y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i$, where $\epsilon_1, \dots, \epsilon_n$ are independent and identically distributed from a $N(0, \sigma^2)$ distribution. Write out the likelihood for the data.

(a) Answer

The likelihood for the data is:

$$\begin{aligned} L(\beta|X,Y) = p(Y|X,\beta) &= p(Y_1|x_1,\beta) \times \dots \times p(Y_n|x_n,\beta) = \prod_{i=1}^n p(Y_i|x_i,\beta) \\ &= \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2}{2\sigma^2}\right) \\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2\right) \end{aligned}$$
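
As a quick numeric confirmation of the last step (collapsing the product of $n$ Gaussian densities into a single exponential), here is a sketch in which the residuals are simulated stand-ins for $Y_i - (\beta_0 + \sum_j \beta_j X_{ij})$:

```python
# Numeric check that the product of n Gaussian densities equals the single
# collapsed exponential form above; the residuals are simulated stand-ins
# for Y_i - (beta_0 + sum_j beta_j * X_ij).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma = 20, 2.0
resid = rng.normal(scale=sigma, size=n)

product_form = np.prod(norm.pdf(resid, scale=sigma))
collapsed = (1 / (sigma * np.sqrt(2 * np.pi)))**n \
            * np.exp(-np.sum(resid**2) / (2 * sigma**2))
print(np.isclose(product_form, collapsed))  # True
```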

(b) Question

Assume the following prior for β: $\beta_1, \dots, \beta_p$ are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter b: i.e. $p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)$. Write out the posterior for β in this setting.

(b) Answer

The posterior under the double-exponential (Laplace) prior with mean 0 and common scale parameter b, i.e. $p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)$, is:

$$f(\beta|X,Y) \propto f(Y|X,\beta)\,p(\beta|X) = f(Y|X,\beta)\,p(\beta)$$

Substituting our values from (a) and our density function gives us:

$$\begin{aligned} f(Y|X,\beta)\,p(\beta) &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2\right)\left(\frac{1}{2b}\exp\left(-\frac{|\beta|}{b}\right)\right) \\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{2b}\right) \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 - \frac{|\beta|}{b}\right) \end{aligned}$$
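
In code, the (unnormalized) log of this posterior could be sketched as below; the function name and signature are my own, and $|\beta|$ is read as $\sum_j |\beta_j|$, as in part (c):

```python
# A sketch of the unnormalized log posterior from (b); the function name and
# signature are illustrative, and |beta| is read as sum_j |beta_j| as in (c).
import numpy as np

def log_posterior_laplace(beta, beta0, X, Y, sigma, b):
    resid = Y - (beta0 + X @ beta)
    log_lik = -len(Y) * np.log(sigma * np.sqrt(2 * np.pi)) \
              - np.sum(resid**2) / (2 * sigma**2)          # log f(Y|X, beta)
    log_prior = -np.log(2 * b) - np.sum(np.abs(beta)) / b  # log p(beta)
    return log_lik + log_prior  # additive constants do not affect the mode
```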

(c) Question

Argue that the lasso estimate is the mode for β under this posterior distribution.

(c) Answer

Showing that the Lasso estimate for β is the mode under this posterior distribution is the same thing as showing that the most likely value for β is given by the lasso solution with a certain λ.

We can do this by showing that maximizing the posterior reduces to the canonical Lasso problem, Equation 6.7 from the book.

Let’s start by taking the logarithm of the posterior to simplify it:

$$\log(f(Y|X,\beta)\,p(\beta)) = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\left(\frac{1}{2b}\right)\right] - \left(\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{|\beta|}{b}\right)$$

We want to maximize the posterior; this means:

$$\arg\max_\beta f(\beta|X,Y) = \arg\max_\beta \left\{ \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\left(\frac{1}{2b}\right)\right] - \left(\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{|\beta|}{b}\right) \right\}$$

Since the first term is a constant that does not depend on β, maximizing the whole expression is equivalent to minimizing the subtracted second term with respect to β. This results in:

$$\begin{aligned} &= \arg\min_\beta\; \frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{|\beta|}{b} \\ &= \arg\min_\beta\; \frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{1}{b}\sum_{j=1}^p |\beta_j| \\ &= \arg\min_\beta\; \frac{1}{2\sigma^2}\left(\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{2\sigma^2}{b}\sum_{j=1}^p |\beta_j|\right) \end{aligned}$$

By letting $\lambda = \frac{2\sigma^2}{b}$, we can see that we end up with:

$$\begin{aligned} &= \arg\min_\beta\; \sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \lambda\sum_{j=1}^p |\beta_j| \\ &= \arg\min_\beta\; \mathrm{RSS} + \lambda\sum_{j=1}^p |\beta_j| \end{aligned}$$

which we know is the Lasso from Equation 6.7 in the book. Thus, when the prior on β is a Laplace distribution with mean zero and common scale parameter b, the mode of the posterior for β is given by the Lasso solution with $\lambda = 2\sigma^2/b$.
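
As a numeric cross-check (simulated data and illustrative settings, intercept dropped for simplicity), minimizing the negative log posterior directly and minimizing $\mathrm{RSS} + \lambda\sum_j|\beta_j|$ with $\lambda = 2\sigma^2/b$ should land on the same point:

```python
# Numeric check that the posterior mode from (b) matches the lasso minimizer
# with lambda = 2*sigma^2/b; simulated data, intercept dropped for simplicity.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, p, sigma, b = 100, 3, 1.0, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=sigma, size=n)
lam = 2 * sigma**2 / b

def neg_log_posterior(beta):
    return np.sum((Y - X @ beta)**2) / (2 * sigma**2) + np.sum(np.abs(beta)) / b

def lasso_objective(beta):
    return np.sum((Y - X @ beta)**2) + lam * np.sum(np.abs(beta))

map_est = minimize(neg_log_posterior, np.zeros(p), method="Nelder-Mead").x
lasso_est = minimize(lasso_objective, np.zeros(p), method="Nelder-Mead").x
print(map_est, lasso_est)  # the two minimizers agree up to solver tolerance
```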

(d) Question

Now assume the following prior for β: $\beta_1, \dots, \beta_p$ are independent and identically distributed according to a normal distribution with mean zero and variance c. Write out the posterior for β in this setting.

(d) Answer

The posterior under a normal prior with mean 0 and variance c is:

$$f(\beta|X,Y) \propto f(Y|X,\beta)\,p(\beta|X) = f(Y|X,\beta)\,p(\beta)$$

Our prior density function then becomes:

$$p(\beta) = \prod_{i=1}^p p(\beta_i) = \prod_{i=1}^p \frac{1}{\sqrt{2\pi c}}\exp\left(-\frac{\beta_i^2}{2c}\right) = \left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2c}\sum_{i=1}^p \beta_i^2\right)$$

Substituting our values from (a) and our density function gives us:

$$\begin{aligned} f(Y|X,\beta)\,p(\beta) &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2\right)\left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2c}\sum_{i=1}^p \beta_i^2\right) \\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 - \frac{1}{2c}\sum_{i=1}^p \beta_i^2\right) \end{aligned}$$
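
Mirroring the Laplace sketch from (b), the (unnormalized) log of this posterior could be written as follows; again, the function name and signature are my own:

```python
# Mirroring the Laplace sketch from (b): the unnormalized log posterior under
# the Gaussian prior from (d); the function name and signature are illustrative.
import numpy as np

def log_posterior_gaussian(beta, beta0, X, Y, sigma, c):
    resid = Y - (beta0 + X @ beta)
    log_lik = -len(Y) * np.log(sigma * np.sqrt(2 * np.pi)) \
              - np.sum(resid**2) / (2 * sigma**2)      # log f(Y|X, beta)
    log_prior = -len(beta) / 2 * np.log(2 * np.pi * c) \
                - np.sum(beta**2) / (2 * c)            # log p(beta)
    return log_lik + log_prior  # quadratic in beta, so the posterior is Gaussian
```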

(e) Question

Argue that the ridge regression estimate is both the mode and the mean for β under this posterior distribution.

(e) Answer

As in part (c), showing that the Ridge Regression estimate for β is the mode and mean under this posterior distribution is the same thing as showing that the most likely value for β is given by the ridge regression solution with a certain λ.

We can do this by showing that maximizing the posterior reduces to the canonical Ridge Regression problem, Equation 6.5 from the book.

Once again, we take the logarithm to simplify:

$$\log(f(Y|X,\beta)\,p(\beta)) = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\left(\frac{1}{\sqrt{2\pi c}}\right)^p\right] - \left(\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{1}{2c}\sum_{i=1}^p \beta_i^2\right)$$

We want to maximize the posterior; this means:

$$\arg\max_\beta f(\beta|X,Y) = \arg\max_\beta \left\{ \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n\left(\frac{1}{\sqrt{2\pi c}}\right)^p\right] - \left(\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{1}{2c}\sum_{i=1}^p \beta_i^2\right) \right\}$$

Since the first term is a constant that does not depend on β, maximizing the whole expression is equivalent to minimizing the subtracted second term with respect to β. This results in:

$$\begin{aligned} &= \arg\min_\beta\; \left(\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{1}{2c}\sum_{i=1}^p \beta_i^2\right) \\ &= \arg\min_\beta\; \left(\frac{1}{2\sigma^2}\right)\left(\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \frac{\sigma^2}{c}\sum_{i=1}^p \beta_i^2\right) \end{aligned}$$

By letting $\lambda = \frac{\sigma^2}{c}$, we end up with:

$$\begin{aligned} &= \arg\min_\beta\; \left(\frac{1}{2\sigma^2}\right)\left(\sum_{i=1}^n\left[Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij}\right)\right]^2 + \lambda\sum_{i=1}^p \beta_i^2\right) \\ &= \arg\min_\beta\; \mathrm{RSS} + \lambda\sum_{i=1}^p \beta_i^2 \end{aligned}$$

which we know is the Ridge Regression from Equation 6.5 in the book. Thus, when the prior on β is a normal distribution with mean zero and variance c, the mode of the posterior for β is given by the Ridge Regression solution with $\lambda = \sigma^2/c$. Since this posterior is Gaussian and a Gaussian’s mode coincides with its mean, the Ridge Regression solution is also the posterior mean.
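
As a closing numeric check (simulated data, illustrative settings, no intercept), the closed-form ridge solution $(X^TX + \lambda I)^{-1}X^TY$ with $\lambda = \sigma^2/c$ should match the numerically computed posterior mode:

```python
# Numeric check that the ridge closed form with lambda = sigma^2/c recovers
# the posterior mode from (d); simulated data, intercept dropped.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p, sigma, c = 100, 3, 1.0, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=sigma, size=n)
lam = sigma**2 / c

ridge_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def neg_log_posterior(beta):
    return np.sum((Y - X @ beta)**2) / (2 * sigma**2) + np.sum(beta**2) / (2 * c)

map_est = minimize(neg_log_posterior, np.zeros(p)).x  # smooth objective, BFGS
print(np.allclose(ridge_closed_form, map_est, atol=1e-3))  # True
```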
