Bayesian Interpretation for Ridge Regression and the Lasso + Exercise 7
Summarized from Bayesian Interpretation for Ridge Regression and the Lasso, Section 6.2.2 The Lasso, An Introduction to Statistical Learning.
The water runs a bit deep here, so it gets a post of its own.
1. What Is Bayes
1.1. Dictionary
- Bayes: [ˈbeɪz]
- a priori: [ˌɑpriˈɔri], from Latin a priori (“former”), literally “from the former”.
  - (logic) Based on hypothesis rather than experiment
    - The Chinese translation "先验的" presumably captures the sense of "before experiment".
    - A priori knowledge or justification is independent of experience (for example "All bachelors are unmarried"). It carries a feeling of "self-evident, no proof needed".
  - Presumed without analysis
    - One assumes, a priori, that a parent would be better at dealing with problems. (I.e., taking it for granted.)
- a posteriori: [ˌɑpɒsteriˈɔ:rɪ] or [ˌeɪpɒsteriˈɔ:raɪ], from Latin a posteriori (“latter”), literally “from the latter”.
  - Relating to or derived by reasoning from observed facts; empirical
    - A posteriori knowledge or justification is dependent on experience or empirical evidence (for example "Some bachelors I have met are very happy").
    - What Locke calls "knowledge" they have called "a priori knowledge"; what he calls "opinion" or "belief" they have called "a posteriori" or "empirical knowledge".
- prior: [ˈpraɪə(r)]
- posterior: [pɒˈstɪəriə(r)]
1.2. An Excellent Introductory Article
The article 数学之美番外篇:平凡而又神奇的贝叶斯方法 ("The Beauty of Mathematics, Side Story: The Plain yet Magical Bayesian Method") could hardly be written better. Here is an excerpt on the history of the Bayesian method:

The so-called Bayesian method originated from an article Bayes wrote during his lifetime to solve an "inverse probability" problem; it was published only after his death, by his friend Richard Price. Before Bayes wrote that article, people already knew how to compute "forward probability", as in: "Suppose a bag contains N white balls and M black balls; if you reach in and draw one, what is the probability of drawing a black ball?" A natural question is the reverse: "If we do not know in advance the ratio of black to white balls in the bag, but draw one ball (or several) with our eyes closed and observe the colors of the balls we took out, what can we then infer about the ratio of black to white balls in the bag?" This is the so-called inverse probability problem.
1.3. Variants of Bayes' Theorem
I can't help but grumble: back in Conditional Probability, did you really not notice that the two expansions of $P(A \cap B)$, namely $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$, rearrange directly into Bayes' theorem?

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
We transform only the denominator of Bayes' formula, using the law of total probability:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)}$$
Again take drawing balls as the example:
- $B$: the ratio of black to white balls in the bag is blah blah blah
- $A$: without knowing the ratio of black to white balls in the bag, we drew xxx balls and observed yyy white and zzz black
Then, following Understand Bayes Theorem (prior/likelihood/posterior/evidence), we have:

$$\underbrace{P(B \mid A)}_{\text{posterior}} = \frac{\overbrace{P(A \mid B)}^{\text{likelihood}} \times \overbrace{P(B)}^{\text{prior}}}{\underbrace{P(A)}_{\text{evidence}}}$$
- $P(B \mid A)$ is the posterior (probability) distribution: the probability of $B$ posterior to (i.e. after) the observation of $A$. Note that we do not call $P(B \mid A)$ a posterior probability here: strictly speaking, $P(B \mid A)$ is a distribution law, a probability function; by definition it is a distribution rather than a single probability value. Of course, reading it as a probability value is also perfectly defensible. The same applies to the prior.
- $P(A \mid B)$ is the likelihood: reversely, when $B$ has happened, how likely is $A$ to happen? (Judging from section 1.4, it seems this cannot simply be called the likelihood; to be investigated.)
- $P(B)$ is the prior (probability) distribution: prior to (i.e. before) any observation, what is the chance of $B$?
- $P(A)$ is the probability of the evidence: $A$ has already happened; it is a fact, and it is the evidence from which we infer $B$.
If we ignore the evidence (it is a constant), we get:

$$P(B \mid A) \propto P(A \mid B) \times P(B), \qquad \text{i.e.} \qquad \text{posterior} \propto \text{likelihood} \times \text{prior}$$
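To make this concrete, here is a minimal sketch of the inverse-probability ball problem above. The bag size, the uniform prior, and the observed draws are all made-up numbers for illustration; the point is that the code follows the posterior ∝ likelihood × prior recipe exactly, normalizing by the evidence only at the end.

```python
import numpy as np
from math import comb

# Hypothetical setup: the bag holds 10 balls, and B = number of black
# balls, with a uniform prior P(B) over 0..10.
n_balls = 10
candidates = np.arange(n_balls + 1)            # possible black-ball counts
prior = np.full(len(candidates), 1.0 / len(candidates))

# Observation A: 6 draws with replacement, 4 black and 2 white.
draws, blacks = 6, 4

# Likelihood P(A | B): binomial probability of 4 blacks in 6 draws when
# the fraction of black balls is B / n_balls.
p_black = candidates / n_balls
likelihood = comb(draws, blacks) * p_black**blacks * (1 - p_black)**(draws - blacks)

# Posterior P(B | A) ∝ P(A | B) * P(B); the evidence P(A) is just the
# normalizing constant, which is why we can ignore it until the last step.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

for b, p in zip(candidates, posterior):
    print(f"P({b} black balls | data) = {p:.3f}")
```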
1.4. Application in Regression
According to the lecture notes Likelihood Function Confusions, a common form is:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \propto p(y \mid \theta)\,p(\theta)$$
- $y$: the observed data
- $\theta$: the parameters
- $p(y \mid \theta)$: the joint distribution of the sample, which is proportional to the likelihood function
- $p(\theta)$: the prior distribution of the parameters
2. Bayesian Interpretation for Ridge Regression and the Lasso
The book's presentation is a little odd. In the words of the lecture notes Bayesian Interpretations of Regularization:
- $p(Y \mid X, \beta)$ is the joint distribution over outputs $Y$ given inputs $X$ and the parameters $\beta$.
- The likelihood of any fixed parameter vector $\beta$ is $L(\beta \mid Y, X) = p(Y \mid X, \beta)$.
For the rest, just read page 226 of the book. The lecture notes Bayesian Interpretations of Regularization walk through some of the derivations and are very helpful.
3. Exercise 7
We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.
(a) Question
Suppose that $y_i = \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i$ where $\epsilon_1, \dots, \epsilon_n$ are independent and identically distributed from a $N(0, \sigma^2)$ distribution. Write out the likelihood for the data.
(a) Answer
The likelihood for the data is:

$$L(\beta \mid X, Y) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2}{2\sigma^2}\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2\right)$$
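As a sanity check, here is a small Python sketch that evaluates the log of this likelihood straight from the formula. The data and coefficients are made up purely to exercise the function.

```python
import numpy as np

def log_likelihood(beta0, beta, sigma, X, y):
    """Log-likelihood under i.i.d. N(0, sigma^2) errors, as in (a)."""
    n = len(y)
    resid = y - beta0 - X @ beta
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum(resid**2) / (2 * sigma**2)

# Toy data with hypothetical coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.8, size=50)

print(log_likelihood(0.3, np.array([1.0, -2.0, 0.5]), 0.8, X, y))
```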
(b) Question
Assume the following prior for $\beta$: $\beta_1, \dots, \beta_p$ are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter $b$: i.e. $p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)$. Write out the posterior for $\beta$ in this setting.
(b) Answer
The prior is a double exponential (Laplace) distribution with mean 0 and common scale parameter $b$; since the $\beta_j$ are i.i.d., the joint prior is:

$$p(\beta) = \prod_{j=1}^{p} \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right) = \left(\frac{1}{2b}\right)^{p} \exp\left(-\frac{1}{b} \sum_{j=1}^{p} |\beta_j|\right)$$

By Bayes' theorem, the posterior is proportional to likelihood times prior, $p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\,p(\beta)$. Substituting our values from (a) and our density function gives us:

$$p(\beta \mid X, Y) \propto \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2\right) \cdot \left(\frac{1}{2b}\right)^{p} \exp\left(-\frac{1}{b} \sum_{j=1}^{p} |\beta_j|\right)$$
(c) Question
Argue that the lasso estimate is the mode for $\beta$ under this posterior distribution.
(c) Answer
Showing that the lasso estimate for $\beta$ is the mode under this posterior distribution amounts to showing that the lasso solution maximizes the posterior.
We can do this by showing that maximizing the posterior reduces to the canonical lasso objective, Equation 6.7 from the book.
Let's start by simplifying it by taking the logarithm of both sides (where $C$ collects terms that do not involve $\beta$):

$$\log p(\beta \mid X, Y) = C - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 - \frac{1}{b} \sum_{j=1}^{p} |\beta_j|$$
We want to maximize the posterior; this means:

$$\hat{\beta} = \arg\max_{\beta}\; p(\beta \mid X, Y) = \arg\max_{\beta} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 - \frac{1}{b} \sum_{j=1}^{p} |\beta_j| \right]$$
Since we are maximizing the negative of a sum of two non-negative terms, this is equivalent to minimizing the sum itself:

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \frac{1}{b} \sum_{j=1}^{p} |\beta_j|$$
Multiplying through by $2\sigma^2$ (which does not change the minimizer) and letting $\lambda = 2\sigma^2 / b$, this becomes:

$$\hat{\beta} = \arg\min_{\beta}\; \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
which we know is the lasso from Equation 6.7 in the book. Thus when the prior on $\beta$ is a Laplace distribution with mean zero and common scale parameter $b$, the mode of the posterior is given by the lasso solution with $\lambda = 2\sigma^2 / b$.
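Here is a quick numerical sketch of this equivalence on synthetic data, with made-up values of $\sigma$ and $b$ and the intercept dropped for simplicity (the helper names are mine): minimizing the negative log-posterior and minimizing the lasso objective with $\lambda = 2\sigma^2/b$ should land on approximately the same $\hat{\beta}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

sigma, b = 1.0, 0.5          # hypothetical noise scale and Laplace scale
lam = 2 * sigma**2 / b       # the correspondence derived above

def neg_log_posterior(beta):
    # -log posterior up to constants: RSS / (2 sigma^2) + ||beta||_1 / b
    rss = np.sum((y - X @ beta) ** 2)
    return rss / (2 * sigma**2) + np.sum(np.abs(beta)) / b

def lasso_objective(beta):
    # Equation 6.7: RSS + lambda * ||beta||_1
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

# Powell is derivative-free, so the non-smooth |beta| terms are fine.
beta0 = np.zeros(p)
map_est = minimize(neg_log_posterior, beta0, method="Powell").x
lasso_est = minimize(lasso_objective, beta0, method="Powell").x
print(map_est, lasso_est)    # the two minimizers should (approximately) agree
```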
(d) Question
Now assume the following prior for $\beta$: $\beta_1, \dots, \beta_p$ are independent and identically distributed according to a normal distribution with mean zero and variance $c$. Write out the posterior for $\beta$ in this setting.
(d) Answer
The prior is distributed according to a normal distribution with mean 0 and variance $c$. Our probability density function then becomes:

$$p(\beta) = \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi c}} \exp\left(-\frac{\beta_j^2}{2c}\right) = \left(\frac{1}{\sqrt{2\pi c}}\right)^{p} \exp\left(-\frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right)$$

Substituting our values from (a) and our density function gives us:

$$p(\beta \mid X, Y) \propto \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2\right) \cdot \left(\frac{1}{\sqrt{2\pi c}}\right)^{p} \exp\left(-\frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right)$$
(e) Question
Argue that the ridge regression estimate is both the mode and the mean for $\beta$ under this posterior distribution.
(e) Answer
As in part (c), showing that the ridge regression estimate for $\beta$ is the mode under this posterior distribution amounts to showing that the ridge solution maximizes the posterior.
We can do this by showing that maximizing the posterior reduces to the canonical ridge regression objective, Equation 6.5 from the book.
Once again, we can take the logarithm of both sides to simplify it (with $C$ collecting terms that do not involve $\beta$):

$$\log p(\beta \mid X, Y) = C - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 - \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2$$
We want to maximize the posterior; this means:

$$\hat{\beta} = \arg\max_{\beta} \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 - \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2 \right]$$
As before, maximizing the negative of a sum is equivalent to minimizing the sum itself:

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2$$
Multiplying through by $2\sigma^2$ and letting $\lambda = \sigma^2 / c$, this becomes:

$$\hat{\beta} = \arg\min_{\beta}\; \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
which we know is ridge regression from Equation 6.5 in the book. Thus when the prior on $\beta$ is a normal distribution with mean zero and variance $c$, the mode of the posterior is given by the ridge regression solution with $\lambda = \sigma^2 / c$. Moreover, the posterior itself is Gaussian (its logarithm is quadratic in $\beta$), and for a Gaussian the mode and the mean coincide, so the ridge estimate is both the mode and the mean.
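A numerical sketch of this last point, on synthetic data with made-up $\sigma^2$ and $c$ and the intercept again dropped: the closed-form ridge solution with $\lambda = \sigma^2/c$ matches the Gaussian posterior mean exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=1.0, size=n)

sigma2, c = 1.0, 0.25        # hypothetical noise variance and prior variance
lam = sigma2 / c             # the correspondence derived above

# Ridge solution (Equation 6.5, no intercept): (X'X + lam I)^{-1} X'y.
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mean under the Gaussian prior: (X'X/sigma2 + I/c)^{-1} X'y/sigma2.
# For a Gaussian posterior the mean equals the mode, and both equal ridge.
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / c, X.T @ y / sigma2)

print(np.allclose(ridge, post_mean))   # True
```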