ISL: Linear Model Selection and Regularization - Part 1
Summarized from Chapter 6, An Introduction to Statistical Learning.
0. Overview
In the regression setting, the standard linear model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$

is commonly used to describe the relationship between the response $Y$ and the predictors $X_1, \ldots, X_p$.
In the chapters that follow, we consider some approaches for extending the linear model framework.
- In Chapter 7 we generalize this model in order to accommodate non-linear, but still additive, relationships.
- In Chapter 8 we consider even more general non-linear models.
However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods. Hence, before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.
As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.
- Prediction Accuracy:
  - If $n \gg p$, i.e. the number of observations is much larger than the number of variables, the least squares estimates tend to also have low variance, and hence will perform well on test observations.
  - However, if $n$ is not much larger than $p$, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions.
  - Even worse, when $p > n$, least squares cannot be used, since there is no longer a unique least squares solution.
  - By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the prediction accuracy.
- Model Interpretability:
- It is often the case that some of the variables used in a multiple regression model are in fact not associated with the response. These irrelevant variables lead to unnecessary complexity in the resulting model.
- However, least squares is extremely unlikely to yield any coefficient estimates that are exactly zero for these potentially irrelevant variables.
- In this chapter, we will see some approaches for automatically performing feature selection or variable selection to exclude irrelevant variables from a multiple regression model.
In this chapter, we discuss three important alternatives.
- Subset Selection: identifying a subset of the $p$ predictors that we believe to be related to the response.
- Shrinkage:
  - fitting a model involving all $p$ predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance.
  - Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
- Dimension Reduction:
  - projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$,
  - by computing $M$ different linear combinations, or projections, of the variables.
  - Despite the fancy wording, this is basically PCA and the like.
1. Subset Selection
This is essentially the same question as section 2.4 of the Linear Regression - Part 1 post, Question 2: How to Decide on Important Variables? Or, do all the predictors help to explain $Y$?
1.1 Best Subset Selection
In short, best subset selection tries every one of the $2^p$ possible combinations of the $p$ predictors.

The algorithm is:
1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
   - This model simply predicts the sample mean for each observation.
2. For $k = 1, 2, \ldots, p$:
   - Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
   - Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_k$.
     - Here "best" is defined as having the smallest RSS, or equivalently largest $R^2$.
3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$, using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
In total, $\sum_{k=0}^{p} \binom{p}{k} = 2^p$ models are tried.

Doesn't this feel a bit like a multi-way sorting algorithm? I can even see the potential for doing it with MapReduce...
Why doesn't Step 3 simply keep using RSS or $R^2$?

- Because the RSS of these $p+1$ models decreases monotonically, and the $R^2$ increases monotonically, as the number of features included in the models increases. Therefore, if we use these statistics to select the best model, then we will always end up with a model involving all of the variables.
- More generally, a low RSS or a high $R^2$ indicates a model with a low training error, whereas we wish to choose a model that has a low test error.
Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by RSS in Step 2, we instead use the deviance, a measure that plays the role of RSS for a broader class of models. The deviance is -2 times the maximized log-likelihood; the smaller the deviance, the better the fit.
Best subset selection becomes computationally infeasible for values of p greater than around 40.
* There are computational shortcuts (so-called branch-and-bound techniques) for eliminating some choices, but they have their limitations as $p$ gets large.
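To make the algorithm concrete, here is a minimal best-subset sketch in Python (not from the book; it assumes a small $p$, a NumPy feature matrix `X`, a response vector `y`, and scikit-learn's `LinearRegression`):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset_selection(X, y):
    """Return, for each size k = 0..p, the subset of predictors with the lowest RSS."""
    n, p = X.shape
    # M_0: the null model simply predicts the sample mean for every observation.
    best_per_size = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        best_rss, best_subset = np.inf, None
        for subset in combinations(range(p), k):  # all C(p, k) candidate models
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - model.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_subset = rss, subset
        best_per_size[k] = (best_subset, best_rss)
    # Choosing among M_0..M_p (Step 3) should use CV, Cp, AIC, BIC, or adjusted R^2,
    # not RSS, as discussed below.
    return best_per_size
```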
1.2 Stepwise Selection
We call this search space the model space. Compared with Best Subset Selection, Stepwise Selection performs a guided search over the model space, so the effective model space it explores is far smaller, especially when $p$ is large.
1.2.1 Forward Stepwise Selection
The Forward Stepwise Selection algorithm is as follows:
1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
2. For $k = 0, 1, \ldots, p-1$:
   - Fit all $p-k$ models that augment ([ɔ:gˈment], = increase) the predictors in $\mathcal{M}_k$ with one additional predictor.
   - Pick the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$.
     - Here "best" is defined as having the smallest RSS, or equivalently largest $R^2$.
3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$, using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
In total, $1 + \sum_{k=0}^{p-1} (p-k) = 1 + p(p+1)/2$ models are tried, far fewer than the $2^p$ of best subset selection.

Compared with Best Subset Selection, the biggest advantage of Forward Stepwise Selection is the much smaller amount of computation. The drawback is that it is not guaranteed to find the best possible model, especially when the best model does not actually contain the predictors that were greedily picked in the earlier steps.
Forward stepwise selection can be applied even in the high-dimensional setting where $n < p$; however, in that case it is only possible to construct the submodels $\mathcal{M}_0, \ldots, \mathcal{M}_{n-1}$, since each submodel is fit using least squares, which does not yield a unique solution if $p \geq n$.
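A minimal sketch of this greedy search, under the same assumptions as the best-subset sketch above (NumPy arrays `X` and `y`, RSS as the within-step criterion):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y):
    """Greedily build M_0, M_1, ..., adding the predictor that most reduces RSS."""
    n, p = X.shape
    selected = []                    # predictors in the current model M_k
    path = [list(selected)]          # path[k] holds the predictor indices of M_k
    for _ in range(min(p, n - 1)):   # with n < p we can only build up to M_{n-1}
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - model.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        path.append(list(selected))
    return path  # choose among these models with CV, Cp, AIC, BIC, or adjusted R^2
```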
1.2.2 Backward Stepwise Selection
The Backward Stepwise Selection algorithm is as follows:
1. Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
2. For $k = p, p-1, \ldots, 1$:
   - Fit all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
   - Pick the best among these $k$ models, and call it $\mathcal{M}_{k-1}$.
     - Here "best" is defined as having the smallest RSS, or equivalently largest $R^2$.
3. Select the single best model from $\mathcal{M}_0, \ldots, \mathcal{M}_p$, using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
In total, $1 + \sum_{k=1}^{p} k = 1 + p(p+1)/2$ models are tried, the same count as forward stepwise selection.
Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model.
Unlike forward stepwise selection, backward stepwise selection requires that the number of samples $n$ is larger than the number of variables $p$, so that the full model can be fit.
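If you'd rather not hand-roll the loop, scikit-learn's `SequentialFeatureSelector` performs forward or backward stepwise search. Note that it scores candidate models by cross-validation rather than RSS, so it is a variant of the algorithm above, not an exact implementation of it; `n_features_to_select` below is an arbitrary illustrative choice:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Backward search: start from all p predictors and drop one at a time,
# keeping the candidate with the best cross-validated score at each step.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # purely illustrative stopping point
    direction="backward",
    cv=5,
)
# selector.fit(X, y); selector.get_support() then marks the retained predictors.
```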
1.2.3 Hybrid Approaches
- Variables are added to the model sequentially, in analogy to forward selection.
- However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.
- Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages.
1.3 Choosing the Optimal Model
As noted earlier:

"... a low RSS or a high $R^2$ indicates a model with a low training error, whereas we wish to choose a model that has a low test error."

That is why, in Step 3 of Subset Selection, we did not use RSS or $R^2$, but rather cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
This actually raises another question: if we want to use the test error as the criterion for choosing a model, how do we estimate the test error? (We already know that simply treating the training error as an estimate of the test error is clearly inappropriate.) There are two common approaches:
- Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
  - These are the $C_p$, AIC, BIC, and adjusted $R^2$ statistics used in Step 3 of Subset Selection and discussed below.
- Directly estimate the test error, using either a validation set approach or a CV approach.
1.3.1 $C_p$, AIC, BIC, and Adjusted $R^2$
We show in Chapter 2 that the training set MSE is generally an underestimate of the test MSE. (Recall that $\mathrm{MSE} = \mathrm{RSS}/n$.)
However, a number of techniques for adjusting the training error are available, which can be used to select among a set of models with different numbers of variables. We introduce 4 such approaches here:
- $C_p$
- Akaike information criterion (AIC)
  - criterion, [kraɪˈtɪəriən], a standard on which a judgment or decision may be based
- Bayesian information criterion (BIC)
- adjusted $R^2$
For a fitted least squares model containing $d$ predictors, the $C_p$ estimate of the test MSE is

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$$

where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement (typically estimated using the full model containing all predictors).

Essentially, the $C_p$ statistic adds a penalty of $2 d \hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. The penalty increases as the number of predictors in the model increases; when comparing models, we choose the one with the lowest $C_p$ value.

Beyond the scope of this book, one can show that if $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE.
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the standard linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and AIC is given by

$$\mathrm{AIC} = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$$

where, for simplicity, we have omitted an additive constant. Hence for least squares models, $C_p$ and AIC are proportional to each other.
BIC is derived from a Bayesian point of view. For the least squares model with $d$ predictors, the BIC is, up to irrelevant constants, given by

$$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d \hat{\sigma}^2\right)$$

Notice that BIC replaces the $2 d \hat{\sigma}^2$ used by $C_p$ with a $\log(n)\, d \hat{\sigma}^2$ term, where $n$ is the number of observations. Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$.
Recall from Chapter 3 that $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the $R^2$ always increases as well. For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$

Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error. Maximizing the adjusted $R^2$ is equivalent to minimizing $\mathrm{RSS}/(n - d - 1)$, which, due to the presence of $d$ in the denominator, may increase or decrease as more variables are added.
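As a quick reference, here is a minimal sketch of the four statistics as Python functions, following the least-squares forms above (`sigma2_hat` stands for the estimate $\hat{\sigma}^2$, typically obtained from the full model; `rss`, `tss`, `n`, `d` are assumed to be precomputed):

```python
import numpy as np


def cp(rss, n, d, sigma2_hat):
    return (rss + 2 * d * sigma2_hat) / n


def aic(rss, n, d, sigma2_hat):
    # Proportional to Cp for least squares models (additive constants omitted).
    return (rss + 2 * d * sigma2_hat) / n


def bic(rss, n, d, sigma2_hat):
    # log(n) > 2 once n > 7, so BIC penalizes model size more heavily than Cp.
    return (rss + np.log(n) * d * sigma2_hat) / n


def adjusted_r2(rss, tss, n, d):
    # Larger is better here, unlike Cp / AIC / BIC where smaller is better.
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```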
1.3.2 Validation and Cross-Validation
This procedure has an advantage relative to $C_p$, AIC, BIC, and adjusted $R^2$, in that it provides a direct estimate of the test error and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom or hard to estimate the error variance $\sigma^2$.
Whether we use indicators such as AIC and BIC or measure the test error directly with CV, we may end up with several models that are more or less equally good. In that case we can use the one-standard-error rule:
- We first calculate the standard error of the estimated test MSE for each model size.
- We then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
- The rationale here is that if a set of models appear equally good, then we might as well choose the simplest model — that is, the model with the smallest number of predictors.
2. Shrinkage Methods
Shrinkage is a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks them towards zero, which can significantly reduce the model's variance.
The two best-known shrinkage methods are:
- ridge regression
- the lasso
2.1 Ridge Regression
This is essentially the Regression with Regularization from Andrew Ng's course; Ng just didn't give it this many names.
Without regularization, least squares chooses the coefficients by minimizing $\mathrm{RSS} = \sum_{i=1}^{n} \big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \big)^2$. Ridge regression instead chooses $\hat{\beta}^R_\lambda$ to minimize

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
Note:

- $\lambda \sum_{j=1}^{p} \beta_j^2$ is called a shrinkage penalty, which becomes small when $\beta_1, \ldots, \beta_p$ are close to 0, and so it has the effect of shrinking the estimates of the $\beta_j$s towards 0.
  - The shrinkage penalty is not applied to the intercept $\beta_0$, because we do not want to shrink it — it is simply a measure of the mean value of the response when $x_{i1} = \cdots = x_{ip} = 0$.
- $\lambda \geq 0$ is a tuning parameter. As $\lambda \to \infty$, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach 0 more closely.
  - When $\lambda$ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.
  - Selecting a good value for $\lambda$ is critical. We can use CV to do this.
P216 works through an application of ridge regression and examines the fitted coefficients. Note that the amount of shrinkage is measured with the $\ell_2$ norm of the ridge coefficient vector, $\|\hat{\beta}^R_\lambda\|_2$, relative to $\|\hat{\beta}\|_2$ from least squares.
P217 discusses the fact that, unlike least squares, the ridge coefficient estimates are not scale equivariant, so it is best to apply ridge regression after standardizing the predictors so that they are all on the same scale.
P218 discusses Why Does Ridge Regression Improve Over Least Squares? The idea is quite simple:
- Ridge Regression shrinks the estimates of the $\beta_j$s towards 0
  - => simpler, less flexible model
    - => less overfitting; more underfitting
    - => lower variance; higher bias
Hence, ridge regression works best in situations where the least squares estimates have high variance.
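A minimal ridge sketch with scikit-learn (not the book's R labs; note that scikit-learn calls the tuning parameter `alpha`, and that the predictors are standardized first, for the reason discussed above):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the predictors, then shrink the coefficients with an L2 penalty.
# In scikit-learn the tuning parameter is called alpha; alpha=0 would recover
# ordinary least squares, and larger alpha means more shrinkage.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# ridge.fit(X_train, y_train); ridge.predict(X_test)
```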
2.2 The Lasso
Ridge regression does have one obvious disadvantage: unlike subset selection, ridge regression will include all $p$ predictors in the final model. The penalty shrinks all of the coefficients towards zero, but it does not set any of them exactly to zero (unless $\lambda = \infty$), which can make the model hard to interpret when $p$ is large.
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients $\hat{\beta}^L_\lambda$ minimize

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

- The ridge penalty $\lambda \sum_j \beta_j^2$ is an $\ell_2$ penalty.
- The lasso penalty $\lambda \sum_j |\beta_j|$ is an $\ell_1$ penalty.

The $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter $\lambda$ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.
As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
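A minimal lasso sketch in the same style; with a sufficiently large `alpha`, several entries of `coef_` come out exactly zero, which is precisely the variable-selection behavior described above (`alpha=0.1` is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
# After lasso.fit(X_train, y_train):
#   coefs = lasso.named_steps["lasso"].coef_
#   np.flatnonzero(coefs)  # indices of the predictors the lasso keeps (sparse model)
```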
P220 introduces Another Formulation for Ridge Regression and the Lasso. It is essentially a constrained optimization: minimize the RSS subject to a budget $s$ on the size of the coefficients, e.g. $\sum_{j} |\beta_j| \leq s$ for the lasso and $\sum_{j} \beta_j^2 \leq s$ for ridge.

The constraint in the "subject to" clause and the budget $s$ can be varied freely to express the various regression methods.
FIGURE 6.7 on P222 explains, from a geometric point of view, why the lasso drives some coefficients exactly to 0; it is worth a look.
Starting at P223 is Comparing the Lasso and Ridge Regression, which notes: The lasso implicitly assumes that a number of the coefficients truly equal zero. So which method performs better depends on what the true relationship actually is. But the true relationship is unknowable, so in practice we still rely on CV to decide. Roughly speaking, the lasso also lowers variance at the cost of higher bias, and its interpretability is certainly better than ridge regression's.
P224-225 presents A Simple Special Case for Ridge Regression and the Lasso. It uses an idealized example which, I have to say, is beautifully constructed. The intuitive takeaway: Ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
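To make that intuition concrete, here are the closed-form estimates in the book's idealized special case ($n = p$, the design matrix is the identity, the intercept is dropped); the exact constants (such as the $\lambda/2$ threshold) depend on how the objective is written, so treat this as a sketch of the shapes rather than a quotable formula:

$$\hat{\beta}_j^{\text{ridge}} = \frac{y_j}{1 + \lambda},
\qquad
\hat{\beta}_j^{\text{lasso}} =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2, \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2, \\
0 & \text{if } |y_j| \leq \lambda/2.
\end{cases}$$

Ridge divides every coefficient by the same factor (proportional shrinkage), while the lasso soft-thresholds: it moves every coefficient toward zero by the same amount and zeroes out the sufficiently small ones.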
P226 is Bayesian Interpretation for Ridge Regression and the Lasso; I will write a separate post to cover it.
2.3 Selecting the Tuning Parameter
P227. It mainly describes using CV to choose $\lambda$: pick a grid of $\lambda$ values, compute the cross-validation error for each, select the $\lambda$ with the smallest CV error, and finally refit the model on all available observations using the selected $\lambda$.
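A minimal sketch of that procedure using scikit-learn's built-in CV estimators (again, `alpha` is scikit-learn's name for $\lambda$, and the grid below is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alphas = np.logspace(-4, 2, 100)  # illustrative grid of candidate lambda values

# Each CV estimator evaluates every alpha on the grid by cross-validation,
# picks the alpha with the lowest CV error, and refits on all observations.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=10))
# After fitting: ridge_cv.named_steps["ridgecv"].alpha_ and
# lasso_cv.named_steps["lassocv"].alpha_ hold the selected values.
```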