ISL: Resampling Methods
Summarized from Chapter 5 of An Introduction to Statistical Learning.
0. Overview
Resampling methods involve:
- repeatedly drawing samples from a training set
- refitting a model of interest on each sample in order to obtain additional information about the fitted model
In this chapter, we discuss two of the most commonly used resampling methods:
- cross-validation, which can be used for
  - model assessment, the process of evaluating a model’s performance
  - model selection, the process of selecting the proper level of flexibility for a model
- bootstrap, which is used most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
1. Cross-Validation
Note that our goal is to estimate the test error rate, since the true test error rate cannot be measured directly.
1.1 The Validation Set Approach
The validation set approach involves randomly dividing the available set of observations into two parts, a training set and a validation set (a.k.a. hold-out set).
- The model is fit on the training set.
- The fitted model is used to predict the responses for the observations in the validation set.
- The resulting validation set error rate — typically assessed using MSE in the case of a quantitative response — provides an estimate of the test error rate.
The validation set approach is conceptually simple and is easy to implement. But it has two potential drawbacks:
- The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
- Only a subset of the observations — those that are included in the training set — are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
Cross-validation is a refinement of the validation set approach that addresses these two issues.
In short:

- Validation Set Approach: performs only a single training/validation split
  - so there is only one estimate, computed on that single validation set
- Cross-Validation: performs multiple training/validation splits
  - so we can compute estimates on multiple validation sets and then average them
- validation set == hold-out set
1.2 Leave-One-Out Cross-Validation
P178
In short: leave out exactly one observation each time. A single observation $(x_i, y_i)$ is used as the validation set, the model is fit on the remaining $n-1$ observations, and we compute $\mathrm{MSE}_i = (y_i - \hat{y}_i)^2$. Repeating this for $i = 1, \dots, n$ yields $n$ such errors.

The LOOCV estimate for the test MSE is the average of these $n$ test error estimates:

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i$$
LOOCV’s advantages over the validation set approach:

- far less bias, because each training set contains $n-1$ observations, almost as many as the entire data set
  - consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does
- no randomness when applied repeatedly
  - due to randomness in the training/validation set splits, the validation set approach may yield quite different results each time
  - LOOCV always yields the same result, since there is no randomness in the splits
LOOCV can be very time consuming if $n$ is large, since the model has to be fit $n$ times. With least squares linear or polynomial regression, however, a shortcut makes the cost of LOOCV the same as that of a single model fit:

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$$

where $\hat{y}_i$ is the $i$th fitted value from the original least squares fit, and $h_i$ is the leverage of observation $i$.
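To make the procedure concrete, here is a minimal hand-rolled LOOCV sketch for a linear model on the Auto data (my own illustration; the lab in Section 3.2 uses cv.glm() instead):

> library(ISLR)                 ## for the Auto data set
> n = nrow(Auto)
> mse = rep(0, n)
> for (i in 1:n) {
+   fit = lm(mpg~horsepower, data=Auto[-i, ])    ## fit on all but observation i
+   mse[i] = (Auto$mpg[i] - predict(fit, Auto[i, ]))^2
+ }
> mean(mse)                     ## the LOOCV estimate CV_(n)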
1.3 k-Fold Cross-Validation
This approach involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, the model is fit on the remaining $k-1$ folds, and the MSE is computed on the observations in the held-out fold. This procedure is repeated $k$ times, each time treating a different fold as the validation set, and the $k$-fold CV estimate is the average of the resulting $k$ test error estimates:

$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i$$

It is not hard to see that LOOCV is a special case of $k$-fold CV in which $k$ is set equal to $n$. In practice, one typically performs $k$-fold CV using $k = 5$ or $k = 10$.
The most obvious advantage over LOOCV is computational.
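Again as a minimal hand-rolled sketch (my own illustration; the random fold assignment via sample() is just one reasonable choice, and the lab in Section 3.3 uses cv.glm(..., K=10) instead):

> set.seed(1)
> k = 10
> folds = sample(rep(1:k, length.out=nrow(Auto)))  ## randomly assign each observation to a fold
> cv.mse = rep(0, k)
> for (j in 1:k) {
+   fit = lm(mpg~horsepower, data=Auto[folds != j, ])   ## train on the other k-1 folds
+   pred = predict(fit, Auto[folds == j, ])             ## predict the held-out fold
+   cv.mse[j] = mean((Auto$mpg[folds == j] - pred)^2)
+ }
> mean(cv.mse)                  ## the k-fold CV estimate CV_(k)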
1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
On bias:

- LOOCV uses the largest possible fraction of the data for training, so LOOCV will give approximately unbiased estimates of the test error
- $k$-fold CV (with $k = 5$ or $k = 10$) has somewhat higher bias
- the Validation Set Approach naturally has the highest bias

On variance: it turns out that LOOCV has higher variance than does $k$-fold CV with $k < n$:

- When we perform LOOCV, we are in effect averaging the outputs of $n$ fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other.
- In contrast, when we perform $k$-fold CV with $k < n$, we are averaging the outputs of $k$ fitted models that are somewhat less correlated with each other, since the overlap between the training sets in each model is smaller.
- Since the mean of many highly correlated quantities has higher variance than the mean of many less correlated quantities, the test error estimate resulting from LOOCV tends to have higher variance than the one from $k$-fold CV.
1.5 Cross-Validation on Classification Problems
Cross-validation works the same way on classification problems, except that rather than using MSE, we instead use the number of misclassified observations, i.e. error rate to quantify test error.
P185-186 give an example of using CV to choose the order of a polynomial; it is actually quite simple: compute the CV error rate for each candidate order and choose the order that minimizes it.
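As a concrete illustration, cv.glm() accepts a cost argument, so the misclassification rate can be used in place of MSE. A hedged sketch (the binary response mpg01 is a variable I construct here for illustration, not part of the book's example):

> library(ISLR); library(boot)
> Auto$mpg01 = as.numeric(Auto$mpg > median(Auto$mpg))  ## hypothetical binary response
> glm.fit = glm(mpg01~horsepower, data=Auto, family=binomial)
> cost = function(r, pi) mean(abs(r - pi) > 0.5)        ## error rate: predicted prob vs 0/1 response
> cv.glm(Auto, glm.fit, cost, K=10)$delta[1]            ## 10-fold CV error rate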
2. The Bootstrap
First, let me explain what the word "bootstrap" actually means; I have used jekyll bootstrap for so long without ever looking it up...

First, a strap is "a narrow and usually flat piece of a material that is used for fastening, holding together, or wrapping something".

The word bootstrap then has several meanings:
- A strap sewn at the side or top rear of a boot to help in pulling the boot on.
- A means of advancing oneself or accomplishing something without aid, as in “He used his business experience as a bootstrap to win voters”.
- (computing) The process by which the operating system of a computer is loaded into its memory
- (computing) The process necessary to compile the tools that will be used to compile the rest of the system or program.
- (statistics) Any method or instance of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution.
Clear now? Ok, let’s get started!
P187
The bootstrap is a statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method.
A common procedure in simulation experiments is:

- specify one or more parameters ourselves, e.g. a true value of $\alpha$
- generate samples according to these parameters
- compute estimates on these samples, and check how close they come to the true parameters

But for real data we cannot generate new samples from the original population; this is where the bootstrap comes in.
The bootstrap approach allows us to emulate the process of obtaining new sample sets, so that we can estimate without generating additional samples.
P189 describes the bootstrap procedure, which is actually quite simple:

- suppose the original data set has $n$ observations
- randomly select $n$ observations from the data set to produce a bootstrap data set; this sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set
- suppose we generate $B$ such bootstrap data sets; we then compute our estimate on each of these $B$ bootstrap data sets, and the standard deviation of the $B$ estimates serves as the bootstrap estimate of the standard error (see the sketch below)
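Here is a minimal sketch of that procedure in R, assuming for illustration that the statistic of interest is the median of a vector x (the data here are simulated; everything except the with-replacement resampling is my own choice):

> set.seed(1)
> x = rnorm(100, mean=5)        ## stand-in for a real data set of n = 100 observations
> B = 1000
> boot.est = rep(0, B)
> for (b in 1:B) {
+   idx = sample(length(x), replace=TRUE)  ## bootstrap data set: n draws with replacement
+   boot.est[b] = median(x[idx])           ## estimate on the b-th bootstrap data set
+ }
> sd(boot.est)                  ## bootstrap estimate of SE(median)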
~~~~~ 2016.08.16 addendum start ~~~~~
Actually I think the explanation at wikipedia: Bootstrap aggregating is clearer:
Given a standard training set $D$ of size $n$, bagging generates $m$ new training sets $D_i$, each of size $n'$, by sampling from $D$ uniformly and with replacement. By sampling with replacement, some observations may be repeated in each $D_i$. If $n' = n$, then for large $n$ the set $D_i$ is expected to have the fraction $(1 - 1/e) \approx 63.2\%$ of the unique examples of $D$, the rest being duplicates. This kind of sample is known as a bootstrap sample. The $m$ models are fitted using the above $m$ bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
~~~~~ 2016.08.16 addendum end ~~~~~
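A minimal sketch of the bagging procedure just quoted, for a regression model on the Auto data (my own illustration, with m and the model chosen arbitrarily):

> set.seed(1)
> m = 100                                    ## number of bootstrap training sets D_i
> preds = matrix(0, nrow=nrow(Auto), ncol=m)
> for (i in 1:m) {
+   idx = sample(nrow(Auto), replace=TRUE)   ## D_i: n draws from D with replacement
+   fit = lm(mpg~poly(horsepower,5), data=Auto[idx, ])
+   preds[, i] = predict(fit, Auto)          ## the i-th model's predictions
+ }
> bagged = rowMeans(preds)                   ## combine by averaging (regression)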
3. Lab: Cross-Validation and the Bootstrap
3.1 The Validation Set Approach
> library(ISLR)
> set.seed(1)
> train = sample(392,196) ## randomly pick 196 integers from [1,392] as the training indices
> lm.fit = lm(mpg~horsepower, data=Auto, subset=train)
> mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2) ## MSE on the validation set (observations not in train)
3.2 Leave-One-Out Cross-Validation
The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions together.

In a previous lab, we used the glm() function to perform logistic regression by passing in the family="binomial" argument. If we don't pass this family argument, glm() performs linear regression, just like the lm() function. The fitted model can then be passed to cv.glm() to compute the LOOCV estimate.

The cv.glm() function is part of the boot library.
> library(boot)
> glm.fit = glm(mpg~horsepower, data=Auto)
> cv.err = cv.glm(Auto, glm.fit)
> cv.err$delta
[1] 24.23 24.23
Note the delta component here. cv.glm() is actually designed for $k$-fold CV, and it computes two MSE values, both returned in delta:

- The first component is the raw cross-validation estimate of prediction error.
- The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.

So the adjusted cross-validation estimate is only meaningful when $k < n$, i.e. when we are not doing LOOCV; for LOOCV the two components are essentially identical, as seen above.
3.3 k-Fold Cross-Validation
> set.seed(17)
> cv.error.10 = rep(0,10)
> for (i in 1:10) {
+ glm.fit=glm(mpg~poly(horsepower,i), data=Auto)
+ cv.error.10[i]=cv.glm(Auto, glm.fit, K=10)$delta[1] ## raw 10-fold CV estimate for polynomial degree i
+ }
> cv.error.10
[1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50
In principle, the computation time for LOOCV for a least squares linear model should be much faster than for $k$-fold CV, due to the availability of the shortcut formula from Section 1.2; however, the cv.glm() function does not make use of this formula.
3.4 The Bootstrap
3.4.1 Estimating the Accuracy of a Statistic of Interest
Performing a bootstrap analysis in R entails only two steps:

- First, we must create a function that computes the statistic of interest.
- Second, we use the boot() function, which is part of the boot library, to perform the bootstrap.
We use the Portfolio data set in the ISLR package here. The statistic of interest is

$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}$$

We first create a function, alpha.fn(), which takes as input the $(X, Y)$ data as well as a vector indicating which observations should be used to estimate $\alpha$; it outputs the estimate of $\alpha$ based on the selected observations.
> alpha.fn = function(data, index){
+ X=data$X[index]
+ Y=data$Y[index]
+ ## the estimate of alpha based on the selected observations
+ return ((var(Y)-cov(X,Y)) / (var(X)+var(Y)-2*cov(X,Y)))
+ }
Before calling boot(), we can first simulate a single bootstrap estimate manually:
> set.seed(1)
> alpha.fn(Portfolio, sample(100,100,replace=T)) ## nrow(Portfolio) = 100
[1] 0.596
We can implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for $\alpha$, and computing the resulting standard deviation. However, the boot() function automates this approach.
> boot(Portfolio, alpha.fn, R=1000) # R = 1000 is the number of bootstrap replicates, i.e. the B number mentioned in Section 2
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)
Bootstrap Statistics :
    original     bias    std. error
t1*   0.5758  -7.315e-05     0.0886
3.4.2 Estimating the Accuracy of a Linear Regression Model
P195 contains nothing new.
P196 argues that the bootstrap does not rely on certain assumptions (such as the linear model being correct), so its standard error estimates are actually more accurate than those from the standard formulas, which do rely on such assumptions.
Note that the xxx.fn function can return multiple values, e.g. return (coef(lm(mpg~horsepower, data=data, subset=index))). We know that coef() returns two values, the intercept and the slope, so the Bootstrap Statistics in the boot() output will have two rows, t1* and t2*, and so on.
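A runnable version of this step (a sketch along the lines of the book's lab, with the boot.fn name as used there):

> boot.fn = function(data, index)
+   return(coef(lm(mpg~horsepower, data=data, subset=index)))  ## returns intercept and slope
> set.seed(1)
> boot(Auto, boot.fn, R=1000)   ## output has rows t1* (intercept) and t2* (slope)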