ISL: Linear Regression - Part 1
Summarized from Chapter 3 of *An Introduction to Statistical Learning*.
0. Overview
Linear Regression is a supervised learning approach, especially useful for predicting a quantitative response.
It serves to answer these questions:
- Is there a relationship between $X$ and $Y$?
- How strong is the relationship?
- Which $X_j$'s contribute to $Y$?
- How accurately can we estimate the effect of each $X_j$ on $Y$?
- How accurately can we predict future $Y$?
- Is the relationship linear?
  - A very good question; it touches the essence of the regression problem.
- Is there synergy among the $X_j$'s?
  - Perhaps spending \$50,000 on TV advertising and \$50,000 on radio advertising results in more sales than allocating \$100,000 to either individually. In marketing, this is known as a synergy ([ˈsɪnədʒi]) effect, while in statistics it is called an interaction effect.
1. Simple Linear Regression

1.1 Model

The simplicity of the method lies in the fact that it predicts a quantitative response $Y$ on the basis of a single predictor variable $X$, assuming an approximately linear relationship: $Y \approx \beta_0 + \beta_1 X$.

We can describe this relationship as "regressing $Y$ on $X$" (or "$Y$ onto $X$"),

where $\beta_0$ and $\beta_1$ are two unknown constants representing the intercept and slope, together known as the model coefficients or parameters.

We also assume that the true relationship between $X$ and $Y$ takes the form $Y = f(X) + \epsilon$ for some unknown function $f$, so the linear model can be written as $Y = \beta_0 + \beta_1 X + \epsilon$.

The error term $\epsilon$ is a catch-all for what this simple model misses: the true relationship is probably not linear, there may be other variables that cause variation in $Y$, and there may be measurement error.

This error term is typically assumed to be independent of $X$ and to have mean zero.
1.2 Estimating the Coefficients

We want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is as close as possible to the $n$ data points.

There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.

Before introducing the least squares approach, let's first define the residual.

1.2.1 Residual and RSS

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$.

Then $e_i = y_i - \hat{y}_i$ represents the $i$th residual.

We define the residual sum of squares (RSS) as $\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$.
1.2.2 $\hat{\beta}_0$ and $\hat{\beta}_1$

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. Using some calculus, one can show that the minimizers are

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} \equiv \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means.

In other words, $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares coefficient estimates for simple linear regression.
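As a quick sanity check, the formulas above can be computed directly in plain Python. The data values below are made up purely for illustration:

```python
# Least squares estimates for simple linear regression, by hand.
# Illustrative toy data (not from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
          / sum((xi - x_bar) ** 2 for xi in x)
beta0_hat = y_bar - beta1_hat * x_bar

# RSS = sum of squared residuals e_i = y_i - yhat_i
rss = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
```

For this toy data the fit is $\hat{\beta}_1 = 1.95$, $\hat{\beta}_0 = 0.15$; any other line through the same points gives a larger RSS.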
1.3 Assessing the Accuracy of the Coefficient Estimates

1.3.1 True Relationship

Note that there are three levels of "relationship" here:

- (i) the unknown true relationship, $Y = f(X) + \epsilon$;
- (ii) we assume the relationship is linear, $Y = \beta_0 + \beta_1 X + \epsilon$;
- (iii) we estimate the coefficients of this assumed linear relationship, based on the training data, as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.

Alternatively, you can view both the (i)~(ii) pair and the (ii)~(iii) pair as "true relationship vs. estimate" relationships.

"Assessing the Accuracy of the Coefficient Estimates" is therefore a question about levels (ii) and (iii).
1.3.2 Estimate Basis

Using information from a sample to estimate characteristics of a large population is a standard statistical approach.

For example, suppose that we are interested in knowing the population mean $\mu$ of some random variable $Y$. A natural estimate is the sample mean $\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Unbiased Estimate

If we use the sample mean $\hat{\mu}$ to estimate $\mu$, this estimate is unbiased: on average over many data sets, we expect $\hat{\mu}$ to equal $\mu$.

How far off will a single estimate be?

We have established that the average of $\hat{\mu}$'s over many data sets will be very close to $\mu$, but a single estimate $\hat{\mu}$ may substantially under- or overestimate it. How far off will it be? We answer this question via the standard error of $\hat{\mu}$, written $\mathrm{SE}(\hat{\mu})$:

$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$

where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$ (assuming the $n$ observations are uncorrelated).

Roughly speaking, the standard error tells us the average amount that this estimate $\hat{\mu}$ differs from the actual value $\mu$.
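A small simulation illustrates $\mathrm{SE}(\hat{\mu}) = \sigma/\sqrt{n}$; the population mean of 10 and all other numbers are assumptions chosen for illustration:

```python
# Draw many samples of size n from a population with known sigma, and
# compare the empirical spread of the sample means to sigma / sqrt(n).
import math
import random

random.seed(0)
mu, sigma = 10.0, 2.0      # assumed population parameters
n, n_datasets = 50, 5000   # sample size, number of simulated data sets

means = []
for _ in range(n_datasets):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

# Root-mean-square deviation of mu_hat around the true mu
empirical_se = math.sqrt(sum((m - mu) ** 2 for m in means) / n_datasets)
theoretical_se = sigma / math.sqrt(n)   # SE(mu_hat) from the formula
```

With 5000 simulated data sets, the empirical spread lands within a few percent of the theoretical value.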
1.3.3 From $\hat{\mu}$ to $\hat{\beta}_0$ and $\hat{\beta}_1$

Population Regression Line and Least Squares Line

The model given by $Y = \beta_0 + \beta_1 X + \epsilon$ defines the population regression line, which is the best linear approximation to the true relationship between $X$ and $Y$.

The least squares regression coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ characterize the least squares line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.

Analogy

| Mean Estimation | Linear Regression |
|---|---|
| population | population regression line |
| sample | least squares line |
| population mean $\mu$ | population regression line coefficients $\beta_0$, $\beta_1$ |
| sample mean $\hat{\mu}$ | least squares regression coefficient estimates $\hat{\beta}_0$, $\hat{\beta}_1$ |
1.3.4 Accuracy Measurements for $\hat{\beta}_0$ and $\hat{\beta}_1$

They are unbiased

The property of unbiasedness also holds for the least squares coefficient estimates: if we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, our estimates won't be exactly right, but averaged over a huge number of data sets they would be spot on.

Their Standard Errors

In a similar vein, we can measure how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to the true values via their standard errors:

$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right], \quad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

where $\sigma^2 = \mathrm{Var}(\epsilon)$.

Notes:

- In general, $\sigma^2$ is not known, but can be estimated from the data. This estimate is known as the residual standard error, and is given by the formula $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$.
- Strictly speaking, when $\sigma^2$ is estimated from the data, we should write $\widehat{\mathrm{SE}}(\hat{\beta}_1)$ to indicate that an estimate has been made, but for simplicity of notation we will drop this extra "hat".
- For these formulas to be strictly valid, we need to assume that the errors $\epsilon_i$ for each observation are uncorrelated with common variance $\sigma^2$. This is often not exactly true in practice, but the formula still turns out to be a good approximation.
- When the $x_i$'s are more spread out, $\mathrm{SE}(\hat{\beta}_1)$ is smaller; intuitively, we have more leverage to estimate a slope when this is the case. (I take this to mean: the more spread out the $x_i$'s are, the more confident we can be in the accuracy of the estimated slope.)
- When $\bar{x} = 0$, we have $\hat{\beta}_0 = \bar{y}$ and $\mathrm{SE}(\hat{\beta}_0) = \mathrm{SE}(\hat{\mu})$.
Their 95% CI

By (roughly) the Central Limit Theorem, the approximate 95% confidence intervals for $\beta_1$ and $\beta_0$ take the form

$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1), \quad \hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$

A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.
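A sketch of the interval $\hat{\beta}_1 \pm 2\,\mathrm{SE}(\hat{\beta}_1)$, computed from scratch with NumPy; the true slope of 2.0 and all other data-generating numbers are assumptions for illustration:

```python
# Approximate 95% CI for the slope in simple linear regression.
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)   # assumed: true slope = 2.0

x_bar = x.mean()
sxx = ((x - x_bar) ** 2).sum()
beta1_hat = ((x - x_bar) * (y - y.mean())).sum() / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# RSE estimates sigma; plug it into the SE(beta1_hat) formula above
rss = ((y - (beta0_hat + beta1_hat * x)) ** 2).sum()
rse = np.sqrt(rss / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

ci = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
```

With $n = 100$ points the interval is quite narrow, and across repeated simulations it should cover the true slope about 95% of the time.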
Hypothesis Tests on the Coefficients (t-statistic and t-test)

The most common hypothesis test involves testing the null hypothesis

$H_0$: There is no relationship between $X$ and $Y$

versus the alternative hypothesis

$H_a$: There is some relationship between $X$ and $Y$.

Mathematically, this corresponds to testing $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.

To test the null hypothesis, we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero.

How far is far enough? This depends on the accuracy of $\hat{\beta}_1$, i.e. on $\mathrm{SE}(\hat{\beta}_1)$.

In practice, we compute a t-statistic, given by

$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$

which measures the number of standard deviations that $\hat{\beta}_1$ is away from 0.

When $H_0$ is true, i.e. $\beta_1 = 0$, we expect $t$ to be close to 0; the probability of observing a value of $|t|$ at least as large as the one actually observed is the p-value.

Side note: under $H_0$, this t-statistic follows a t-distribution with $n - 2$ degrees of freedom. See the definition of the t-statistic:

> … (n − k) degrees of freedom, where n is the number of observations, and k is the number of regressors (including the intercept).

We reject the null hypothesis — that is, we declare a relationship to exist between $X$ and $Y$ — if the p-value is small enough, e.g. below 0.05 or 0.01.
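The t-statistic and its p-value can be sketched as follows, using SciPy's t-distribution; the synthetic data (true slope 0.8, noise level 1.0) are assumptions for illustration:

```python
# t-test for H0: beta_1 = 0 in simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(0, 5, n)
y = 3.0 + 0.8 * x + rng.normal(0, 1.0, n)   # assumed: true beta_1 = 0.8

x_bar = x.mean()
sxx = ((x - x_bar) ** 2).sum()
beta1_hat = ((x - x_bar) * (y - y.mean())).sum() / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

rss = ((y - (beta0_hat + beta1_hat * x)) ** 2).sum()
rse = np.sqrt(rss / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

t_stat = (beta1_hat - 0) / se_beta1
# two-sided p-value from the t-distribution with n - 2 df
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

Since the data were generated with a genuine slope, the p-value comes out far below 0.01, so $H_0$ is rejected.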
1.4 Assessing the Accuracy of the Model

As in 1.3 Assessing the Accuracy of the Coefficient Estimates, what is discussed here is still the relationship between level (ii) and level (iii); the true relationship of level (i) is not yet involved.

The slight difference is that 1.3 discussed the accuracy of the individual coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, while this section assesses how well the model fits as a whole.

OK, now on to the main content.

Once we have rejected the null hypothesis in favor of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.
1.4.1 RSE: Residual Standard Error

Recall from the model $Y = \beta_0 + \beta_1 X + \epsilon$ that associated with each observation is an error term $\epsilon$. Due to the presence of these error terms, even if we knew the true regression line, we would not be able to perfectly predict $Y$ from $X$.

The RSE is an estimate of the standard deviation of $\epsilon$:

$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

This means that even if the model were correct and the true values of the unknown coefficients $\beta_0$ and $\beta_1$ were known exactly, any prediction of $Y$ on the basis of $X$ would still be off by about $\mathrm{RSE}$ units on average.

The RSE is considered a measure of the lack of fit of the model to the data: if the predictions are very close to the true outcome values, the RSE will be small; if the predictions are far off for some observations, the RSE may be quite large.
1.4.2 $R^2$ Statistic

The RSE provides an absolute measure of lack of fit of the model, but it is measured in the units of $Y$, so it is not always clear what constitutes a good RSE.

The $R^2$ statistic provides an alternative measure of fit. It takes the form of a proportion (the proportion of variance explained), and so it always takes on a value between 0 and 1, independent of the scale of $Y$:

$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$

where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares.

Note the acronyms ([ˈækrənɪm]):

- RSS: Residual Sum of Squares
- RSE: Residual Standard Error
- TSS: Total Sum of Squares

Notes:

- TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed.
- In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.
- Hence, TSS − RSS measures the amount of variability in the response that is explained by performing the regression, and $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.
- An $R^2$ near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error $\sigma^2$ is high, or both.

Though a proportion, it can still be challenging to determine what is a good $R^2$ value; the answer depends on the application.

Only in the simple linear regression setting does $R^2 = r^2$ hold, where $r = \mathrm{Cor}(X, Y)$ is the sample correlation between $X$ and $Y$.
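The identity $R^2 = r^2$ is easy to verify numerically. This sketch uses synthetic data (all data-generating numbers are illustrative):

```python
# Check that in simple linear regression, R^2 equals the squared
# correlation between X and Y.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 2.0 - 1.5 * x + rng.normal(0, 0.5, 200)

x_bar, y_bar = x.mean(), y.mean()
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
beta0 = y_bar - beta1 * x_bar
y_hat = beta0 + beta1 * x

rss = ((y - y_hat) ** 2).sum()
tss = ((y - y_bar) ** 2).sum()
r_squared = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]   # sample correlation Cor(X, Y)
# r_squared and r**2 agree up to floating-point error
```

The agreement is exact (up to floating point), regardless of the data, because the identity is algebraic rather than statistical.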
2. Multiple Linear Regression

2.1 Model

The multiple linear regression model takes the form

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$

We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.

2.2 Estimating the Coefficients

The regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are unknown and must be estimated.

The parameters are estimated using the same least squares approach: we choose $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ to minimize

$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip}\right)^2$

The values $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ that minimize RSS are the multiple least squares regression coefficient estimates; their closed forms are most easily expressed using matrix algebra.
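A minimal sketch of the matrix-algebra view, using a design matrix and `numpy.linalg.lstsq`; the data and coefficient values are made up for illustration:

```python
# Multiple least squares: minimize ||y - X_design @ beta||^2.
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])   # assumed; intercept first
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, n)

# Prepend a column of ones so the intercept is estimated too
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
```

With low noise and 200 observations, `beta_hat` recovers the assumed coefficients closely.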
2.3 Question 1: Is There a Relationship Between the Response and Predictors? Or, among $\beta_1, \ldots, \beta_p$, is there at least one $\beta_j \neq 0$? (F-statistic and F-test)

We test the null hypothesis

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

versus the alternative

$H_a$: at least one $\beta_j$ is non-zero.

This hypothesis test is performed by computing the F-statistic:

$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$

If the linear model assumptions are correct, we would have $E\left[\mathrm{RSS}/(n - p - 1)\right] = \sigma^2$.

If $H_0$ is true, we also have $E\left[(\mathrm{TSS} - \mathrm{RSS})/p\right] = \sigma^2$.

Hence, when there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if $H_a$ is true, then $E\left[(\mathrm{TSS} - \mathrm{RSS})/p\right] > \sigma^2$, so we expect $F$ to be greater than 1.

However, what if the F-statistic had been closer to 1? How large does the F-statistic need to be before we can reject $H_0$?

When $n$ is large, an F-statistic that is just a little larger than 1 might still provide evidence against $H_0$. In contrast, a larger F-statistic is needed to reject $H_0$ if $n$ is small.

For any given value of $n$ and $p$, the p-value associated with the F-statistic can be computed (using the $F_{p,\,n-p-1}$ distribution), and based on this p-value we can determine whether or not to reject $H_0$.

Just as with the p-value of a t-test:

- A p-value close to 0 tends to reject $H_0$, i.e. at least one $\beta_j \neq 0$, i.e. there is a relationship between the response and predictors.
- A large p-value tends to accept (fail to reject) $H_0$, i.e. all $\beta_j = 0$, i.e. there is no relationship between the response and predictors.

If the p-value from the F-test is essentially zero, we have strong evidence that at least one of the predictors is related to the response.

Page 77 makes an important point: in the multiple linear regression setting, do not use the t-statistic and p-value of each individual predictor as a substitute for the F-statistic and its p-value. When $p$ is large, some individual p-values will be small purely by chance, whereas the F-statistic adjusts for the number of predictors.
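The F-statistic can be sketched numerically. On synthetic data with a genuine signal (all data-generating numbers are illustrative), $F$ comes out far larger than 1:

```python
# F-statistic for H0: beta_1 = ... = beta_p = 0 in multiple regression.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

rss = ((y - y_hat) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
# F = ((TSS - RSS)/p) / (RSS/(n - p - 1))
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
```

Refitting with `y` generated from noise alone (no signal) would instead give an F-statistic close to 1.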
2.4 Question 2: How to Decide on Important Variables? Or, do all the predictors help to explain $Y$, or is only a subset of the predictors useful?

As discussed in the previous section, the first step in a multiple regression analysis is to compute the F-statistic and to examine the associated p-value. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder: which ones?

Page 78 again points out that individual p-values can be misleading in this setting, especially when $p$ is large.

The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.

Selection methods:

- Exhaustive search: try all $2^p$ possible combinations of predictors one by one. This is infeasible unless $p$ is very small.
- Stepwise approaches, which improve the fit step by step (e.g. by "the lower the RSS, the better") and stop once some criterion is met (e.g. the RSS drops below some value, or at most a fixed number of improvement steps have been made):
  - Forward selection: We begin with the null model — a model that contains an intercept only but no predictors. Then we try $p$ simple linear regressions and pick the one with the lowest RSS. Then we try two-variable models, and so on.
  - Backward selection: We start with all variables in the model, and remove the variable with the largest p-value — that is, the variable that is the least statistically significant. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
  - Mixed selection: We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. If at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Notes:

- When $p > n$, backward selection cannot be used, while forward selection remains a viable approach (forward selection places no such requirement on $n$ and $p$).
- Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.

Other statistics that can be used to judge the quality of a model include:

- Mallow's $C_p$
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Adjusted $R^2$
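Forward selection with a "lowest RSS" step criterion can be sketched as follows; the function names, the fixed-size stopping rule, and the synthetic data are all illustrative assumptions, not the book's code:

```python
# A greedy forward-selection sketch: at each step, add the variable whose
# inclusion gives the lowest RSS; stop after max_vars variables.
import numpy as np

def rss_of_fit(cols, y):
    """RSS of a least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return ((y - design @ beta) ** 2).sum()

def forward_selection(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # Greedy step: pick the candidate giving the lowest RSS right now
        best = min(remaining,
                   key=lambda j: rss_of_fit([X[:, k] for k in selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(0, 0.5, n)  # only X2, X0 matter

order = forward_selection(X, y, max_vars=2)
```

On this data the strongest predictor (column 2) is picked first, then column 0; the three noise columns are never selected.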
2.5 Question 3: How to Measure the Model Fit? Or, how well does the model fit the data?

Recall that in the simple linear regression setting, $R^2$ is the square of the correlation between the response and the predictor. In the multiple linear regression setting, it turns out that $R^2 = \mathrm{Cor}(Y, \hat{Y})^2$, the square of the correlation between the response and the fitted values.

In addition to looking at the RSE and $R^2$ statistics, it can also be useful to plot the data. For the Advertising data, such a plot reveals a clear pattern in the residuals.

In particular, the linear model seems to overestimate `sales` for instances in which most of the advertising money was spent exclusively on either `TV` or `radio`. It underestimates `sales` for instances where the budget was split between the two media. This pronounced non-linear pattern cannot be modeled accurately using linear regression. It suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium.
2.6 Question 4: How accurate is our prediction?

2.6.1 CI for $\hat{Y}$

The coefficient estimates $\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_p$ are only estimates, so the least squares plane

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$

is only an estimate for the true population regression plane

$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

which is part of the true relationship $Y = f(X) + \epsilon$.

The inaccuracy in the coefficient estimates is related to the reducible error, and we can compute a confidence interval in order to determine how close $\hat{Y}$ will be to $f(X)$.
2.6.2 Model Bias

In practice, assuming a linear model for $f(X)$ is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias.

Here we will not discuss model bias; we operate as if the linear model were correct.

2.6.3 Prediction Intervals

Even if we knew the true values of the parameters, the response value could not be predicted perfectly, because of the random error $\epsilon$ in the model. This is the irreducible error.

How much will $Y$ vary from $\hat{Y}$? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for $f(X)$ (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

We interpret a 95% prediction interval to mean that 95% of intervals constructed in this way will contain the true value of $Y$ for the given predictor values.
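The contrast between a confidence interval for $f(x_0)$ and a prediction interval for $Y$ at the same $x_0$ can be sketched with the standard simple-regression formulas; the data are synthetic, and the "$\pm 2\,\mathrm{SE}$" cutoff is the same rough approximation used earlier:

```python
# CI vs. PI at a new point x0 in simple linear regression: the PI's SE
# adds the irreducible error term (the extra "1 +" under the square root),
# so the PI is always wider than the CI.
import numpy as np

rng = np.random.default_rng(9)
n = 80
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

x_bar = x.mean()
sxx = ((x - x_bar) ** 2).sum()
beta1 = ((x - x_bar) * (y - y.mean())).sum() / sxx
beta0 = y.mean() - beta1 * x_bar
rse = np.sqrt(((y - (beta0 + beta1 * x)) ** 2).sum() / (n - 2))

x0 = 5.0
y0_hat = beta0 + beta1 * x0
se_fit = rse * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)       # reducible only
se_pred = rse * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)  # + irreducible

ci = (y0_hat - 2 * se_fit, y0_hat + 2 * se_fit)   # CI for f(x0)
pi = (y0_hat - 2 * se_pred, y0_hat + 2 * se_pred) # PI for a new Y at x0
```

As $n$ grows, `se_fit` shrinks toward 0, but `se_pred` never falls below the RSE: the irreducible error puts a floor under prediction accuracy.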