
Summarized from Chapter 3 of An Introduction to Statistical Learning.


0. Overview

Linear Regression is a supervised learning approach, especially useful for predicting a quantitative response.

It serves to answer these questions:

  1. Is there a relationship between X and Y?
  2. How strong is the relationship?
  3. Which xi's contribute to Y?
  4. How accurately can we estimate the effect of each xi on Y?
  5. How accurately can we predict future Y?
  6. Is the relationship linear?
    • A very good question; it touches the essence of regression problems.
  7. Is there synergy among xi's?
    • Perhaps spending $50,000 on X_a and $50,000 on X_b results in more sales Y than allocating $100,000 to either individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

1. Simple Linear Regression

1.1 Model

The simplicity of the method lies in the fact that it predicts a quantitative response Y on the basis of a single predictor X. It assumes that there is approximately a linear relationship between X and Y, as:

$$Y \approx \beta_0 + \beta_1 X \tag{1.1}$$

We can describe this relationship as “regressing Y onto X”. Together, β0 and β1 are known as the model coefficients or parameters. Once we have used our training data to produce estimates β^0 and β^1 for the model coefficients, we can predict

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{1.2}$$

where y^ indicates a prediction of Y on the basis of X=x. Here we use a hat symbol, ˆ , to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.

We also assume that the true relationship between X and Y takes the form Y=f(X)+ϵ for some unknown function f, where ϵ is a random error term with mean zero. If f is to be approximated by a linear function, then we can write this relationship as

$$Y = \beta_0 + \beta_1 X + \epsilon \tag{1.3}$$

The error term ϵ is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error. We typically assume that the error term is independent of X.

This error term ϵ feels a bit like "measurement error" in physics. Making an estimate is like reading off a value between the scale marks of an instrument.

1.2 Estimating the Coefficients

We want to find an intercept β^0 and a slope β^1 such that the resulting line is as close as possible to the n training data points.

There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion.

Before introducing the least squares approach, let's first meet the residual.

1.2.1 Residual and RSS

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for Y based on the i-th value of X.

Then $e_i = y_i - \hat{y}_i$ represents the i-th residual — this is the difference between the i-th observed response value and the i-th response value that is predicted by our linear model.

We define the residual sum of squares (RSS) as

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2 \tag{1.4}$$

1.2.2 β^0 and β^1

The least squares approach chooses β^0 and β^1 to minimize the RSS, which are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{1.5}$$

where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means.

In other words, (1.5) defines the least squares coefficient estimates for simple linear regression.
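
As a quick illustration, here is a minimal NumPy sketch on made-up data (the names x, y, beta0_hat, beta1_hat are purely illustrative) that computes the least squares estimates of (1.5) directly from the closed-form formulas:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data from a "true" line y = 2 + 3x + noise (made-up numbers)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates from eq. (1.5)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should land close to the true values (2, 3)
```

np.polyfit(x, y, 1) or scipy.stats.linregress(x, y) should return essentially the same slope and intercept, which makes a convenient cross-check.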

1.3 Assessing the Accuracy of the Coefficient Estimates

1.3.1 True Relationship

Note that there are three levels of relationship here:

  • (i) the unknown true relationship, Y=f(X)+ϵ
    • (ii) we assume the relationship is linear, Y=β0+β1X+ϵ
      • (iii) we estimate the coefficients of this assumed linear relationship, based on the training data, Y^=β^0+β^1X

Alternatively, you can view both (i)~(ii) and (ii)~(iii) as a "true relationship ~ estimate" relationship.

Assessing the Accuracy of the Coefficient Estimates is therefore a question about level (ii) versus level (iii).

1.3.2 Estimate Basis

Using information from a sample to estimate characteristics of a large population is a standard statistical approach.

For example, suppose that we are interested in knowing the population mean μ of some random variable Y. Unfortunately, μ is unknown, but we do have access to n observations of Y, which we can write as y_1, …, y_n, and which we can use to estimate μ. A reasonable estimate is μ^ = y¯, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the sample mean. The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean.

Unbiased Estimate

If we use the sample mean μ^ to estimate μ, this estimate is unbiased, in the sense that on average, we expect μ^ to equal μ. It means that on the basis of one particular set of observations y_1, …, y_n, μ^ might overestimate μ, and on the basis of another set of observations, μ^ might underestimate μ. But if we could average a huge number of estimates of μ obtained from a huge number of sets of observations, then this average would exactly equal μ. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter.

How far off will a single estimate μ^ be?

We have established that the average of μ^’s over many data sets will be very close to μ, but that a single estimate μ^ may be a substantial underestimate or overestimate of μ. How far off will μ^ be? In general, we answer this question by computing the standard error of μ^, as

$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \tag{1.6}$$

where σ is the standard deviation of the realizations y_i of Y. (In probability and statistics, an observed value is also known as a realization, so you can simply read "realizations" as "a data set".)

Roughly speaking, the standard error tells us the average amount that this estimate μ^ differs from the actual value of μ (by the Central Limit Theorem, $\hat{\mu} \sim N(\mu, \frac{\sigma^2}{n})$). It also tells us how this deviation shrinks with n — the more observations we have, the smaller the standard error of μ^.
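
A small simulation (hypothetical numbers throughout) makes (1.6) concrete: draw many samples of size n from a population with known σ, and compare the empirical spread of μ^ with σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0   # made-up population mean and standard deviation

for n in (10, 100, 1000):
    # 10,000 independent data sets, each of size n
    samples = rng.normal(mu, sigma, size=(10_000, n))
    mu_hats = samples.mean(axis=1)           # one mu_hat per data set
    empirical_se = mu_hats.std()             # spread of mu_hat across data sets
    theoretical_se = sigma / np.sqrt(n)      # eq. (1.6)
    print(n, round(empirical_se, 4), round(theoretical_se, 4))
```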

1.3.3 From μ^ to β^0 and β^1

μ^ is the sample mean, and at the same time it is the estimate for the population mean. By analogy, β^0 and β^1 are estimates for the model coefficients, so they could also be called "sample model coefficients". In this sense, "sample" and "training data set" are concepts at the same level.

Population Regression Line and Least Squares Line

The model given by (1.3) defines the population regression line, which is the best linear approximation to the true relationship between X and Y.

The least squares regression coefficient estimates (1.5) characterize the least squares line.

Analogy

| Estimating a population mean | Fitting a linear regression |
| --- | --- |
| population | population regression line |
| sample | least squares line |
| population mean μ | population regression line coefficients β0 and β1 |
| sample mean μ^ | least squares regression coefficient estimates β^0 and β^1 |

1.3.4 Accuracy Measurements for β^0 and β^1

They are unbiased

The property of unbiasedness holds for the least squares coefficient estimates given by (1.5) as well: if we estimate β0 and β1 on the basis of a particular data set, then our estimates won’t be exactly equal to β0 and β1. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on.

Their Standard Error

In a similar vein, we can measure how close β^0 and β^1 are to the true values β0 and β1, by computing the standard errors

$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{1.7}$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$.

Notes:

  • In general, σ2 is not known, but can be estimated from the data. This estimate is known as the residual standard error, and is given by the formula $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$. (A numerical sketch follows these notes.)
    • Strictly speaking, when σ2 is estimated from the data, we should write SE^(β^1) to indicate that an estimate has been made, but for simplicity of notation we will drop this extra "hat".
  • For (1.7) to be strictly valid, we need to assume that the errors ϵi for each observation are uncorrelated with common variance σ2. This is clearly not true for a typical least squares fit, but the formula still turns out to be a good approximation.
  • When the xi's are more spread out, SE(β^1) is smaller; intuitively we have more leverage to estimate a slope when this is the case. (I take this to mean: the more spread out the xi's are, the more confident we can be in the accuracy of the estimated slope.)
  • When x¯=0, SE(β^0)=SE(μ^) and β^0=y¯.
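
As referenced in the RSE note above, here is a minimal sketch on simulated data (all names illustrative) that estimates σ by the RSE and plugs it into (1.7):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)   # made-up data, true sigma = 1

n = len(x)
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)

# Least squares fit, eq. (1.5)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# Estimate sigma by the residual standard error, RSE = sqrt(RSS / (n - 2))
residuals = y - (beta0_hat + beta1_hat * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Plug RSE^2 in for sigma^2 in eq. (1.7)
se_beta1 = np.sqrt(rse ** 2 / sxx)
se_beta0 = np.sqrt(rse ** 2 * (1 / n + x_bar ** 2 / sxx))

print(se_beta0, se_beta1)
```

The approximate 95% confidence intervals of the next subsection are then simply beta0_hat ± 2 * se_beta0 and beta1_hat ± 2 * se_beta1.
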
Their 95% CI

According to the Central Limit Theorem, the 95% confidence intervals for β0 and β1 approximately take the form

$$\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$$

A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.

Hypothesis Tests on the Coefficients (t-statistic and t-test)

The most common hypothesis test involves testing the null hypothesis of

H0 : There is no relationship between X and Y

versus the alternative hypothesis

Ha : There is some relationship between X and Y

Mathematically, this corresponds to testing

H0 : β1=0

versus

Ha : β1 ≠ 0

To test the null hypothesis, we need to determine whether β^1, our estimate for β1, is sufficiently far from 0 that we can be confident that β1 is non-zero.

How far is far enough? This depends on the accuracy of β^1 — that is, it depends on SE(β^1). If SE(β^1) is small, then even relatively small values of β^1 may provide strong evidence that β1 ≠ 0. In contrast, if SE(β^1) is large, then β^1 must be large in absolute value in order for us to reject the null hypothesis.

In practice, we compute a t-statistic, given by

$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)} \tag{1.8}$$

which measures the number of standard deviations that β^1 is away from 0.

When β1 = 0, we can treat β1 as a "population mean" μ = 0, and this t-statistic then has exactly the form (x¯ − μ) / SE(x¯). So it is not merely a generic t-statistic but one that follows a t-distribution, and our hypothesis test becomes a proper t-test.

As an aside: this t-statistic follows a t-distribution with n − 2 degrees of freedom (df). See t-statistic - Definition:

… (n − k) degrees of freedom, where n is the number of observations, and k is the number of regressors (including the intercept).

We reject the null hypothesis — that is, we declare a relationship to exist between X and Y — if the p-value, which indicates the probability of seeing such a t-statistic if H0 is true, is small enough. Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%. If the p-value is below 5% or 1%, we can conclude that β1 ≠ 0.
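
A minimal sketch of this test on simulated data (scipy is used only for the t-distribution; all names are illustrative), turning β^1 and SE(β^1) into a t-statistic and a two-sided p-value with n − 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x + rng.normal(0, 2, size=n)   # made-up data with a modest true slope

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

residuals = y - (beta0_hat + beta1_hat * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
se_beta1 = np.sqrt(rse ** 2 / sxx)

t_stat = (beta1_hat - 0) / se_beta1                   # eq. (1.8)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)       # two-sided p-value, df = n - 2

print(t_stat, p_value)   # a small p-value argues for rejecting H0: beta1 = 0
```

scipy.stats.linregress(x, y) reports the same slope, standard error, and t-based p-value, which makes a handy cross-check.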

1.4 Assessing the Accuracy of the Model

As in 1.3 Assessing the Accuracy of the Coefficient Estimates, the discussion here is still about the relationship between level (ii) and level (iii); it does not yet involve the true relationship of level (i).

One slight difference: 1.3 Assessing the Accuracy of the Coefficient Estimates discussed the accuracy of the individual estimates β^0 and β^1, whereas here we discuss the accuracy of the model as a whole (against the sample).

Now, on to the main content.

Once we have rejected the null hypothesis β1=0 in favor of the alternative hypothesis β10, it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the R2 statistic.

1.4.1 RSE: Residual Standard Error

Recall from the model (1.3) that associated with each observation is an error term ϵi. Due to the presence of the error terms, even if we knew the true regression line (i.e. even if β0 and β1 were known), we would not be able to perfectly predict y from X.

The RSE is an estimate of the standard deviation of ϵ (already mentioned in Their Standard Error). Roughly speaking, it is the average amount that the response will deviate from the true regression line.

$$\mathrm{RSE} = \sqrt{\frac{1}{n-2} \mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{1.9}$$

This means that, even if the model were correct and the true values of the unknown coefficients β0 and β1 were known exactly, any prediction of Y on the basis of X would still be off by about RSE units on average. The percentage error would be RSE / mean(Y). Whether or not this is an acceptable prediction error depends on the problem context.

The RSE is considered a measure of the lack of fit of the model (1.3) to data, i.e. if underfitting, RSE may be quite large.

1.4.2 R2 Statistic

The RSE provides an absolute measure of lack of fit of the model (1.3) to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE.

The R2 statistic provides an alternative measure of fit. It takes the form of a proportion — the proportion of variance explained by the model — and so it always takes on a value between 0 and 1, and is independent of the scale of Y (somewhat like the "95% variance retained" idea in PCA).

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \tag{1.10}$$

where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.

Note the acronyms:

  • RSS: Residual Sum of Squares
  • RSE: Residual Standard Error
  • TSS: Total Sum of Squares

Notes:

  • TSS measures the total variance in the response y, and can be thought of as the amount of variability inherent in the response y before the regression is performed.
  • In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.
  • Hence, TSS − RSS measures the amount of variability in the response that is explained by performing the regression, and R2 measures the proportion of variability in Y that can be explained using X.
  • An R2 near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error Var(ϵ) is high, or both.

Though a proportion, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (1.3) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic.

Only in the simple linear regression setting does R2 = Cor(X, Y)^2.
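
A short sketch on simulated data (all names illustrative) that computes RSE (1.9) and R2 (1.10), and checks that R2 matches Cor(X, Y)^2 in the simple setting:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1 + 2 * x + rng.normal(0, 3, size=n)

x_bar = x.mean()
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0_hat = y.mean() - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - 2))   # eq. (1.9), expressed in the units of Y
r2 = 1 - rss / tss             # eq. (1.10), a unit-free proportion

print(rse, rse / y.mean())                 # absolute error and rough percentage error
print(r2, np.corrcoef(x, y)[0, 1] ** 2)    # the two should agree in simple regression
```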

2. Multiple Linear Regression

2.1 Model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{2.1}$$

We interpret βj as the average effect on Y of a one-unit increase in the j-th predictor, Xj, holding all other predictors fixed.

2.2 Estimating the Coefficients

The regression coefficients β0, β1, …, βp in (2.1) are unknown, and must be estimated. Given estimates β^0, β^1, …, β^p, we can make predictions using the formula

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p \tag{2.2}$$

The parameters are estimated using the same least squares approach. We choose β^0, β^1, …, β^p to minimize the sum of squared residuals

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip} \right)^2 \tag{2.3}$$

The values β^0, β^1, …, β^p are known as the multiple least squares regression coefficient estimates.
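
A minimal sketch of the multiple least squares fit of (2.3) using np.linalg.lstsq on a design matrix with an intercept column (data and names made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 1 + X @ true_beta + rng.normal(0, 1, size=n)

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones(n), X])

# Coefficients minimizing the RSS in eq. (2.3)
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)   # approximately [1, 2, -1, 0.5]
```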

2.3 Question 1: Is There a Relationship Between the Response and Predictors? Or, among β^1, …, β^p, is there at least one β^i ≠ 0? (F-statistic and F-test)

We test the null hypothesis,

H0 : β1 = β2 = ⋯ = βp = 0

versus the alternative

Ha : at least one βj is non-zero

We construct an F-statistic to perform the hypothesis test:

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)} \tag{2.4}$$

If the linear model assumptions are correct, we would have

$$E\left[ \mathrm{RSS}/(n - p - 1) \right] = \sigma^2 \tag{2.5}$$

If H0 is true, we would have

$$E\left[ (\mathrm{TSS} - \mathrm{RSS})/p \right] = \sigma^2 \tag{2.6}$$

Hence, when there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if Ha is true, then E[(TSS − RSS)/p] > σ2, so we expect F to be greater than 1.

However, what if the F-statistic had been closer to 1? How large does the F-statistic need to be before we can reject H0 and conclude that there is a relationship? It turns out that the answer depends on the values of n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.

When H0 is true and the errors ϵi have a normal distribution, the F-statistic follows an F-distribution. Even if the errors are not normally-distributed, the F-statistic approximately follows an F-distribution provided that the sample size n is large.

For any given value of n and p, we can compute the p-value associated with the F-statistic using F-distribution. Based on this p-value, we can determine whether or not to reject H0.
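
A sketch of this on the same kind of simulated data (scipy supplies the F-distribution; all names illustrative), computing the F-statistic (2.4) and its p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))      # eq. (2.4)
p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)    # P(F >= f_stat) under H0

print(f_stat, p_value)   # a large F and a tiny p-value argue against H0
```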

Same as the p-value in a t-test:

  • A p-value close to 0 means we tend to reject H0, i.e. at least one βi ≠ 0, i.e. there is a relationship between the response and the predictors.
  • A large p-value means we tend to accept (fail to reject) H0, i.e. all βi = 0, i.e. there is no relationship between the response and the predictors.

If p>n then there are more coefficients βi to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used.

P77 makes an important point: in the multiple linear regression setting, do not use the t-statistic and p-value of each individual predictor as a substitute for the F-statistic and its p-value.

2.4 Question 2: How to Decide on Important Variables? Or, do all the predictors help to explain Y, or is only a subset of the predictors useful?

As discussed in the previous section, the first step in a multiple regression analysis is to compute the F-statistic and to examine the associated p-value. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder: which ones?

P78 points out again that individual p-values are unreliable when p is large, and must be used with caution.

The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.

Selection methods:

  • Try all 2^p combinations of predictors, one by one.
  • Use "the lower the RSS, the better" as the criterion and improve step by step, stopping once some target is reached (e.g. RSS falls below some value, or at most n improvement steps have been made):
    • Forward selection: We begin with the null model — a model that contains an intercept only but no predictors. Then try p simple linear regressions and pick the one with the lowest RSS. Then try two-variable models… (see the sketch after the notes below)
    • Backward selection: We start with all variables in the model, and remove the variable with the largest p-value — that is, the variable that is the least statistically significant. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
    • Mixed selection: We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. If at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.

Notes:

  • When p > n, forward selection is a viable approach (forward selection places no particular requirements on p and n).
  • When p > n, backward selection cannot be used.
  • Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
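
As referenced in the forward-selection bullet above, here is a rough, RSS-based forward-selection sketch (purely illustrative: it uses a fixed number of steps as the stopping rule rather than a p-value threshold, and all names and data are made up):

```python
import numpy as np

def rss_of_fit(cols, y):
    """RSS of a least squares fit of y on the given columns plus an intercept."""
    X_design = np.column_stack([np.ones(len(y))] + cols)
    beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return np.sum((y - X_design @ beta_hat) ** 2)

def forward_selection(X, y, max_vars=3):
    """Greedily add the predictor that lowers RSS the most, up to max_vars predictors."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < max_vars:
        scores = []
        for j in remaining:
            cols = [X[:, k] for k in selected + [j]]
            scores.append((rss_of_fit(cols, y), j))
        best_rss, best_j = min(scores)   # lowest RSS wins this round
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Made-up data: only the first two of six predictors actually matter
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=200)
print(forward_selection(X, y))   # expected to pick columns 0 and 1 first
```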

Other statistics that can be used to judge the quality of a model include:

  • Mallow’s Cp
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Adjusted R2

2.5 Question 3: How to Measure the Model Fit? Or, how well does the model fit the data?

Recall that in the simple linear regression setting, R2 = Cor(X, Y)^2. In the multiple linear regression setting, it turns out that R2 = Cor(Y, Y^)^2. In fact, one property of the fitted linear model is that it maximizes this correlation among all possible linear models.

R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable to the least squares equations must allow us to fit the training data (though not necessarily the testing data) more accurately. Thus, the R2 statistic, which is also computed on the training data, must increase. Therefore, if adding a predictor yields only a tiny increase in R2, that is additional evidence that the predictor can be dropped from the model.
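
A quick illustration of this point on simulated data (names hypothetical): adding a pure-noise predictor still nudges the training R2 upward.

```python
import numpy as np

def train_r2(X_design, y):
    """Training R^2 of a least squares fit on the given design matrix."""
    beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(0, 1, size=n)
noise = rng.normal(size=n)               # a predictor completely unrelated to y

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])

print(train_r2(X_small, y), train_r2(X_big, y))   # the second value is never smaller
```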

In addition to looking at the RSE and R2 statistics just discussed, it can be useful to plot the data. The book gives an example in which a synergy effect is observed:

In particular, the linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio. It underestimates sales for instances where the budget was split between the two media. This pronounced non-linear pattern cannot be modeled accurately using linear regression. It suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium.

2.6 Question 4: How accurate is our prediction?

2.6.1 CI for y^

The coefficient estimates β^0, β^1, …, β^p are estimates for β0, β1, …, βp. That is, the least squares plane

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$

is only an estimate for the true population regression plane

$$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

which is part of the true relationship

$$Y = f(X) + \epsilon$$

The inaccuracy in the coefficient estimates is related to the reducible error, and we can compute a confidence interval in order to determine how close y^ will be to f(X). We interpret the 95% CI of y^ to mean that, with 95% probability, the interval will contain the true value of f(X).

2.6.2 Model Bias

In practice assuming a linear model for f(X) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias.

We do not discuss model bias here; we operate as if the linear model were correct.

2.6.3 Prediction Intervals

Even if we knew the true values of the parameters, the response value cannot be predicted perfectly, because of the random error ϵ. We referred to this as the irreducible error.

How much will y vary from y^? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

We interpret the 95% PI of y^ to mean that, with 95% probability, the interval will contain the true value of y.
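
A sketch of the distinction for simple linear regression, using the standard textbook formulas for the standard error of the mean response versus a single new observation at a point x0 (the factor-of-2 approximation is used in place of the exact t quantile; all data and names are made up):

```python
import numpy as np

rng = np.random.default_rng(21)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 2, size=n)

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar
rse = np.sqrt(np.sum((y - (beta0_hat + beta1_hat * x)) ** 2) / (n - 2))

x0 = 5.0                                  # the point where we want a prediction
y0_hat = beta0_hat + beta1_hat * x0

# Standard error of the estimated mean response f(x0): reducible error only
se_mean = rse * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
# Standard error for a single new observation: adds the irreducible error
se_pred = rse * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

print("approx. 95% CI:", (y0_hat - 2 * se_mean, y0_hat + 2 * se_mean))
print("approx. 95% PI:", (y0_hat - 2 * se_pred, y0_hat + 2 * se_pred))   # always wider
```

If you would rather not hand-roll the formulas, statsmodels' OLS results expose both kinds of intervals via get_prediction(...).summary_frame(), as far as I know.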
