ISL: Classification
Summarized from Chapter 4 of An Introduction to Statistical Learning.
The process of predicting qualitative responses is known as classification. Predicting a qualitative response for an observation can also be referred to as classifying that observation. Classification techniques are also known as classifiers.
In this chapter we discuss three of the most widely-used classifiers:
- logistic regression
- linear discriminant analysis
- K-nearest neighbors
1. An Overview of Classification
P128
2. Why Not Linear Regression?
P129
Different codings of the response would produce fundamentally different linear models, which would ultimately lead to different sets of predictions on test observations. Moreover, the numerical differences between the coded response values carry no real meaning.
Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure.
3. Logistic Regression
Rather than modeling the response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.
3.1 The Logistic Model
Note the notation: we generally abbreviate $p(X) = \Pr(Y = 1 \mid X)$.
If we use a linear regression model to represent these probabilities as

$$p(X) = \beta_0 + \beta_1 X$$

the main problem is that the predicted probability may fall outside the range $[0, 1]$. To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Logistic regression uses the logistic function:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

To fit the model we use a method called maximum likelihood, discussed in Section 3.2.

After a bit of manipulation of the logistic function, we find that

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$

The quantity $\frac{p(X)}{1 - p(X)}$ is called the odds, and can take on any value between $0$ and $\infty$:

- odds close to $0$ mean extremely low probability
- odds close to $\infty$ mean extremely high probability

By taking the logarithm of both sides, we arrive at

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in $X$.
Therefore increasing $X$ by one unit changes the log-odds by $\beta_1$, or equivalently multiplies the odds by $e^{\beta_1}$.

- If $\beta_1$ is positive, then increasing $X$ will be associated with increasing $p(X)$.
- If $\beta_1$ is negative, then increasing $X$ will be associated with decreasing $p(X)$.
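As a quick numeric check of these relationships, base R already provides the logistic function and its inverse (plogis() and qlogis()); the coefficient values below are made up purely for illustration.

## plogis() is the logistic function; qlogis() is the logit (its inverse).
## Illustrative coefficients, not taken from the book.
beta0 <- -6; beta1 <- 0.05
x <- 40
p <- plogis(beta0 + beta1 * x)    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
qlogis(p)                         # recovers the log-odds: beta0 + beta1*x = -4
p / (1 - p)                       # the odds, equal to exp(beta0 + beta1*x)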
3.2 Estimating the Regression Coefficients
Although we could use (non-linear) least squares to fit the model, the more general method of maximum likelihood is preferred, since it has better statistical properties.

The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ for each observation corresponds as closely as possible to that observation's actual class.

This intuition can be formalized using a mathematical equation called a likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right)$$

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function.
Maximum likelihood is a very general approach that is used to fit many of the non-linear models. In the linear regression setting, the least squares approach is in fact a special case of maximum likelihood.
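To make the intuition concrete, here is a minimal sketch (on simulated data, not from the book) that maximizes the log-likelihood directly with optim() and compares the result to glm(), which does the same job internally.

## Minimal maximum-likelihood sketch for logistic regression on simulated data.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))   # true beta0 = -0.5, beta1 = 1.2

negloglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))   # negative log-likelihood
}

optim(c(0, 0), negloglik)$par         # ML estimates of (beta0, beta1)
coef(glm(y ~ x, family = binomial))   # should be very close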
We use z-statistics to perform hypothesis tests on the coefficients. Take $\beta_1$ as an example: its z-statistic is $\hat{\beta}_1 / SE(\hat{\beta}_1)$. A large absolute value of the z-statistic, and a correspondingly tiny (virtually zero) p-value, provide evidence to reject the null hypothesis $H_0: \beta_1 = 0$.
3.3 Making Predictions
P134
3.4 Multiple Logistic Regression
By analogy with the extension from simple to multiple linear regression, we can generalize the logistic model to

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

where $X = (X_1, \dots, X_p)$ are $p$ predictors, or equivalently

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

Still we use the maximum likelihood method to estimate $\beta_0, \beta_1, \dots, \beta_p$.
As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors. In general, this phenomenon is known as confounding. See P136 for details; the example and explanation there are quite good.
3.5 Logistic Regression for >2 Response Classes
The two-class logistic regression models discussed in the previous sections have multiple-class extensions, but in practice they tend not to be used all that often. One of the reasons is that the method we discuss in the next section, discriminant analysis, is popular for multiple-class classification. So we just stop here. Simply note that such an approach is possible and is available in R.
4. Linear Discriminant Analysis
Logistic regression directly estimates $\Pr(Y = k \mid X = x)$. LDA instead models the distribution of the predictors $X$ separately within each response class, and then uses Bayes' theorem to flip these around into estimates of $\Pr(Y = k \mid X = x)$.
4.1 Using Bayes' Theorem for Classification
Suppose that we wish to classify an observation into one of $K$ classes, where $K \ge 2$.

Let $\pi_k$ denote the overall or prior probability that a randomly chosen observation comes from the $k$th class.

Let $f_k(x) \equiv \Pr(X = x \mid Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class.

Then Bayes' theorem states that

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

In accordance with our earlier notation, we will use the abbreviation $p_k(X) = \Pr(Y = k \mid X)$; this is the posterior probability that an observation belongs to the $k$th class.

In general, estimating the priors $\pi_k$ is easy if we have a random sample of $Y$s from the population, but estimating the densities $f_k(x)$ tends to be more challenging unless we assume some simple forms for them.
4.2 Linear Discriminant Analysis for p = 1
P139-142. Copying all those formulas over here would break my hand...
In short, in order to estimate $f_k(x)$ we make two assumptions:

- Assume that $f_k(x)$ is normal or Gaussian.
- Let $\sigma_k^2$ be the variance parameter for the $k$th class. Then assume $\sigma_1^2 = \cdots = \sigma_K^2$, i.e. all classes share a common variance $\sigma^2$.

Then we just keep plugging into the formulas, replacing each parameter with its estimate... A hand-rolled sketch of the resulting rule is shown below.
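The following sketch implements one-dimensional LDA by hand on simulated data (not from the book), purely to make "replace each parameter with its estimate" concrete; in practice you would use MASS::lda().

## Hand-rolled one-dimensional LDA on simulated data.
set.seed(1)
x <- c(rnorm(50, mean = -1), rnorm(50, mean = 1))        # two classes, shared variance
y <- rep(c("A", "B"), each = 50)

pi.hat     <- c(table(y)) / length(y)                    # prior estimates pi_k
mu.hat     <- c(tapply(x, y, mean))                      # class mean estimates mu_k
sigma2.hat <- sum((x - mu.hat[y])^2) / (length(x) - 2)   # pooled variance estimate

## discriminant: delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta <- sapply(names(pi.hat), function(k)
  x * mu.hat[k] / sigma2.hat - mu.hat[k]^2 / (2 * sigma2.hat) + log(pi.hat[k]))
pred <- colnames(delta)[max.col(delta)]                  # assign to the largest delta_k
mean(pred == y)                                          # training accuracy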
4.3 Linear Discriminant Analysis for p > 1
We now extend the LDA classifier to the case of multiple predictors. To do this, we will assume that $X = (X_1, X_2, \dots, X_p)$ is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.
The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
P142 first introduces what a multivariate Gaussian distribution is, and then once again keeps plugging into formulas, replacing parameters with estimates...

From P145 on it is the usual True Positive / Sensitivity material, which I won't repeat here.

The end of P145 explains why LDA may sometimes have low sensitivity:
LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers (if the Gaussian model is correct). That is, the Bayes classifier will yield the smallest possible total number of misclassified observations, irrespective of which class the errors come from. That is, some misclassifications will result from incorrectly assigning a customer who does not default to the default class, and others will result from incorrectly assigning a customer who defaults to the non-default class.
In other words, LDA only tries to minimize the total error rate, without caring which class the errors come from.

If we lower the threshold, for example from 0.5 to 0.2 (i.e. assign an observation to the positive class whenever its posterior probability exceeds 0.2), then:

- the number of true positives goes up and the number of false negatives goes down, so sensitivity rises;
- but the number of false positives goes up, so specificity falls (and the overall error rate may even increase).

So this is a trade-off. How can we decide which threshold value is best? Such a decision must be based on domain knowledge.
The ROC curve is a popular graphic for simultaneously displaying the TP and FP rate for all possible thresholds. The name “ROC” is historic, and comes from communications theory. It is an acronym for receiver operating characteristics.
- FP (false positive) rate, i.e. 1 - Specificity, is x-axis of ROC
- TP (true positive) rate, i.e. Sensitivity, is y-axis of ROC
- If you have forgotten these concepts, go review Conditional Probability.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC).
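As a sketch of how to draw an ROC curve and compute the AUC in R, here is one way using the ROCR package (pROC is an alternative); the labels and scores below are simulated placeholders, not the book's data.

## ROC curve and AUC with the ROCR package; simulated labels/scores.
library(ROCR)
set.seed(1)
labels <- rbinom(200, 1, 0.5)                       # true classes (0/1)
scores <- labels + rnorm(200)                       # some classifier's scores
pred <- prediction(scores, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)                                          # ROC: TP rate (y) vs FP rate (x)
performance(pred, measure = "auc")@y.values[[1]]    # area under the curve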
4.4 Quadratic Discriminant Analysis
LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all $K$ classes.
Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction.
However, unlike LDA, QDA assumes that each class has its own covariance matrix.
P149 has a small amount of math.
Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the $K$ classes is clearly untenable.
5. A Comparison of Classification Methods
logistic regression vs LDA
- Both produce linear decision boundaries.
- The only difference between the two approaches lies in the fact that
  - logistic regression performs estimation using maximum likelihood,
  - whereas LDA uses the estimated mean and variance from a normal distribution.
- Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. In practice, which one performs better depends largely on whether LDA's Gaussian assumptions are met.
KNN:
- a completely non-parametric approach
- no assumptions are made about the shape of the decision boundary.
- We can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear.
- On the other hand, KNN does not tell us which predictors are important; we don’t get a table of coefficients out of KNN.
QDA:
- QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches.
- Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than can the linear methods.
- Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.
P153-154 set up six scenarios to test the performance of these methods.
- When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
- When the boundaries are moderately non-linear, QDA may give better results.
- Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior.
- But the level of smoothness for a non-parametric approach must be chosen carefully.
Finally, the chapter mentions that adding transformations of the predictors is also an option, though the resulting performance has to be re-evaluated. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different. This device allows us to move somewhere between an LDA and a QDA model. A small sketch of this idea is given below.
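A tiny sketch of that idea on simulated data (not from the book): passing quadratic terms and a cross-product to lda() through the formula interface moves LDA toward QDA.

## Augmenting LDA with quadratic terms and a cross-product; the simulated
## data has a genuinely non-linear class boundary.
library(MASS)
set.seed(1)
d <- data.frame(X1 = rnorm(200), X2 = rnorm(200))
d$y <- factor(ifelse(d$X1^2 + d$X2^2 + rnorm(200, sd = 0.5) > 2, "A", "B"))
lda.quad <- lda(y ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2, data = d)
mean(predict(lda.quad)$class == d$y)   # training accuracy of the augmented LDA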
6. Lab: Logistic Regression, LDA, QDA, and KNN
6.2 Logistic Regression
> library(ISLR)
> names(Smarket)
[1] "Year" "Lag1" "Lag2" "Lag3" "Lag4"
[6] "Lag5" "Volume " "Today" " Direction "
> dim(Smarket)
[1] 1250 9
> summary(Smarket)
> cor(Smarket[,-9]) ## matrix of pairwise correlations, except the qualitative one
Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
> glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
> summary(glm.fit)
> coef(glm.fit)
> summary(glm.fit)$coef
The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
> glm.probs = predict(glm.fit, type="response")
> glm.probs[1:10]
1 2 3 4 5 6 7 8 9 10
0.507 0.481 0.481 0.515 0.511 0.507 0.493 0.509 0.518 0.489
We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
> contrasts(Smarket$Direction)
Up
Down 0
Up 1
In order to make a prediction, we must convert these predicted probabilities into class labels, Up or Down.
> glm.pred = rep("Down", 1250) ## n = 1250
> glm.pred[glm.probs>.5] = "Up"
Given these predictions, the table() function can be used to produce a confusion matrix.
> table(glm.pred, Smarket$Direction)
Direction
glm.pred Down Up
Down 145 141
Up 457 507
> (507+145)/1250
[1] 0.5216
> mean(glm.pred == Smarket$Direction)
[1] 0.5216
-> ~~~~~~~~~~ 2015.11.09 P.S. Start ~~~~~~~~~~ <-
You can also use the confusionMatrix(prediction, reference) function in the caret package, e.g.
> library("caret")
> lvs <- c("normal", "abnormal")
> truth <- factor(rep(lvs, times = c(86, 258)), levels = rev(lvs))
> pred <- factor(c(rep(lvs, times = c(54, 32)), rep(lvs, times = c(27, 231))), levels = rev(lvs))
> xtab <- table(pred, truth)
> confusionMatrix(xtab)
Confusion Matrix and Statistics
truth
pred abnormal normal
abnormal 231 32
normal 27 54
Accuracy : 0.8285
......
> confusionMatrix(pred, truth) # ditto
See confusionMatrix {caret} for more.
-> ~~~~~~~~~~ 2015.11.09 P.S. End ~~~~~~~~~~ <-
From P159 on, the lab is about holding out a training set; the only thing to note is the usage of the subset argument of glm():
> train = (Smarket$Year<2005)
> glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
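Continuing roughly as in the book's lab, we can then predict on the held-out 2005 observations and check the test accuracy (the book reports about 48%, i.e. worse than random guessing with the full set of predictors):

> Smarket.2005 = Smarket[!train,]
> Direction.2005 = Smarket$Direction[!train]
> glm.probs = predict(glm.fit, Smarket.2005, type="response")
> glm.pred = rep("Down", nrow(Smarket.2005))
> glm.pred[glm.probs>.5] = "Up"
> mean(glm.pred == Direction.2005)  ## test-set accuracy
[1] 0.48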
6.3 Linear Discriminant Analysis
We fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for the lda() function is identical to that of lm().
> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
Prior probabilities of groups:
Down Up
0.492 0.508
Group means:
Lag1 Lag2
Down 0.0428 0.0339
Up -0.0395 -0.0313
Coefficients of linear discriminants:
LD1
Lag1 -0.642
Lag2 -0.514
> plot(lda.fit)
The LDA output indicates the prior probabilities $\hat{\pi}_{Down} = 0.492$ and $\hat{\pi}_{Up} = 0.508$; in other words, 49.2% of the training observations correspond to days during which the market went down.
It also provides the group means; these are the averages of each predictor within each class, and are used by LDA as estimates of $\mu_k$.
The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule, i.e. $-0.642 \times Lag1 - 0.514 \times Lag2$.
The plot() function produces plots of the linear discriminants, obtained by computing $-0.642 \times Lag1 - 0.514 \times Lag2$ for each of the training observations.
The predict() function returns a list with three elements:

- class contains LDA's predictions.
- posterior is a matrix whose $k$th column contains the posterior probability that the corresponding observation belongs to the $k$th class, i.e. the $p_k(x)$ from earlier.
- x contains the linear discriminants.
> train = (Smarket$Year<2005)
> Smarket.2005 = Smarket[!train,]
> Direction.2005 = Smarket$Direction[!train]
> lda.pred = predict(lda.fit, Smarket.2005)
> names(lda.pred)
[1] "class" "posterior " "x"
> lda.class = lda.pred$class
> table(lda.class, Direction.2005)
Direction.2005
lda.class Down Up
Down 35 35
Up 76 106
> mean(lda.class == Direction.2005)
[1] 0.56
> sum(lda.pred$posterior[,1] >= .5)
[1] 70
> sum(lda.pred$posterior[,1] < .5)
[1] 182
Notice that the first column of the posterior probabilities output by the model corresponds to the probability that the market will go Down, not Up. So you'd better take a peek before performing further tasks.
> lda.pred$posterior[1:20 ,1]
> lda.class[1:20]
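For instance, the book's lab checks how many 2005 days would be predicted to go down under a much stricter 90% posterior threshold; none meet it.

> sum(lda.pred$posterior[,1] > .9)
[1] 0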
6.4 Quadratic Discriminant Analysis
QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().
> qda.fit = qda(Direction~Lag1+Lag2, data=Smarket, subset=train)
The predict() function works in exactly the same fashion as for LDA.
> qda.class = predict(qda.fit, Smarket.2005)$class
> table(qda.class, Direction.2005)
Direction.2005
qda.class Down Up
Down 30 20
Up 81 121
> mean(qda.class == Direction.2005)
[1] 0.599
6.5 K-Nearest Neighbors
The knn() function is part of the class library. Rather than a two-step approach in which we first fit the model and then use it to make predictions, knn() forms predictions using a single command. The function requires four inputs.
- A matrix containing the predictors associated with the training data (train.X below).
- A matrix containing the predictors associated with the data for which we wish to make predictions (test.X below).
- A vector containing the class labels for the training observations (train.Direction below).
- A value for K, the number of nearest neighbors to be used by the classifier.
> library(class)
> train.X = cbind(Smarket$Lag1, Smarket$Lag2)[train,]
> test.X = cbind(Smarket$Lag1, Smarket$Lag2)[!train,]
> train.Direction = Smarket$Direction[train]
We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
> set.seed(1)
> knn.pred = knn(train.X, test.X, train.Direction, k=1)
> table(knn.pred, Direction.2005)
Direction.2005
knn.pred Down Up
Down 43 58
Up 68 83
> (83+43)/252
[1] 0.5
The results using K = 1 are not very good, since only 50% of the observations are correctly predicted. It may be that K = 1 results in an overly flexible fit to the data.
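Following the book's lab, repeating with K = 3 gives a slight improvement (about 53.6% accuracy on the 2005 data):

> knn.pred = knn(train.X, test.X, train.Direction, k=3)
> mean(knn.pred == Direction.2005)
[1] 0.536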
6.6 An Application to Caravan Insurance Data
P165 works through a concrete example whose business analysis is worth a read. One technical point to note: the scale() function standardizes the data so that every variable is given a mean of zero and a standard deviation of one.
## exclude column 86 because that is the qualitative Purchase variable
> standardized.X = scale(Caravan[,-86])
> var(Caravan[,1])
[1] 165
> var(Caravan[,2])
[1] 0.165
> var(standardized.X[,1])
[1] 1
> var(standardized.X[,2])
[1] 1
Now every column of standardized.X has a standard deviation of one and a mean of zero.
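The lab then holds out the first 1,000 observations as a test set and runs KNN on the standardized predictors, roughly as follows (the book reports a test error rate of about 12%):

> test = 1:1000
> train.X = standardized.X[-test,]
> test.X = standardized.X[test,]
> train.Y = Caravan$Purchase[-test]
> test.Y = Caravan$Purchase[test]
> set.seed(1)
> knn.pred = knn(train.X, test.X, train.Y, k=1)
> mean(test.Y != knn.pred)  ## overall test error rate
[1] 0.118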