Ensemble Methods in Machine Learning
Summarized from the paper Ensemble Methods in Machine Learning.
Abstract
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. The paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
1. Introduction
We will consider only classification here rather than regression.
An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples.
A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is that the classifiers are accurate and diverse.
- By accurate we mean that a classifier has an error rate better than random guessing on new $x$ values (i.e. an error rate below 0.5 for a binary problem).
  - If the individual errors are independent, the probability that the majority vote will be wrong follows a binomial distribution and drops well below the individual error rate (a minimal sketch of this calculation follows this list).
- By diverse we mean that two classifiers make different errors on new data points.
  - If classifiers $h_1$ and $h_2$ are identical (i.e. not diverse), then when $h_1$ is wrong, $h_2$ will also be wrong, and the majority vote gains nothing.
  - If the classifiers make uncorrelated errors, then when $h_1$ is wrong, the other classifiers may still be correct, so the majority vote can still classify the new point correctly.
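To make the binomial argument concrete, here is a minimal sketch in plain Python (not from the paper); it computes the probability that a majority of independent classifiers, each with the same error rate, are simultaneously wrong. The 21-classifier, 0.3-error setting mirrors the illustrative numbers used in the paper, and independence of errors is the key assumption.

```python
from math import comb

def majority_vote_error(n_classifiers: int, p_error: float) -> float:
    """Probability that a majority of n_classifiers independent classifiers,
    each wrong with probability p_error, are simultaneously wrong."""
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that makes the majority wrong
    return sum(
        comb(n_classifiers, k) * p_error**k * (1 - p_error) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

# 21 classifiers, each with error rate 0.3, errors assumed independent:
print(majority_vote_error(21, 0.3))  # ~0.026, far below the individual error rate of 0.3
```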
It is often possible to construct very good ensembles. There are three fundamental reasons for this.
Given
- $\mathcal{H}$: the search space or hypothesis space
- $h$: a hypothesis (function) provided by the classifier
- $f$: the true hypothesis (function)

- Statistical reason: Without sufficient training data, many different hypotheses in $\mathcal{H}$ give equally good accuracy on the training data. An ensemble can "average" the votes of these hypotheses and reduce the risk of choosing the wrong classifier.
  - In the paper's figure, the inner blue curve denotes the set of hypotheses that all give good accuracy on the training data.
- Computational reason: Many learning algorithms perform a local search that can get stuck in local optima. An ensemble constructed by running the local search from many different starting points may provide a better approximation to the true unknown function.
- Representational reason: In most applications of machine learning, the true function $f$ cannot be represented by any of the hypotheses in $\mathcal{H}$. By forming weighted sums of hypotheses drawn from $\mathcal{H}$, it may be possible to expand the space of representable functions.
2. Methods for Constructing Ensembles
2.1 Bayesian Voting: Enumerating the Hypotheses
In a Bayesian probabilistic setting, each hypothesis $h$ defines a conditional probability distribution $h(x) = P(f(x) = y \mid x, h)$, where $f(x)$ is the true value and $h(x)$ is the predicted value (according to the value of $x$).
Given a new data point $x$ and a training sample $S$, the quantity of interest is $P(f(x) = y \mid S, x)$.
We can rewrite this as a weighted sum over all hypotheses in $\mathcal{H}$: $P(f(x) = y \mid S, x) = \sum_{h \in \mathcal{H}} h(x)\, P(h \mid S)$.
We can view this as an ensemble method in which the ensemble consists of all the hypotheses in $\mathcal{H}$, each weighted by its posterior probability $P(h \mid S)$.
- When the training sample is small, many hypotheses $h$ will have significantly large posterior probabilities.
- When the training sample is large, typically only one hypothesis has substantial posterior probability.
In some learning problems, it is possible to completely enumerate each $h \in \mathcal{H}$ and compute the posterior-weighted vote exactly.
In complex problems where $\mathcal{H}$ cannot be enumerated, Bayesian voting can only be approximated, e.g. by sampling hypotheses according to $P(h \mid S)$.
The most idealized aspect of the Bayesian analysis is the prior belief $P(h)$: it is assumed to capture all of our knowledge about $f$ before seeing the training data, which is rarely realistic in practice.
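As an illustration only, here is a minimal sketch of Bayesian voting over a small, fully enumerable hypothesis space. The 1-D threshold classifiers, the uniform prior, and the 0.9 per-label likelihood are assumptions made purely for the example, not anything prescribed by the paper.

```python
import numpy as np

# Toy setup (assumed for illustration): 1-D inputs, binary labels, and a small
# enumerable hypothesis space of threshold classifiers h_t(x) = 1[x >= t].
X = np.array([0.1, 0.35, 0.4, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
thresholds = np.linspace(0.0, 1.0, 21)                   # the hypothesis space H
prior = np.full(len(thresholds), 1.0 / len(thresholds))  # uniform prior P(h)

def predict(t, x):
    """h_t(x): predict class 1 iff x >= t."""
    return (x >= t).astype(int)

# Likelihood P(S | h): assume each observed label agrees with h with probability 0.9.
eps = 0.1
likelihood = np.array(
    [np.prod(np.where(predict(t, X) == y, 1 - eps, eps)) for t in thresholds]
)
posterior = prior * likelihood
posterior /= posterior.sum()                             # P(h | S)

# Bayesian voting: P(f(x_new) = 1 | S, x_new) = sum_h h(x_new) * P(h | S)
x_new = 0.55
p_class1 = sum(p * predict(t, np.array([x_new]))[0] for t, p in zip(thresholds, posterior))
print(p_class1)
```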
2.2 Manipulating the Training Examples to Generate Multiple Hypotheses
The second method for constructing ensembles is to run the learning algorithm several times, each time with a different subset of the training examples, to generate multiple hypotheses.
This technique works especially well for unstable learning algorithms–algorithms whose output classifier undergoes major changes in response to small changes in the training data.
- Decision trees, neural networks, and rule learning algorithms are all unstable.
- Linear regression, nearest neighbor, and linear threshold algorithms are generally very stable.
Here we introduce three concrete methods of manipulating the training data.
- Bagging: Bagging runs a learning algorithm several times, each time with $m$ training examples drawn randomly with replacement from the original training set of $m$ items. Such a training set is called a bootstrap replicate of the original training set, and the technique is called bootstrap aggregation, Bagging for short. Each bootstrap replicate contains, on average, 63.2% of the original training set. (A minimal sketch of Bagging follows this list.)
- N-fold running: E.g. the training set can be randomly divided into 10 disjoint subsets. Then 10 overlapping training sets can be constructed by dropping out a different one of the 10 subsets. Ensembles constructed in this way are sometimes called Cross-validated Committees.
- AdaBoost: AdaBoost maintains a set of weights over the training examples. In each iteration $\ell$, the learning algorithm is invoked to minimize the weighted error on the training set, and it returns a hypothesis $h_\ell$. The weighted error of $h_\ell$ is computed and used to update the weights on the training examples. The effect of the change in weights is to place more weight on misclassified examples and less weight on correctly classified ones. Finally, a classifier weight $w_\ell$ is computed according to $h_\ell$'s accuracy, and the final classifier is the weighted vote $h_f(x) = \operatorname{sign}\left(\sum_{\ell} w_\ell h_\ell(x)\right)$.
  - AdaBoost can also be viewed in terms of margins. See the paper for details.
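Below is a minimal Bagging sketch, assuming scikit-learn is available; the decision-tree base learner and the choice of 25 replicates are arbitrary illustrative choices. It also checks empirically that each bootstrap replicate covers roughly 63.2% of the original training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

n_models, m = 25, len(X_train)
models, coverage = [], []
for _ in range(n_models):
    idx = rng.integers(0, m, size=m)           # bootstrap replicate: m draws with replacement
    coverage.append(len(np.unique(idx)) / m)   # fraction of original examples that appear
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

print(f"mean coverage per replicate: {np.mean(coverage):.3f}")  # ~0.632

# Unweighted majority vote of the ensemble on the test set
votes = np.mean([clf.predict(X_test) for clf in models], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("ensemble accuracy:", np.mean(y_pred == y_test))
```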
2.3 Manipulating the Input Features
For example, in a project to identify volcanoes on Venus, Cherkauer (1996) trained an ensemble of 32 neural networks, based on 8 different subsets of the 119 available input features and 4 different network sizes. The input feature subsets were selected (by hand) to group together features that were based on different image processing operations (such as principal component analysis and fast Fourier transform).
Obviously, this technique only works when the input features are highly redundant.
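This is not Cherkauer's setup; as a generic illustration, here is a minimal sketch that trains each ensemble member on a different, randomly chosen feature subset and combines them by unweighted voting. The subset size, number of members, and base learner are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=600, n_features=30, n_informative=20, random_state=1)
X_tr, y_tr, X_te, y_te = X[:500], y[:500], X[500:], y[500:]

n_members, subset_size = 8, 12   # hypothetical choices (Cherkauer chose his subsets by hand)
members = []
for _ in range(n_members):
    feats = rng.choice(X.shape[1], size=subset_size, replace=False)  # a random feature subset
    members.append((feats, DecisionTreeClassifier().fit(X_tr[:, feats], y_tr)))

# Unweighted vote of the members, each seeing only its own feature subset
votes = np.mean([clf.predict(X_te[:, feats]) for feats, clf in members], axis=0)
print("ensemble accuracy:", np.mean((votes >= 0.5).astype(int) == y_te))
```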
2.4 Manipulating the Output Targets
Dietterich & Bakiri (1995) describe a technique called Error-Correcting Output Coding (ECOC). Suppose that the number of classes, $K$, is large. New binary learning problems can be constructed by randomly partitioning the $K$ classes into two subsets $A_\ell$ and $B_\ell$; the data are relabeled so that classes in $A_\ell$ receive the derived label 0 and classes in $B_\ell$ receive the derived label 1, and the learning algorithm produces a classifier $h_\ell$.
By repeating this process $L$ times (with different partitions), we obtain an ensemble of $L$ classifiers $h_1, \ldots, h_L$.
Now given a new data point $x$, each $h_\ell$ is evaluated; every class in the subset predicted by $h_\ell$ receives a vote, and the class with the most votes is the ensemble's prediction.
This method can also be interpreted from an encoding point of view; see the paper for details. A minimal sketch of that view follows below.
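Here is a minimal sketch of the encoding view of ECOC, assuming scikit-learn; the random codewords, the decision-tree bit learners, and the choice of $K=5$ classes with length-15 codewords are illustrative assumptions. Each class gets a binary codeword, one classifier is trained per codeword bit, and a new point is assigned the class whose codeword is nearest in Hamming distance to the predicted bit string.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
K, L = 5, 15   # K classes, codewords of length L (illustrative values)
X, y = make_classification(n_samples=800, n_features=20, n_informative=10,
                           n_classes=K, n_clusters_per_class=1, random_state=2)
X_tr, y_tr, X_te, y_te = X[:600], y[:600], X[600:], y[600:]

codewords = rng.integers(0, 2, size=(K, L))   # one random binary codeword per class

# Learn one binary classifier per codeword bit: each example is relabeled with
# the corresponding bit of its class's codeword.
bit_classifiers = [DecisionTreeClassifier().fit(X_tr, codewords[y_tr, l]) for l in range(L)]

# Decode: predict an L-bit string for each test point, then pick the class whose
# codeword is closest in Hamming distance.
bits = np.stack([clf.predict(X_te) for clf in bit_classifiers], axis=1)   # shape (n_test, L)
hamming = np.array([[np.sum(b != cw) for cw in codewords] for b in bits])
y_pred = hamming.argmin(axis=1)
print("accuracy:", np.mean(y_pred == y_te))
```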
2.5 Injecting Randomness
This one is easy to understand: wherever a choice would otherwise be made by hand, simply make it at random (e.g. the initial weights of a neural network, or the choice among equally good splits in a decision tree), so that each run produces a different classifier.
3. Comparing Different Ensemble Methods
As discussed above, ensembles outperform individual classifiers for statistical, computational, and representational reasons; put differently, an ensemble solves the statistical, computational, and representational problems that a single classifier faces. This viewpoint is useful when comparing ensemble methods, because some methods address only one of these problems while others address two or all three.
Overfitting is also related to these problems.
Some of the comparative results are due to other properties of the algorithms themselves.
See the paper for details.
4. Conclusions
Omitted here.