Terminology Recap: Sampling / Sample / Sample Space / Experiment / Statistical Model / Statistic / Estimator / Empirical Distribution / Resampling / CV / Jackknife / Bootstrap / Bagging / Likelihood / Estimation and Machine Learning
1. Sampling from a Probability Distribution
I think the best explanation is kccu@StackExchange: How does one formally define sampling from a probability distribution?:
I have just described how to go from a random variable to its distribution function, but we can go the other way. Namely, given a distribution function $F$, we can sample a random variable from it, by which we mean we choose a probability space $(\Omega, \mathcal{F}, P)$ and a function $X: \Omega \to \mathbb{R}$ satisfying $P(X \leq x) = F(x)$ for all $x \in \mathbb{R}$. It is not obvious that such a random variable must exist! But in fact one does provided you have a valid distribution function.
Now let's look back at the (not particularly useful) definition of sampling, from Wikipedia: Sampling (statistics):
… sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt for the samples to represent the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.
So:

- Suppose the real-world problem domain has a probability space $(\Omega, \mathcal{F}, P)$
  - In statistics, this set of outcomes $\Omega$ is called the population
- We somehow work backwards from a distribution (which may itself have been estimated from a sample) to a probability space $(\Omega', \mathcal{F}', P')$ (via sampling)
  - This $\Omega'$ is called a statistical sample
  - There may even be a "chicken or egg" problem hiding here: do you first take a sample and use it to estimate a distribution, or do you first take a distribution and derive a sample from it?
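To make "choosing a function $X$ that matches a given $F$" concrete, here is a minimal inverse transform sampling sketch in Python (my own illustration, not from the quoted answer; the Exponential distribution and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_exponential(lam: float, n: int) -> np.ndarray:
    """Sample from Exponential(lam) via inverse transform sampling.

    The CDF is F(x) = 1 - exp(-lam * x); its inverse is
    F^{-1}(u) = -ln(1 - u) / lam. Feeding U ~ Uniform(0, 1)
    through F^{-1} yields X with distribution F.
    """
    u = rng.uniform(size=n)          # U ~ Uniform(0, 1)
    return -np.log(1.0 - u) / lam    # X = F^{-1}(U) ~ Exponential(lam)

xs = sample_exponential(lam=2.0, n=100_000)
print(xs.mean())  # ~= 1 / lam = 0.5
```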
2. Sample
Although the definition of sampling is clear, unfortunately "a sample" can refer to:

- A random variable $X$ (from sampling)
  - In Terminology Recap: Random Variable / Distribution / PMF / PDF / Independence / Marginal Distribution / Joint Distribution / Conditional Random Variable we saw that a random variable vector (say of length $n$) can also be viewed as a single random variable (by introducing a joint distribution), so a sample may also be a random variable vector, which can in turn be understood as $n$ random variables, and hence as $n$ samples ($X_1, \dots, X_n$)
- A set of outcomes $\Omega'$ (from sampling)
  - Suppose $\lvert \Omega' \rvert = n$; these $n$ outcomes are also called $n$ sample values
  - So a sample can also be understood as $n$ sample values
    - If you insist on understanding these $n$ sample values as coming from $n$ random variables, and therefore from $n$ samples, I think you are overcomplicating the problem. Don't do this.
3. Sample Space / Experiment
Although it may look natural, the $\Omega'$ produced by sampling is not what we call a sample space.

A true sample space is always tied to an experiment. Unfortunately, experiment has no formal definition. Wikipedia: Experiment (probability theory):
In probability theory, an experiment or trial is any procedure that can be infinitely repeated and has a well-defined set of possible outcomes, known as the sample space.
No matter. Just remember one thing: an experiment always corresponds to a probability space $(\Omega, \mathcal{F}, P)$:

- This $\Omega$ is what we call the sample space; as for why it is named that, I would love to ask whoever coined the term
- Did you notice? Sample space and sampling are not directly related by definition! Unbelievable! (Of course, you can force a connection through the definition of sample, but does that actually help your understanding?)
- The sample space $\Omega$ is not necessarily the population, but it can be:
  - An example where the sample space is smaller than the population:
    - e.g. the population is the heights of everyone in the country
    - the experiment is surveying the heights of 100 people in Beijing; if the tallest person in the country happens not to be in Beijing, then no matter how you sample in Beijing you can never draw that maximum height value
  - An example where the sample space equals the population:
    - the experiment is "roll a die once"
4. Statistical Model
4.1 (Prerequisites) Mathematical Model
A mathematical model is an abstract concept. Wikipedia: Mathematical model:
A mathematical model is a description of a system using mathematical concepts and language.
- As for what a system is: we won't expand on that here; it is also an abstract concept
Mathematical models are usually composed of relationships and variables.
- Relationships can be described by operators, such as algebraic operators, functions, differential operators, etc.
- Variables are abstractions of system parameters of interest, that can be quantified.
Depending on the criterion, mathematical models can be divided into the following broad categories:

- Linear vs. nonlinear
  - The criterion: whether the operators are linear
- Static vs. dynamic
  - A dynamic model accounts for time-dependent changes in the state of the system.
    - Such models typically use differential equations or difference equations
      - Note: a difference equation is essentially a recurrence relation, something of the form $x_n = f(x_{n-1}, x_{n-2}, \dots)$
  - A static (or steady-state) model calculates the system in equilibrium, and is thus time-invariant.
- Explicit vs. implicit
  - Explicit means "all of the input parameters of the overall model are known, and the output parameters can be calculated by a finite series of computations"
  - Implicit means "the output parameters are unknown, and the corresponding inputs must be solved for by an iterative procedure"
    - Newton's method is one such iterative procedure
- Discrete vs. continuous
- Deterministic vs. probabilistic (stochastic)
  - A deterministic model is one in which every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables; therefore, a deterministic model always performs the same way for a given set of initial conditions.
  - In a stochastic model, randomness is present, and variable states are not described by unique values, but rather by probability distributions.
  - The probability models and statistical models discussed below both belong to the stochastic category
- Deductive vs. inductive
  - A deductive model is a logical structure based on a theory.
  - An inductive model arises from empirical findings and generalization from them.
4.2 (Prerequisites) Probability Model
Cutting to the chase: a probability model is defined by a probability space $(\Omega, \mathcal{F}, P)$.

- Note: $\Omega$ is the sample space
4.3 Statistical Model
According to MIT 18.650 Statistics for Applications:

Let the observed outcome of a statistical experiment be a sample $X_1, \dots, X_n$ of $n$ i.i.d. random variables in some measurable space $E$, and denote by $\mathbb{P}$ their unknown distribution. A statistical model is a pair $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$, where $E$ is the sample space, $(\mathbb{P}_\theta)_{\theta \in \Theta}$ is a family of probability measures (i.e. distributions) on $E$, and $\Theta$ is the parameter set.
4.3.1 Specification
A statistical model is said to be well-specified if $\exists \theta \in \Theta$ such that $\mathbb{P} = \mathbb{P}_\theta$.

A statistical model is said to be misspecified if $\forall \theta \in \Theta$, $\mathbb{P} \neq \mathbb{P}_\theta$.

What does it mean for a probability model to be "well-specified" or "misspecified"?:

Well-specified means that the class of distributions $(\mathbb{P}_\theta)_{\theta \in \Theta}$ you are assuming for your modeling actually contains the unknown probability distribution $\mathbb{P}$ from where the sample is drawn.

Misspecified means, on the other hand, that $(\mathbb{P}_\theta)_{\theta \in \Theta}$ does not contain $\mathbb{P}$. You made a modeling assumption, and it is not perfect: for instance, you assume your sample is Gaussian, but (maybe due to noise, or just inherently) it is not actually originating from any Gaussian distribution.

In general, we assume that our statistical model is well-specified.

This particular $\theta$, the one satisfying $\mathbb{P} = \mathbb{P}_\theta$, is called the true parameter and is denoted by $\theta^*$.
4.3.2 Identifiability
We say that a statistical model $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$ is identifiable if the map $\theta \mapsto \mathbb{P}_\theta$ is injective, i.e. $\mathbb{P}_\theta = \mathbb{P}_{\theta'} \Rightarrow \theta = \theta'$.

In that case, any parameter $\theta$ can be uniquely recovered from its distribution $\mathbb{P}_\theta$.
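A standard textbook illustration of non-identifiability (my own example, not from 18.650):

$$X \sim N(\mu_1 + \mu_2,\, 1), \qquad \theta = (\mu_1, \mu_2) \in \mathbb{R}^2$$

Here $\theta = (0, 1)$ and $\theta = (1, 0)$ induce the same distribution $N(1, 1)$, so $\theta \mapsto \mathbb{P}_\theta$ is not injective; reparameterizing with $\mu = \mu_1 + \mu_2$ restores identifiability.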
4.3.3 Dimension
If $\Theta \subseteq \mathbb{R}^d$ for some $d \geq 1$, we call $d$ the dimension of the model.

- The model is said to be parametric if it has a finite dimension
- The model is said to be nonparametric if it has an infinite dimension

Note: dimension is not exclusive to statistical models; any mathematical model involving parameters should have this concept. See Parametric vs. non-parametric models
5. Statistic / Estimator
Given an observed sample $X_1, \dots, X_n$ and a statistical model $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$:

- A statistic is any measurable function of the sample, e.g. $\bar{X}_n$, $\max_i X_i$, etc.
- An estimator of $\theta$, denoted by $\hat{\theta}_n$, is any statistic whose expression does not depend on $\theta$
  - e.g. $\hat{\theta}_n = \hat{\theta}_n(X_1, \dots, X_n) = \bar{X}_n$
- An estimate of $\theta$ is denoted by $\hat{\theta}_n(x_1, \dots, x_n)$
  - e.g. $\hat{\theta}_n(x_1, \dots, x_n) = \bar{x}_n$
  - Here we again run into a question like "what does a random variable $X = x$ mean?". When computing the estimate $\hat{\theta}_n(x_1, \dots, x_n)$ from the estimator $\hat{\theta}_n(X_1, \dots, X_n)$ we really are just replacing each $X_i$ with $x_i$, but that does not mean $X_i = x_i$, nor that we compute with "the value of $X_i$"; rather, we are simulating a realization $X_i \to x_i$, and we do not care here what value $X_i$ itself takes

Note the logical relationships:

- estimand $\theta$ = the parameter of interest (to be estimated)
- estimator $\hat{\theta}_n$ = a function (for estimating)
- estimate $\hat{\theta}_n(x_1, \dots, x_n)$ = a value (the result of estimating)

Essentially, a statistic/estimator is a function, and also a random variable, while an estimate is a value. Unfortunately:

- There is no dedicated noun for the concrete value of a statistic, so "statistic" sometimes denotes the function (or random variable) and sometimes a concrete value
- An estimator is always a function (or random variable), but some people insist on calling it an estimate, and refuse to use capital letters ($\hat{\theta}_n(x_1, \dots, x_n)$ instead of $\hat{\theta}_n(X_1, \dots, X_n)$), so when you see a phrase like "an estimate $\hat{\theta}_n$" you must judge from context whether it means "an estimator $\hat{\theta}_n(X_1, \dots, X_n)$" or "an estimate $\hat{\theta}_n(x_1, \dots, x_n)$"

Estimation of a parameter is called point estimation (since it views the parameter as a single point); a sketch of the estimator/estimate distinction follows below.
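A minimal sketch of the estimand/estimator/estimate distinction, assuming a Gaussian sample with true mean $\theta = 3$ (my own illustration; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimator: a *function* of the sample (here, the sample mean).
# Applied to the random variables X_1..X_n it is itself a random variable.
def theta_hat(sample: np.ndarray) -> float:
    return float(sample.mean())

# One realization (x_1, ..., x_n) of the sample: concrete numbers.
x = rng.normal(loc=3.0, scale=1.0, size=100)

# Estimate: the *value* of the estimator at the observed sample.
print(theta_hat(x))  # a single number close to the estimand theta = 3.0
```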
5.1 Bias
The bias of an estimator $\hat{\theta}_n$ of $\theta$ is defined as $\operatorname{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$.

An estimator $\hat{\theta}_n$ is unbiased if $\operatorname{bias}(\hat{\theta}_n) = 0$, i.e. $E[\hat{\theta}_n] = \theta$.

An estimator $\hat{\theta}_n$ is biased if $E[\hat{\theta}_n] \neq \theta$.
5.2 Variance / Standard Deviation / Standard Error (to its own Mean)
For any random variable $X$ (and an estimator is a random variable), the variance is:

$$\operatorname{Var}(X) = E\big[(X - E[X])^2\big]$$

The standard deviation is:

$$\operatorname{sd}(X) = \sigma = \sqrt{\operatorname{Var}(X)}$$

The standard error (to its own mean) is:

$$\operatorname{SE} = \frac{\sigma}{\sqrt{n}}$$

where $n$ is the size of the sample (number of observations).

If you view the sample mean $\bar{X}_n$ as a random variable in its own right, then $\operatorname{sd}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} = \operatorname{SE}$: the standard error of $X$ is exactly the standard deviation of the sample mean (a simulation sketch follows below).

One thing to note:

- $\operatorname{bias}$ involves $\theta$, so it describes "the relationship between $\hat{\theta}_n$ and $\theta$"
- whereas $\operatorname{Var}$, $\operatorname{sd}$ and $\operatorname{SE}$ do not involve $\theta$, so they describe "properties of $\hat{\theta}_n$ itself"
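A quick simulation sketch (my own, with arbitrary $\sigma$ and $n$) checking that the empirical standard deviation of the sample mean matches $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, trials = 2.0, 50, 200_000

# Draw many independent samples and compute the sample mean of each,
# so we can observe the sampling distribution of the estimator.
means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)

print(means.std())          # empirical sd of the sample mean ...
print(sigma / np.sqrt(n))   # ... matches the standard error sigma / sqrt(n)
```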
5.3 Error (to its Estimand) / Mean Square Error (to its Estimand) / (Quadratic) Risk
The error of an estimator $\hat{\theta}_n$ (to its estimand) is $\hat{\theta}_n - \theta$.

The mean square error of an estimator $\hat{\theta}_n$ is $\operatorname{MSE}(\hat{\theta}_n) = E\big[(\hat{\theta}_n - \theta)^2\big]$.

The (quadratic) risk of an estimator $\hat{\theta}_n$ is the same quantity, $E\big[(\hat{\theta}_n - \theta)^2\big]$.

- Clearly, $\operatorname{MSE}(\hat{\theta}_n) = \operatorname{Var}(\hat{\theta}_n) + \operatorname{bias}(\hat{\theta}_n)^2$

If $\hat{\theta}_n$ is unbiased, then $\operatorname{MSE}(\hat{\theta}_n) = \operatorname{Var}(\hat{\theta}_n)$.
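A simulation sketch (my own) verifying $\operatorname{MSE} = \operatorname{Var} + \operatorname{bias}^2$ on a deliberately biased estimator, the divide-by-$n$ variance estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, trials = 4.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
est = samples.var(axis=1, ddof=0)   # biased variance estimator (divides by n)

bias = est.mean() - sigma2          # ~= -sigma2 / n, i.e. nonzero
var = est.var()
mse = ((est - sigma2) ** 2).mean()

print(mse, var + bias**2)           # the two agree: MSE = Var + bias^2
```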
5.4 Consistency
Let $\hat{\theta}_n = \hat{\theta}_n(X_1, \dots, X_n)$ be an estimator of $\theta$ based on a sample of size $n$.

Estimator $\hat{\theta}_n$ is weakly consistent if $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$,

which means $\lim_{n \to \infty} P\big(\lvert \hat{\theta}_n - \theta \rvert > \epsilon\big) = 0$ for every $\epsilon > 0$.

- We can also write convergence in probability as $\operatorname{plim}_{n \to \infty} \hat{\theta}_n = \theta$

Estimator $\hat{\theta}_n$ is mean square consistent if $\hat{\theta}_n \xrightarrow{L^2} \theta$, i.e. $\lim_{n \to \infty} E\big[(\hat{\theta}_n - \theta)^2\big] = 0$.

Estimator $\hat{\theta}_n$ is $L^r$ consistent if $\hat{\theta}_n \xrightarrow{L^r} \theta$, i.e. $\lim_{n \to \infty} E\big[\lvert \hat{\theta}_n - \theta \rvert^r\big] = 0$.

- In this light, weak consistency can also be viewed as a kind of $L^0$ consistency (convergence in probability is sometimes called $L^0$ convergence)

Estimator $\hat{\theta}_n$ is strongly consistent if $\hat{\theta}_n \xrightarrow{a.s.} \theta$, which means $P\big(\lim_{n \to \infty} \hat{\theta}_n = \theta\big) = 1$.

- We can also write almost sure convergence as "$\hat{\theta}_n \to \theta$ with probability 1"

Note:

- $L^r$ consistency $\Rightarrow$ weak consistency
- mean square consistency ($L^2$ consistency) $\Rightarrow$ weak consistency
- strong consistency $\Rightarrow$ weak consistency
- But I am not yet clear on the relative strength among mean square consistency, $L^r$ consistency and strong consistency
5.5 Efficiency (of Unbiased Estimators)
Suppose $\hat{\theta}$ is an unbiased estimator of $\theta$. The Cramér-Rao lower bound states that

$$\operatorname{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where the Fisher information $I(\theta)$ is defined as $I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^2\right]$.

- Strictly speaking, Fisher information is a property of the model, so it should treat $\theta$ as a variable; $I(\cdot)$ should therefore be a function, and only at a specific $\theta$ does $I(\theta)$ take a concrete value
- For more, see Chapter 8 - Fisher Information

The efficiency of $\hat{\theta}$ is defined as $e(\hat{\theta}) = \dfrac{1 / I(\theta)}{\operatorname{Var}(\hat{\theta})} \leq 1$.

Suppose we have two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$.

Because they are both unbiased, their MSEs equal their variances, so comparing efficiency amounts to comparing variance: the estimator with the smaller variance is the more efficient one.
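A sketch (my own, Bernoulli example) checking that the sample mean attains the Cramér-Rao bound, i.e. has efficiency $\approx 1$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 0.3, 100, 200_000

# Fisher information of one Bernoulli(p) observation: I(p) = 1 / (p (1 - p)),
# so the Cramer-Rao lower bound for an unbiased estimator from n samples
# is 1 / (n I(p)) = p (1 - p) / n.
crlb = p * (1 - p) / n

# The sample mean is an unbiased estimator of p; check its variance.
p_hat = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
print(p_hat.var(), crlb)  # nearly equal => efficiency e(p_hat) ~= 1
```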
6. Empirical Distribution / Resampling
6.1 EDF $\hat{F}_n$
Let $X_1, \dots, X_n$ be i.i.d. random variables with CDF $F(x)$. The empirical distribution function (EDF) is defined as

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{X_i \leq x\}}$$

where $\mathbb{1}_{\{X_i \leq x\}}$ is the indicator of the event $\{X_i \leq x\}$.

- Note that for each fixed $x$, $\hat{F}_n(x)$ is itself a random variable

For a fixed $x$, the indicator $\mathbb{1}_{\{X_i \leq x\}}$ is a Bernoulli random variable with parameter $p = F(x)$, so $n \hat{F}_n(x)$ is a binomial random variable with mean $n F(x)$ and variance $n F(x) \big(1 - F(x)\big)$.

In addition, $\hat{F}_n(x)$ is an unbiased estimator of $F(x)$: $E[\hat{F}_n(x)] = F(x)$.
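A minimal EDF sketch (my own; assumes `scipy` is available for the true normal CDF):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x_obs = rng.normal(size=1000)   # sample from the "true" distribution N(0, 1)

def edf(sample: np.ndarray, x: float) -> float:
    """F_hat_n(x) = (1/n) * #{i : X_i <= x}."""
    return float(np.mean(sample <= x))

for x in (-1.0, 0.0, 1.0):
    print(x, edf(x_obs, x), norm.cdf(x))   # EDF values track the true CDF F(x)
```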
6.2 What does an Empirical Distribution represent?
We spent a while above on the EDF, but what exactly is an empirical distribution? I think joriki's answer, StackExchange: What does an empirical distribution represent? puts it most concisely:
The empirical distribution is the distribution you’d get if you sampled from your sample instead of the whole population.
whuber's answer, StackExchange: What does an empirical distribution represent? explains why we use the empirical distribution at all:
The intuition is that if your observations are representative of the original population (that is, of the set of tickets in the original box), then you can study the EDF to learn how to make inferences about the contents of a box based on a sample of it.
In short: we use the sample's empirical distribution to simulate, approximate, and stand in for the population's true (but unknown) distribution.
6.3 Resampling / CV
In machine learning, we use the training dataset's empirical distribution to simulate/approximate/stand in for the population's true (but unknown) distribution. This naturally raises questions:

- Is it one-sided to use a single training dataset from start to finish?
- What if this particular training dataset's empirical distribution is far from the true distribution?

So we decided to inject some randomness into the training dataset, which gives us CV. And CV is essentially resampling:

- resampling: obtaining different samples
- CV: obtaining different training datasets
6.3.1 Jackknife Resampling / Leave-one-out CV
See Introduction to resampling methods, 453 Bootstrap by Rozenn Dahyot:
The Jackknife samples are computed by leaving out one observation $x_i$ at a time: $x_{(i)} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$
So Jackknife resampling and LOOCV are really the same thing; a sketch follows below.
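A jackknife sketch (my own; `jackknife_se` is a hypothetical helper) using the usual jackknife standard error formula $\widehat{\operatorname{se}}_{\text{jack}} = \sqrt{\frac{n-1}{n} \sum_i \big(\hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)}\big)^2}$:

```python
import numpy as np

def jackknife_se(x: np.ndarray, statistic) -> float:
    """Jackknife estimate of the standard error of `statistic`.

    Each jackknife sample x_(i) leaves out exactly one observation,
    exactly like the folds of leave-one-out CV.
    """
    n = len(x)
    # statistic evaluated on each leave-one-out sample
    theta_i = np.array([statistic(np.delete(x, i)) for i in range(n)])
    theta_bar = theta_i.mean()
    return float(np.sqrt((n - 1) / n * np.sum((theta_i - theta_bar) ** 2)))

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=2.0, size=200)
print(jackknife_se(x, np.mean))         # ~= sd(x) / sqrt(n) ...
print(x.std(ddof=1) / np.sqrt(len(x)))  # ... the usual SE of the mean
```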
6.3.2 $k$-fold CV
6.3.3 Bootstrap Resampling (Bootstrapping)
See Bootstrap, Jackknife and other resampling methods, 453 Bootstrap by Rozenn Dahyot:
A bootstrap sample $x^* = (x_1^*, \dots, x_n^*)$ is obtained by randomly sampling $n$ times, with replacement, from the original data points $x = (x_1, \dots, x_n)$.

Suppose all $n$ observations are distinct. The probability that a particular observation $x_i$ does not appear in a bootstrap sample of size $n$ is $\left(1 - \frac{1}{n}\right)^n$,

since each of the $n$ independent draws misses $x_i$ with probability $1 - \frac{1}{n}$.

When $n \to \infty$, $\left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368$.

In other words, for large $n$, a bootstrap sample is expected to contain roughly $1 - e^{-1} \approx 63.2\%$ of the distinct original observations; the simulation below checks this.
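A simulation sketch (my own) checking the $1 - 1/e$ fact:

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 1000, 2000

frac_present = np.empty(trials)
for t in range(trials):
    idx = rng.integers(0, n, size=n)           # sample n indices with replacement
    frac_present[t] = np.unique(idx).size / n  # fraction of distinct originals kept

print(frac_present.mean())  # ~= 0.632 ...
print(1 - np.exp(-1.0))     # ... = 1 - 1/e
```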
6.3.4 Bootstrap Aggregating (Bagging)
Bootstrap resampling does not correspond directly to any form of CV, but as a resampling method it can appear directly in ensemble frameworks, e.g. bagging.

Bagging is actually very simple (see the sketch after this list):

- Bootstrap resample, 1st time $\Rightarrow$ classifier 1
- Bootstrap resample, 2nd time $\Rightarrow$ classifier 2
- $\cdots$
- Bootstrap resample, $B$-th time $\Rightarrow$ classifier $B$
- Aggregate (e.g. by majority vote) $\Rightarrow$ final classifier

Bagging helps reduce variance.
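A bagging sketch (my own; assumes scikit-learn's `DecisionTreeClassifier` and `make_moons`, which are not mentioned in the original):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X_train, y_train = make_moons(n_samples=200, noise=0.3, random_state=0)
X_test, y_test = make_moons(n_samples=2000, noise=0.3, random_state=1)

B = 50
votes = np.zeros((B, len(X_test)))
for b in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))          # bootstrap resample b
    clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])  # classifier b
    votes[b] = clf.predict(X_test)

y_bagged = (votes.mean(axis=0) > 0.5).astype(int)  # aggregate: majority vote
single = DecisionTreeClassifier().fit(X_train, y_train).predict(X_test)
print((single == y_test).mean(), (y_bagged == y_test).mean())  # bagging usually wins
```

Unpruned decision trees are the classic base learner here precisely because they are high-variance-low-bias; averaging many of them trims the variance.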
7. The Likelihood Function / Sufficient Statistics
7.1 Sufficient Statistics
Lecture Notes 5, Stat 705, Larry Wasserman@CMU:
Suppose that $X^n = (X_1, \dots, X_n) \sim p(x^n; \theta)$. A statistic $T = T(X^n)$ is sufficient if the conditional distribution of $X^n$ given $T$ does not depend on $\theta$. $T$ is a minimal sufficient statistic (MSS) if $T$ is sufficient, and for any other sufficient statistic $U$, there exists some function $g$ such that $T = g(U)$.

- In other words, any sufficient statistic $U$ can be transformed (or, you could say, reduced) into the MSS $T$
What does sufficiency mean?
If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function. It does not contain all the information in the data.
Note that "information" here is an abstract notion without a formal definition (and so far I have not seen any information-theoretic concept tied to it). So what does "all the information" refer to, concretely?

- It actually refers to "all the information needed to compute any estimate of the parameter"
- In other words, if a statistic $T$ contained all the information, we could estimate in the form $\hat{\theta}(T)$ rather than $\hat{\theta}(X_1, \dots, X_n)$
- So saying that a sufficient statistic $T$ does not contain all the information means that $T$ cannot always replace $X_1, \dots, X_n$ when computing estimates
Lecture Notes 6, Stat 705, Larry Wasserman@CMU gives an example:
Let $X_1, \dots, X_n \sim F$ where $F$ is completely unknown (a nonparametric model), and let the estimand be a functional $\theta = T(F)$.

The likelihood function contains no information at all. But we can still estimate $\theta$, e.g. with the plug-in estimator $\hat{\theta} = T(\hat{F}_n)$,

and $\hat{\theta}$ is consistent under weak conditions.
7.2 The Likelihood Function
Let $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$ be a statistical model associated with a sample of i.i.d. discrete random variables $X_1, \dots, X_n$. The likelihood of the model is the function

$$L(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} p_\theta(x_i)$$

where $p_\theta$ is the PMF associated with $\mathbb{P}_\theta$.

Let $\big(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\big)$ be a statistical model associated with a sample of i.i.d. continuous random variables $X_1, \dots, X_n$. The likelihood of the model is the function

$$L(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f_\theta(x_i)$$

where $f_\theta$ is the PDF associated with $\mathbb{P}_\theta$.
Wikipedia: Likelihood function:
- In informal contexts, likelihood is often used as a synonym for probability.
- In statistics, the two terms have different meanings.
- Probability is used to describe the plausibility of some data, given a value for the parameter.
- Likelihood is used to describe the plausibility of a value for the parameter, given some data.
- A very philosophical "two sides of the same coin" idea
7.3 The Likelihood Function is a Minimal Sufficient Statistic
The first question: why is the likelihood function a statistic at all? The confusion mainly comes from:

- the definition of statistic itself, since a statistic can refer to either a function or a value
- the likelihood function being a parametric statistic (varied by $\theta$):
  - If we define the likelihood as $L(x_1, \dots, x_n; \theta)$, it is a parametric statistic value
  - If we view the likelihood as $L(X_1, \dots, X_n; \theta)$, it is a parametric statistic function
That the likelihood function is sufficient is fairly intuitive; the next step is to show that it is an MSS. This requires the Factorization Theorem:

Suppose the sample $X^n = (X_1, \dots, X_n)$ has joint density (or PMF) $p(x^n; \theta)$. A statistic $T$ is sufficient if and only if we can write

$$p(x^n; \theta) = h(x^n) \, g\big(T(x^n); \theta\big)$$

where:

- function $g$ depends on the data ONLY through the statistic $T$, and
- function $h$ does not depend on $\theta$
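A standard worked example (my own, Bernoulli): for $X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$ and $T(x^n) = \sum_{i} x_i$,

$$p(x^n; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \underbrace{\theta^{T(x^n)} (1 - \theta)^{\,n - T(x^n)}}_{g(T(x^n);\, \theta)} \cdot \underbrace{1}_{h(x^n)}$$

so $T = \sum_i X_i$ is sufficient by the Factorization Theorem: the likelihood depends on the data only through the count of successes.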
Looking back at the meaning of sufficiency:

- If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function.
  - This is now easy to understand, because every sufficient statistic $T$ can be reduced to the MSS, i.e. to the likelihood function
- ... It does not contain all the information in the data.
  - In other words, the likelihood function does not contain all the information either
  - Some textbooks claim the likelihood function (roughly) contains all the information; that is an imprecise statement (I would go as far as calling it incorrect)
    - Even so, there remain plenty of settings where the parameter can be estimated via the likelihood function
8. Connections between Estimation and Machine Learning
8.1 Unsupervised / Supervised Learning
Suppose the population is $X \sim P$. Unsupervised learning is, in essence, estimating the distribution $P$ from a sample $X_1, \dots, X_n$.

- Unsupervised learning may also estimate only some properties of $P$ rather than necessarily estimating $P$ itself.
- Note: this part is updated in Terminology Recap: Generative Models / Discriminative Models / Frequentist Machine Learning / Bayesian Machine Learning / Supervised Learning / Unsupervised Learning
Suppose the population is $(X, Y) \sim P$. Supervised learning is, in essence, estimating the conditional distribution $P(Y \mid X)$ from a sample $(X_1, Y_1), \dots, (X_n, Y_n)$.

- Once we assume a statistical model, "the process of estimating the distribution $P(Y \mid X)$" can be converted into "the process of estimating the parameter $\theta$"
- Note: what is described above is really frequentist discriminative supervised learning, which does not cover all of supervised learning. This part is updated in Terminology Recap: Generative Models / Discriminative Models / Frequentist Machine Learning / Bayesian Machine Learning / Supervised Learning / Unsupervised Learning
8.2 Training Error / Test Error / Underfitting / Overfitting
We describe datasets with design matrices, i.e. a training set $\big(X^{(\text{train})}, y^{(\text{train})}\big)$ and a test set $\big(X^{(\text{test})}, y^{(\text{test})}\big)$, and we assume:

- the training examples are i.i.d. (by distribution $P$)
- the test examples are i.i.d. (by the same distribution $P$)

When evaluating a machine learning model, we generally pick an "error measure", for example the MSE of the predictions.

- Note that what we use here is the MSE of an estimate rather than the MSE of an estimator

But the learning process does not work that way: we first obtain an estimate by minimizing the error on the training set, and only then evaluate the error on the test set.

- This train/test mismatch is the root of generalization error
- You may notice a dimension problem here: if $n^{(\text{train})} \neq n^{(\text{test})}$, the prediction vectors have different numbers of entries, so how can both be plugged into the same error measure? The trick is that we can define the error measure as a function of a single random vector, rather than as a function of a fixed number of random variables
- The computation process then dictates that the error is minimized only on the training set, so in general $\operatorname{error}^{(\text{train})} \leq \operatorname{error}^{(\text{test})}$ (proof skeleton)
  - Underfitting: $\operatorname{error}^{(\text{train})}$ high, $\operatorname{error}^{(\text{test})}$ high
  - Overfitting: $\operatorname{error}^{(\text{train})}$ low, $\operatorname{error}^{(\text{test})}$ high
  - Good fit: $\operatorname{error}^{(\text{train})}$ low, $\operatorname{error}^{(\text{test})}$ slightly higher
  - Unknown fit: $\operatorname{error}^{(\text{train})}$ high, $\operatorname{error}^{(\text{test})}$ low
- One reason overfitting occurs may be that the error we minimize is not the error between our estimate and the true parameter $\theta$, because we never know $\theta$; I think what we actually use in computation is the parameter $\hat{\theta}_{\text{emp}}$ of the empirical distribution, so the real relationship may be that we minimize the error towards $\hat{\theta}_{\text{emp}}$, not towards $\theta$ (a polynomial-fitting sketch follows below)
  - So beyond a certain point, the closer you get to $\hat{\theta}_{\text{emp}}$, the farther you may drift from $\theta$
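A sketch (my own; the polynomial degrees and noise level are arbitrary) reproducing the underfitting/good-fit/overfitting pattern from the list above:

```python
import numpy as np

rng = np.random.default_rng(8)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(scale=0.3, size=n)  # true f + noise

x_tr, y_tr = make_data(30)
x_te, y_te = make_data(1000)

for degree in (1, 4, 15):
    w = np.polyfit(x_tr, y_tr, degree)          # minimizes *training* MSE only
    err_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(degree, err_tr, err_te)

# Typically: degree 1 -> both errors high (underfitting);
# degree 4 -> both low (good fit);
# degree 15 -> training error very low, test error high (overfitting).
```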
8.3 Bias-Variance Tradeoff
Here I just want to point out that this tradeoff actually has little to do with the equation:

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2$$

The bias-variance tradeoff is really about weighing different types of models against each other (high-variance-low-bias vs low-variance-high-bias); once the model changes, the estimator changes with it, so the terms in this equation no longer refer to one and the same estimator.

So why high-variance-low-bias vs low-variance-high-bias? Because these two are the normal cases: a high-variance-high-bias model is simply broken, and a low-variance-low-bias model is already the optimum, so there is nothing left to trade off.
I have written a post, Hypothesis Space / Underfitting / Overfitting / Bias / Variance, that you can refer to.
The general pattern:

- conservative (less complex) models tend to be low-variance-high-bias, with a tendency to underfit
- flexible (more complex) models tend to be high-variance-low-bias, with a tendency to overfit
8.4 Bayes Error
We commonly assume a linear model $y = w^{\top} x + \epsilon$, where $\epsilon$ is a noise term. Even an ideal model that knew the true distribution would still incur some error because of the noise; that error is the Bayes error.

Sometimes, for the convenience of further analysis, we additionally assume $\epsilon \sim N(0, \sigma^2)$.
8.5 Estimators used in Machine Learning
Everything above was about "using general estimation concepts to understand the general behavior of machine learning"; what follows is about "how estimation is concretely performed in machine learning".

In fact, performing XXX estimation correspondingly yields an XXX estimator, though I always find it easier to read MLE as Maximum Likelihood Estimator, whatever.

There are many kinds of estimation, and you need to distinguish the abstract estimator frameworks from the concrete estimators, just as some estimations should be understood as interfaces and others as implementations.
The major estimation frameworks are:

- Frequentist Statistics
  - MLE, Maximum Likelihood Estimation
    - equivalent to minimizing the KL divergence between the empirical distribution and the model distribution
    - equivalent to minimizing cross-entropy
      - When the model is Gaussian, equivalent to minimizing MSE
        - That is, MSE is the cross-entropy between the empirical distribution and a Gaussian model.
- Bayesian Statistics
  - MAP, Maximum A Posteriori (Estimation)
8.5.1 MLE
In this subsection we use the following notation:

- True but unknown data generating distribution $p_{\text{data}}$ (the estimand, corresponding to $\mathbb{P}$ above)
  - We never actually use this distribution during estimation; it is listed only for correspondence
- Distribution of the statistical model $p_{\text{model}}(\,\cdot\,; \theta)$ (the estimator)
- Training data $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$ (assumed to be drawn independently from $p_{\text{data}}$)
- Empirical distribution of the training data $\hat{p}_{\text{data}}$ (approximation of the estimand)

The maximum likelihood estimator is then

$$\theta_{\text{ML}} = \operatorname*{argmax}_{\theta} \; p_{\text{model}}(\mathbb{X}; \theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\big(x^{(i)}; \theta\big)$$
- For entropy, cross-entropy, and KL divergence, see Convex Functions / Jensen's Inequality / Jensen's Inequality on Expectations / Gibbs' Inequality / Entropy
- Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer.
- Note: the training data $\mathbb{X}$ here should be understood as a set of examples, where each example may take the form $x$ or $(x, y)$. See Terminology Recap: Generative Models / Discriminative Models / Frequentist Machine Learning / Bayesian Machine Learning / Supervised Learning / Unsupervised Learning
8.5.2 MAP
Suppose we have training data $\mathbb{X} = \{x^{(1)}, \dots, x^{(m)}\}$.

Bayesians represent their assumptions about $\theta$ with a prior and update it using the data:

- Prior: $p(\theta)$
- Likelihood: $p(\mathbb{X} \mid \theta)$
- Posterior: $p(\theta \mid \mathbb{X}) \propto p(\mathbb{X} \mid \theta) \, p(\theta)$

Maximum A Posteriori (Estimation) is really just Maximum Posterior Estimation; we simply call it MAP rather than MPE because it sounds fancier:

$$\theta_{\text{MAP}} = \operatorname*{argmax}_{\theta} \; p(\theta \mid \mathbb{X}) = \operatorname*{argmax}_{\theta} \big[ \log p(\mathbb{X} \mid \theta) + \log p(\theta) \big]$$
- MAP with a Gaussian prior on the weights thus corresponds to weight decay.
  - weight decay is essentially the restriction on the hypothesis space that we discussed in Hypothesis Space / Underfitting / Overfitting / Bias / Variance
  - you can loosely understand weight decay as $L^2$ regularization, though there are subtle differences; see weight decay vs L2 regularization
- Conversely, many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference.
  - This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log p(\theta)$.
- Compared with MLE, MAP has the effect of increasing bias and decreasing variance (see the sketch after this list).
  - This is easy to understand once you connect it to weight decay.
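A sketch (my own; a Gaussian-mean model with a Gaussian prior, all constants arbitrary) showing the MAP estimate shrinking toward the prior mean, which is exactly the weight-decay effect:

```python
import numpy as np

rng = np.random.default_rng(10)
sigma2, tau2 = 1.0, 0.5        # likelihood noise variance, prior variance
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)
n = len(x)

# Model: x_i ~ N(theta, sigma2), prior: theta ~ N(0, tau2).
# log posterior = log likelihood + log prior; the -theta^2 / (2 tau2) term
# acts exactly like an L2 (weight decay style) penalty on theta.
theta_mle = x.mean()
theta_map = (n / sigma2) / (n / sigma2 + 1 / tau2) * x.mean()  # shrunk toward 0

print(theta_mle, theta_map)  # MAP is pulled toward the prior mean: more bias,
                             # but lower variance across repeated samples
```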
8.5.3 MLE vs MAP
From the point of view of the expressions, if the prior $p(\theta)$ is uniform, MAP degenerates into MLE: the $\log p(\theta)$ term is a constant and does not affect the $\operatorname{argmax}$.

- To emphasize again: the posterior and the likelihood clearly have different probabilistic meanings, but that does not prevent them from sharing the same expression (up to a constant)

So in this sense, MLE can be viewed as MAP with a uniform prior.