Scaling / Normalization / Standardization

October 8, 2018 · 1 minute read

1. Scaling

This one is simple: scaling usually means a transformation of the form $x \mapsto c x$ or $x \mapsto \frac{x}{c}$ for some nonzero constant $c$.

2. Standardization

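A transformation that gives the data mean 0 and variance 1, matching the parameters of the standard normal distribution $X \sim N (μ = 0, σ^{2} = 1)$. It does not change the shape of the distribution, so the result is normally distributed only if the original data was. It consists of two steps: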

centering: $x \mapsto x - mean (x)$
scaling by standard deviation: $x \mapsto \frac{x}{std (x)}$

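Note:

$μ$ is the mean $mean (x)$
$σ^{2}$ is the variance $var (x)$
$σ$ is the standard deviation $std (x)$; "unit variance" means $σ^{2} = 1$, which is what the data has after standardization

Putting the two steps together: $x \mapsto \frac{x - mean (x)}{std (x)}$.

Also, for a single scalar $x_{i}$, $\frac{x_{i} - mean (x)}{std (x)}$ is called the z-score of $x_{i}$.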
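
As a rough illustration of the two steps, here is a minimal sketch in plain Python. The helper name `standardize` and the sample data are made up for this example, and it assumes the population standard deviation (dividing by $n$); the sample version (dividing by $n - 1$) is an equally common choice.

```python
import statistics

def standardize(xs):
    """Return the z-scores of xs: center by the mean, then scale by the std."""
    mean = statistics.fmean(xs)
    # Assumption: population standard deviation (divides by n).
    # statistics.stdev (divides by n - 1) would be the sample version.
    std = statistics.pstdev(xs)
    if std == 0:
        raise ValueError("cannot standardize constant data (std is 0)")
    return [(x - mean) / std for x in xs]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
z = standardize(data)

print(z)                      # the z-score of each element
print(statistics.fmean(z))    # close to 0
print(statistics.pstdev(z))   # close to 1
```

The standardized values keep the same relative spacing as the input; only the location and the spread change.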

3. Normalization

3.1 Norm

Quote from Wikipedia: Norm (mathematics):

Given a vector space $V$ over a subfield $F$ of the complex numbers, a norm on $V$ is a nonnegative-valued scalar function $p : V \to [0, + \infty)$ with the following properties:

$\forall c \in F, u, v \in V$ ,

$p (u + v) \leq p (u) + p (v)$
$p (c v) = | c | p (v)$
If $p (v) = 0$ then $v = 0$ is the zero vector.

Common norms include:

$L_{1}$ norm: $‖ x ‖_{1} = \sum_{i = 1}^{n} | x_{i} |$
$L_{2}$ norm: $‖ x ‖_{2} = \sqrt{\sum_{i = 1}^{n} x_{i}^{2}}$
max norm: $‖ x ‖_{\infty} = \max (| x_{1} |, \dots, | x_{n} |)$
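
The three formulas translate almost directly into code. The sketch below assumes real-valued vectors stored as Python lists; the function names are chosen for this example only.

```python
import math

def l1_norm(x):
    # Sum of absolute values.
    return sum(abs(xi) for xi in x)

def l2_norm(x):
    # Square root of the sum of squares.
    return math.sqrt(sum(xi * xi for xi in x))

def max_norm(x):
    # Largest absolute value (note the abs).
    return max(abs(xi) for xi in x)

x = [3.0, -4.0]
print(l1_norm(x), l2_norm(x), max_norm(x))  # 7.0 5.0 4.0
```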

3.2 Normalization Scenario 1: Algebra

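We often talk about "normalizing a vector into a unit vector". This is really a transformation $v \mapsto \frac{v}{‖ v ‖_{?}}$ (that is, dividing a nonzero vector by its own norm), where $?$ stands for whichever norm was chosen.

We can see that: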

$‖ \frac{v}{‖ v ‖_{1}} ‖_{1} = 1$
$‖ \frac{v}{‖ v ‖_{2}} ‖_{2} = 1$
$‖ \frac{v}{‖ v ‖_{\infty}} ‖_{\infty} = 1$

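We often see statements like "normalization squeezes the data into the interval $[0, 1]$". I suspect this comes from the fact that a unit vector has norm 1, but note that:

a unit vector having norm 1 only tells us that the absolute value of each component $x_{i}$ lies in $[0, 1]$; the components themselves can be negative, so they lie in $[-1, 1]$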
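
To make this concrete, the sketch below normalizes the same vector under each of the three norms. The helper `normalize` and the norm functions are hypothetical names for this example. In every case the result has norm 1, yet one component stays negative.

```python
import math

def l1(v):
    return sum(abs(vi) for vi in v)

def l2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def linf(v):
    return max(abs(vi) for vi in v)

def normalize(v, norm):
    """Divide v by its own norm; `norm` is any function computing a vector norm."""
    n = norm(v)
    if n == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [vi / n for vi in v]

v = [3.0, -4.0]
results = {}
for name, norm in [("L1", l1), ("L2", l2), ("max", linf)]:
    u = normalize(v, norm)
    results[name] = u
    # The norm of u is 1 (up to rounding), but u still has a negative component.
    print(name, u, norm(u))
```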

3.3 Normalization Scenario 2: Statistics

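In statistics, "normalization" can mean all sorts of things. According to Wikipedia's Normalization (statistics) article:

standardization is a kind of normalization
studentization is a kind of normalization
min-max scaling is also a kind of normalization (and it does not fit our definition of scaling, since it subtracts the minimum before dividing)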
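
For reference, min-max scaling is usually written as $x \mapsto \frac{x - \min (x)}{\max (x) - \min (x)}$. The sketch below (function name and data invented for this example) shows that it is a shift followed by a scaling, which is why it falls outside the $x \mapsto c x$ definition from section 1. Unlike vector normalization, it really does map the data into $[0, 1]$.

```python
def min_max_scale(xs):
    """Map xs linearly so that min(xs) becomes 0 and max(xs) becomes 1."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("cannot min-max scale constant data")
    # Shift by the minimum first, then scale by the range.
    return [(x - lo) / (hi - lo) for x in xs]

data = [-5.0, 0.0, 5.0, 15.0]
scaled = min_max_scale(data)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```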

3.4 The Problems with Normalization

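To me, the biggest problem with the statistical definition of normalization is that I can't see what these transformations have to do with a norm.

The second problem follows from it: how do I know whether the normalization you mention is the algebra kind or the statistics kind?

Take sklearn.preprocessing.normalize: I always assumed it did standardization, but it doesn't!

For standardization in sklearn, use sklearn.preprocessing.StandardScaler.

What sklearn.preprocessing.normalize actually does is vector normalization, but its max norm isn't the standard one (it takes the maximum of the raw values, not of their absolute values):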

......:
    ......:
        # https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/preprocessing/data.py#L1564
        if norm == 'l1':
            norms = np.abs(X).sum(axis=1)
        elif norm == 'l2':
            norms = row_norms(X)
        elif norm == 'max':
            norms = np.max(X, axis=1)  # You call this max norm???
        norms = _handle_zeros_in_scale(norms, copy=False)
        X /= norms[:, np.newaxis]
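
The difference only shows up when a row contains negative values. The sketch below is not sklearn code: it reproduces the `'max'` branch of the quoted snippet for a single row in plain Python, with function names made up for this example, and compares it with the max norm from section 3.1.

```python
def row_max(row):
    # Same idea as np.max(X, axis=1) applied to one row: signs are kept.
    return max(row)

def max_norm(row):
    # The max norm from section 3.1: the largest absolute value.
    return max(abs(v) for v in row)

row = [1.0, -4.0, 2.0]

by_max = [v / row_max(row) for v in row]    # divides by 2.0
by_norm = [v / max_norm(row) for v in row]  # divides by 4.0

print(by_max, max_norm(by_max))    # [0.5, -2.0, 1.0] 2.0
print(by_norm, max_norm(by_norm))  # [0.25, -1.0, 0.5] 1.0
```

Dividing by the plain maximum leaves a component with absolute value 2, so the result is not a unit vector under the max norm. Dividing by the largest absolute value does give one.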

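As for the second problem, a simple rule of thumb: when you see terms like $L_{1}$ or $L_{2}$, it has to be the algebra kind of normalization. But the fact that things have come to this is what most deserves complaining about.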
