Machine Learning: Dimensionality Reduction

September 6, 2014 (4 minute read)

1. Motivation

Dimensionality Reduction helps in:

Data Compression
Visualization (because we can only plot 2D or 3D)

Principal Component Analysis (PCA) is the algorithm used here to perform dimensionality reduction.

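2. PCA: Problem Formulation

2.1 Reduce from 2-dimension to 1-dimension

Suppose the two features are $x_{1}$ and $x_{2}$, plotted along the x-axis and y-axis. If there is a line such that the projections of all $m$ points $(x_{1}, x_{2})$ onto it can stand in for $x_{1}$ and $x_{2}$, then we have reduced 2D to 1D (treat the line itself as the x-axis; every projected point lies on it, so the y-axis is no longer needed).

How do we decide whether "the projected points can stand in for $x_{1}$ and $x_{2}$"? By comparing the variance before and after the replacement. If the replacement does not lose much variance, it can be considered effective. The problem therefore becomes "find the replacement that maximizes the variance afterwards".

After further derivation (not covered in depth here), the problem becomes: Find a direction (a vector $u^{(1)} \in \mathbb{R}^{2}$) onto which to project the data so as to minimize the projection error.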
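The skipped step can be outlined as follows. This is only a sketch, and it assumes two things the text does not spell out: the data are already mean-normalized, and $u$ is a unit vector ($\| u \| = 1$). By the Pythagorean theorem, each point splits into its projection onto $u$ and the residual:

$\| x^{(i)} \|^{2} = (u^{T} x^{(i)})^{2} + \| x^{(i)} - (u^{T} x^{(i)}) u \|^{2}$

Averaging over all $m$ points:

$\frac{1}{m} \sum_{i = 1}^{m} \| x^{(i)} \|^{2} = \frac{1}{m} \sum_{i = 1}^{m} (u^{T} x^{(i)})^{2} + \frac{1}{m} \sum_{i = 1}^{m} \| x^{(i)} - (u^{T} x^{(i)}) u \|^{2}$

The left-hand side does not depend on $u$. The first term on the right is the variance of the projected data, and the second is the average squared projection error. Their sum is fixed, so maximizing the variance and minimizing the projection error select the same direction $u$.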

2.2 Reduce from n-dimension to $k$-dimension

Similarly, find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)} \in \mathbb{R}^{n}$ onto which to project the data so as to minimize the projection error.

3. PCA: Algorithm

3.1 Data preprocessing

Given a training set $x^{(1)}, \dots, x^{(m)}$, preprocess it with mean normalization:

calculate $\mu_{j} = \frac{1}{m} \sum_{i = 1}^{m} x_{j}^{(i)}$
replace each $x_{j}^{(i)}$ with $x_{j}^{(i)} - \mu_{j}$

If different features are on different scales (e.g. $x_{1} = \text{size of house}$, $x_{2} = \text{number of bedrooms}$), also scale the features so that they have comparable ranges of values.
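The preprocessing above might look like this in Python with NumPy (used for illustration here instead of Octave). It is a minimal sketch: dividing by the standard deviation is one assumed way to get "comparable ranges", and the function name and the housing numbers are made up for the example.

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize each column of X, then scale it by its standard deviation."""
    mu = X.mean(axis=0)          # mu_j for every feature j
    sigma = X.std(axis=0)        # per-feature scale (assumed choice of scaling)
    sigma[sigma == 0] = 1.0      # leave constant features unscaled
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# Hypothetical data: column 0 = size of house, column 1 = number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = normalize_features(X)
```

Returning `mu` and `sigma` makes it possible to apply exactly the same shift and scale to new examples later.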

3.2 PCA algorithm and implementation in Octave

Suppose we are reducing data from $n$ dimensions to $k$ dimensions.

Step 1: Compute covariance matrix

The non-vectorized formula is $\Sigma = \frac{1}{m} \sum_{i = 1}^{m} x^{(i)} (x^{(i)})^{T}$

$\because x^{(i)} = \begin{bmatrix} x_{1}^{(i)} \\ x_{2}^{(i)} \\ \vdots \\ x_{n}^{(i)} \end{bmatrix}$ ( $\text{size} = n \times 1$ )

$\therefore \text{size}(\Sigma) = (n \times 1) * (1 \times n) = n \times n$

Also, $\because X = \begin{bmatrix} - (x^{(1)})^{T} - \\ - (x^{(2)})^{T} - \\ \vdots \\ - (x^{(m)})^{T} - \end{bmatrix}$ ( $\text{size} = m \times n$ )

$\therefore$ the vectorized formula is $\Sigma = \frac{1}{m} X^{T} X$
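As a quick check that the two formulas agree, here is a small NumPy sketch (Python rather than Octave, with random data standing in for a real training set) that computes the covariance matrix both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)            # mean normalization (section 3.1)

# Non-vectorized: average of the outer products x^(i) (x^(i))^T.
Sigma_loop = np.zeros((n, n))
for i in range(m):
    x_i = X[i].reshape(n, 1)      # n x 1 column vector
    Sigma_loop += x_i @ x_i.T     # (n x 1)(1 x n) = n x n
Sigma_loop /= m

# Vectorized: Sigma = (1/m) X^T X.
Sigma = (X.T @ X) / m
```

Both results are $n \times n$ and match up to rounding. The vectorized form avoids the explicit loop over the $m$ examples.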

Step 2: Compute eigenvectors of covariance matrix

In Octave, use [U, S, V] = svd(Σ), where svd stands for Singular Value Decomposition. eig(Σ) also works but is numerically less stable.

A covariance matrix is always symmetric positive semidefinite, so for this matrix svd and eig give the same result: the singular values equal the eigenvalues, and the columns of U are the eigenvectors (up to sign and ordering).

The structure of U in [U, S, V] is:

$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \dots & u^{(n)} \\ | & | & & | \end{bmatrix}$ ( $\text{size} = n \times n$ )

$u^{(i)} = \begin{bmatrix} u_{1}^{(i)} \\ u_{2}^{(i)} \\ \vdots \\ u_{n}^{(i)} \end{bmatrix}$ ( $\text{size} = n \times 1$ )
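The relationship between svd and eig on a covariance matrix can be checked with a short NumPy sketch (Python, not Octave; the data are random and purely illustrative). One difference from Octave: `np.linalg.svd` returns the singular values as a 1-D array rather than a diagonal matrix, and returns the last factor already transposed.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / X.shape[0]

# s holds the diagonal of S as a 1-D array; Vt is V transposed.
U, s, Vt = np.linalg.svd(Sigma)

# eigh is NumPy's eigen-solver for symmetric matrices;
# it returns the eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
```

Here `s` matches `eigvals` in reverse order, and each column of `U` satisfies $\Sigma u^{(i)} = s_{ii} u^{(i)}$, which is the eigenvector property.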

Step 3: Generate the $k$ dimensions

We want to reduce to $k$ dimensions, so take the first $k$ columns of U, i.e. $u^{(1)}, u^{(2)}, \dots, u^{(k)}$, to form $U_{reduce}$:

$U_{reduce} = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \dots & u^{(k)} \\ | & | & & | \end{bmatrix}$ ( $\text{size} = n \times k$ )

In Octave, use U_reduce = U(:, 1:k).

The new representation is $z^{(i)} = (U_{reduce})^{T} x^{(i)}$ ( $\text{size} = (k \times n) * (n \times 1) = k \times 1$ )

The vectorized formula is $Z = X U_{reduce}$ ( $\text{size} = (m \times n) * (n \times k) = m \times k$ )

The structure of $Z$ is:

$Z = \begin{bmatrix} - (z^{(1)})^{T} - \\ - (z^{(2)})^{T} - \\ \vdots \\ - (z^{(m)})^{T} - \end{bmatrix}$ ( $\text{size} = m \times k$ )

$z^{(i)} = \begin{bmatrix} z_{1}^{(i)} \\ z_{2}^{(i)} \\ \vdots \\ z_{k}^{(i)} \end{bmatrix}$ ( $\text{size} = k \times 1$ )
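Putting the three steps together, a minimal NumPy sketch of the projection could look like this (Python instead of Octave; the function name `project` and the random data are assumptions for illustration):

```python
import numpy as np

def project(X, k):
    """Project mean-normalized X (m x n) onto its first k principal components."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m               # Step 1: covariance matrix
    U, s, Vt = np.linalg.svd(Sigma)     # Step 2: eigenvectors in the columns of U
    U_reduce = U[:, :k]                 # Step 3: Octave's U(:, 1:k)
    Z = X @ U_reduce                    # (m x n)(n x k) = m x k
    return Z, U_reduce

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)
Z, U_reduce = project(X, k=2)

# Per-example formula: z^(i) = U_reduce^T x^(i), which is row i of Z.
z_0 = U_reduce.T @ X[0]
```

Note the index shift: Octave's `1:k` is 1-based and inclusive, while NumPy's `:k` is 0-based and excludes `k`. Both select the first $k$ columns.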

4. Reconstruction from Compressed Representation

Reconstruction here means reconstructing $X$ from $Z$, or more precisely, using $Z$ to compute an approximation of $X$.

The formula is $x_{approx}^{(i)} = U_{reduce} z^{(i)}$ ( $\text{size} = (n \times k) * (k \times 1) = n \times 1$ )

The vectorized formula is $X_{approx} = Z U_{reduce}^{T}$ ( $\text{size} = (m \times k) * (k \times n) = m \times n$ )
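The reconstruction formula can be tried out with a small NumPy sketch (Python rather than Octave; `reconstruct` is a hypothetical helper and the data are random). It also shows that keeping all $n$ components recovers $X$ exactly, while a smaller $k$ gives only an approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
X = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd((X.T @ X) / X.shape[0])

def reconstruct(X, U, k):
    """Compress X to k dimensions, then map it back to n dimensions."""
    U_reduce = U[:, :k]
    Z = X @ U_reduce             # m x k compressed representation
    return Z @ U_reduce.T        # m x n approximation of X

X_approx_2 = reconstruct(X, U, k=2)
X_approx_4 = reconstruct(X, U, k=4)   # k = n: nothing is discarded

# Average squared distance between each x^(i) and its approximation.
err_2 = np.mean(np.sum((X - X_approx_2) ** 2, axis=1))
```

With $k = 2$ the error is positive. With $k = n = 4$ it vanishes up to rounding, because $U U^{T} = I$.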

5. Choosing the Number of Principal Components

That is, how to choose $k$.

5.1 Algorithm

Average squared projection error: $\text{ASPE} = \frac{1}{m} \sum_{i = 1}^{m} \| x^{(i)} - x_{approx}^{(i)} \|^{2}$

Total variation in the data: $\text{TV} = \frac{1}{m} \sum_{i = 1}^{m} \| x^{(i)} \|^{2}$

Typically, choose $k$ to be the smallest value that satisfies $\frac{\text{ASPE}}{\text{TV}} \le 0.01$, which means "99% of variance is retained".

In practice, a direct implementation still has to try $k = 1, 2, \dots$ one value at a time.
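A direct, try-each-$k$ version of this procedure might look like the NumPy sketch below (Python, not Octave). The function name and the synthetic data are assumptions: the data are built to lie close to a 2-dimensional subspace of a 5-dimensional space, so the loop should stop at $k = 2$.

```python
import numpy as np

def choose_k_by_error(X, threshold=0.01):
    """Smallest k with ASPE / TV <= threshold, for mean-normalized X."""
    m, n = X.shape
    U, s, Vt = np.linalg.svd((X.T @ X) / m)
    tv = np.mean(np.sum(X ** 2, axis=1))             # total variation
    for k in range(1, n + 1):
        U_reduce = U[:, :k]
        X_approx = (X @ U_reduce) @ U_reduce.T
        aspe = np.mean(np.sum((X - X_approx) ** 2, axis=1))
        if aspe / tv <= threshold:
            return k
    return n

# Synthetic data: two strong latent directions plus a little noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))
W = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])
X = latent @ W + 0.01 * rng.normal(size=(300, 5))
X = X - X.mean(axis=0)

k = choose_k_by_error(X)
```

Each iteration recomputes the approximation and the error from scratch, which is the cost that a shortcut based on the singular values avoids.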

5.2 Convenient calculation with SVD results

We can use the $S$ from [U, S, V] = svd(Σ) to simplify the calculation. $S$ is an $n \times n$ diagonal matrix:

$S = \begin{bmatrix} s_{11} & & & \\ & s_{22} & & \\ & & \ddots & \\ & & & s_{nn} \end{bmatrix}$

For a given $k$, $\frac{\text{ASPE}}{\text{TV}} = 1 - \frac{\sum_{i = 1}^{k} s_{ii}}{\sum_{i = 1}^{n} s_{ii}}$. (Note that the $s_{ii}$ are in decreasing order, i.e. $s_{11}$ accounts for the largest share of the variance, $s_{22}$ the next largest, and so on.)

This way we only need to compute [U, S, V] = svd(Σ) once and then try $k = 1, 2, \dots$ until $\frac{\sum_{i = 1}^{k} s_{ii}}{\sum_{i = 1}^{n} s_{ii}} \ge 0.99$, instead of evaluating the $\text{ASPE}$ and $\text{TV}$ formulas for every $k$.
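With the singular values in hand, the search for $k$ reduces to a cumulative sum. A minimal NumPy sketch (Python rather than Octave; the function name and the example values of $s_{ii}$ are made up for illustration):

```python
import numpy as np

def choose_k_from_singular_values(s, retain=0.99):
    """Smallest k whose first k singular values hold at least `retain` of the total."""
    ratio = np.cumsum(s) / np.sum(s)          # retained variance for k = 1..n
    # ratio is non-decreasing, so searchsorted finds the first entry >= retain.
    k = int(np.searchsorted(ratio, retain)) + 1
    return min(k, len(s))

# Hypothetical diagonal of S, already in decreasing order.
s = np.array([6.0, 3.0, 0.95, 0.04, 0.01])
k = choose_k_from_singular_values(s)
```

For these values the retained fractions are roughly 0.6, 0.9, 0.995, 0.999 and 1, so the 99% threshold is first met at $k = 3$.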

6. Advice for Applying PCA

6.1 Good use of PCA

Applications of PCA:

Compression
- Reduce memory/disk needed to store data
- Speed up learning algorithm
  - choose $k$ by the percentage of variance retained
Visualization
- choose $k = 2$ or $k = 3$

PCA can be used to speed up a learning algorithm, most commonly in supervised learning.

Suppose we have $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$ and $n = 10000$ (the number of features). Extract the inputs to form an unlabeled dataset $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^{10000}$. If we apply PCA and reduce to, say, 1000 features, we get $z^{(1)}, \dots, z^{(m)} \in \mathbb{R}^{1000}$. We then have a new training set $(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$, which is much cheaper to train on.
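The workflow in the last paragraph can be sketched in NumPy (Python, not Octave). Everything concrete here is an assumption for illustration: the dimensions are shrunk from 10000 and 1000 down to 100 and 10, the data are synthetic, and least-squares linear regression stands in for whatever learning algorithm is actually used.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 500, 100, 10

# Hypothetical labeled data whose inputs lie near a k-dimensional subspace.
latent = rng.normal(size=(m, k))
X = latent @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))
y = latent @ rng.normal(size=k)

# Run PCA on the inputs only; the labels y are not used.
mu = X.mean(axis=0)
Xc = X - mu
U, s, Vt = np.linalg.svd((Xc.T @ Xc) / m)
Z = Xc @ U[:, :k]                       # new inputs z^(i), m x k

# Train on (z^(i), y^(i)) with an intercept column.
A = np.hstack([np.ones((m, 1)), Z])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.mean((A @ theta - y) ** 2)
```

The learner now fits 11 parameters instead of 101. On this synthetic data little is lost, because the inputs were constructed to be almost exactly 10-dimensional.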

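6.2 Bad use of PCA

To prevent overfitting
- The idea: use PCA to reduce the number of features; with fewer features, the model is less likely to overfit.

This is a bad use because PCA is not a good way to address overfitting. PCA discards some information without looking at the labels $y$, whereas regularization takes them into account. Use regularization instead.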

6.3 Implementation tips

Note 1: The mapping $x^{(i)} \to z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross validation and test sets.
Note 2: Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that does not do what you want should you implement PCA and consider using $z^{(i)}$.
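Note 1 can be made concrete by splitting PCA into a fit step and a transform step. The NumPy sketch below (Python, not Octave) uses hypothetical helper names `pca_fit` and `pca_transform` and random data. Feature scaling is left out for brevity; if it is used, its parameters should likewise come from the training set only.

```python
import numpy as np

def pca_fit(X_train, k):
    """Learn the mapping x -> z (mean and U_reduce) from the training set only."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    U, s, Vt = np.linalg.svd((Xc.T @ Xc) / X_train.shape[0])
    return mu, U[:, :k]

def pca_transform(X, mu, U_reduce):
    """Apply a previously learned mapping to any set of examples."""
    return (X - mu) @ U_reduce

rng = np.random.default_rng(6)
X_train = rng.normal(size=(80, 6))
X_cv = rng.normal(size=(20, 6))
X_test = rng.normal(size=(20, 6))

mu, U_reduce = pca_fit(X_train, k=3)
Z_train = pca_transform(X_train, mu, U_reduce)
Z_cv = pca_transform(X_cv, mu, U_reduce)       # reuses the training mu and U_reduce
Z_test = pca_transform(X_test, mu, U_reduce)
```

The cross validation and test sets never influence `mu` or `U_reduce`, so they stay a fair estimate of performance on unseen data.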
