
1. Motivation

Dimensionality Reduction helps in:

  • Data Compression
  • Visualization (because we can only plot 2D or 3D)

Principal Component Analysis, abbreviated as PCA, is the most commonly used algorithm for implementing Dimensionality Reduction.

2. PCA: Problem Formulation

2.1 Reduce from 2 dimensions to 1 dimension

Suppose the two features are $x_1$ and $x_2$, plotted on the x-axis and y-axis respectively. If there is a line such that the projections of all the points $(x_1, x_2)$ onto it can replace $x_1$ and $x_2$, then we have reduced 2D to 1D (treat the line itself as the new x-axis; all the projected points lie on it, so the y-axis is no longer needed).

How do we judge whether "the projected points can replace $x_1$ and $x_2$"? This is determined by comparing the variance before and after the replacement. If the variance after the replacement has not lost much, the replacement can be considered valid. The problem therefore becomes "find a replacement that maximizes the variance after the replacement".

After some further derivation (not covered here), the problem becomes: find a direction (a vector $u^{(1)} \in \mathbb{R}^2$) onto which to project the data so as to minimize the projection error.

2.2 Reduce from n dimensions to k dimensions

Similarly: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data so as to minimize the projection error.

3. PCA: Algorithm

3.1 Data preprocessing

Given the training set $x^{(1)}, \dots, x^{(m)}$, the preprocessing is mean normalization:

  • calculate $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
  • replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$

If different features are on different scales (e.g. $x_1$ = size of house, $x_2$ = number of bedrooms), also scale the features so that they have a comparable range of values.
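
A rough sketch of this preprocessing in Octave (a sketch only; `X`, `mu`, `sigma_feat` and `X_norm` are names assumed here, with the $m$ training examples stored as the rows of an $m \times n$ matrix `X`):

```octave
mu = mean(X);                      % 1 x n row vector of per-feature means
X_norm = X - mu;                   % mean normalization (Octave broadcasts mu over the rows)

sigma_feat = std(X_norm);          % per-feature standard deviation, for feature scaling
sigma_feat(sigma_feat == 0) = 1;   % guard against division by zero for constant features
X_norm = X_norm ./ sigma_feat;     % now every feature has a comparable range of values
```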

3.2 PCA algorithm and implementation in Octave

Suppose we are reducing the data from $n$ dimensions to $k$ dimensions.

Step 1: Compute the covariance matrix

Non-vectorized formula: $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$

$x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}$ (size $= n \times 1$)

Each term $x^{(i)} (x^{(i)})^T$ has size $(n \times 1)(1 \times n) = n \times n$, so $\text{size}(\Sigma) = n \times n$

$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$ (size $= m \times n$)

Vectorized formula: $\Sigma = \frac{1}{m} X^T X$
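
In Octave the vectorized formula is a one-liner (a sketch; `X` is assumed to hold the mean-normalized examples as rows):

```octave
m = size(X, 1);            % number of training examples
Sigma = (1 / m) * X' * X;  % n x n covariance matrix
```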

Step 2: Compute eigenvectors of the covariance matrix

In Octave: [U, S, V] = svd(Sigma), where svd stands for Singular Value Decomposition. eig(Sigma) also works, but is numerically less stable.

The covariance matrix always satisfies a property called "symmetric positive semidefinite", so svd and eig give the same result here.
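
As a quick numerical sanity check of that property (a sketch, using the Sigma computed above):

```octave
issymmetric(Sigma)   % expect 1: Sigma equals its own transpose
min(eig(Sigma))      % expect >= 0 (up to floating-point error): positive semidefinite
```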

The structure of U in [U, S, V] is:

$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix}$ (size $= n \times n$)

$u^{(i)} = \begin{bmatrix} u_1^{(i)} \\ u_2^{(i)} \\ \vdots \\ u_n^{(i)} \end{bmatrix}$ (size $= n \times 1$)

Step 3: Generate the k dimensions

We want to reduce to $k$ dimensions, so take the first $k$ columns of $U$, i.e. $u^{(1)}, u^{(2)}, \dots, u^{(k)}$, to form $U_{\text{reduce}}$:

$U_{\text{reduce}} = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(k)} \\ | & | & & | \end{bmatrix}$ (size $= n \times k$)

In Octave, use U_reduce = U(:, 1:k).

The new representation: $z^{(i)} = (U_{\text{reduce}})^T x^{(i)}$ (size $= (k \times n)(n \times 1) = k \times 1$)

Vectorized formula: $Z = X U_{\text{reduce}}$ (size $= (m \times n)(n \times k) = m \times k$)

The structure of $Z$ is:

$Z = \begin{bmatrix} (z^{(1)})^T \\ (z^{(2)})^T \\ \vdots \\ (z^{(m)})^T \end{bmatrix}$ (size $= m \times k$)

$z^{(i)} = \begin{bmatrix} z_1^{(i)} \\ z_2^{(i)} \\ \vdots \\ z_k^{(i)} \end{bmatrix}$ (size $= k \times 1$)
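
Putting Steps 2 and 3 together in Octave (a sketch; the variable names U_reduce and Z mirror the notation above, and k is whatever target dimension you picked):

```octave
[U, S, V] = svd(Sigma);    % columns of U are the principal directions u^(1), ..., u^(n)

k = 2;                     % example: reduce to 2 dimensions
U_reduce = U(:, 1:k);      % n x k: the first k columns of U
Z = X * U_reduce;          % m x k: row i is (z^(i))'
```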

4. Reconstruction from Compressed Representation

Reconstruction here means reconstructing $X$ from $Z$; more concretely, using $Z$ to compute an approximation of $X$.

The formula is: $x_{\text{approx}}^{(i)} = U_{\text{reduce}} \, z^{(i)}$ (size $= (n \times k)(k \times 1) = n \times 1$)

Vectorized formula: $X_{\text{approx}} = Z \, U_{\text{reduce}}^T$
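
In Octave, continuing from the variables above (a sketch):

```octave
X_approx = Z * U_reduce';  % m x n approximation of the (mean-normalized) data
% To return to the original scale, also add mu back (and undo any feature scaling).
```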

5. Choosing the Number of Principal Components

I.e. how to choose k.

5.1 Algorithm

Average squared projection error: $\text{ASPE} = \frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x_{\text{approx}}^{(i)} \rVert^2$

Total variation in the data: $\text{TV} = \frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2$

Typically, choose $k$ to be the smallest value that satisfies $\frac{\text{ASPE}}{\text{TV}} \le 0.01$, which means "99% of the variance is retained".

In the implementation we still have to try $k = 1, 2, \dots$ one value at a time.

5.2 Convenient calculation with SVD results

We can use the $S$ returned by [U, S, V] = svd(Sigma) to simplify the computation. $S$ is an $n \times n$ diagonal matrix:

$S = \begin{bmatrix} s_{11} & & & \\ & s_{22} & & \\ & & \ddots & \\ & & & s_{nn} \end{bmatrix}$

For a given $k$, $\frac{\text{ASPE}}{\text{TV}} = 1 - \frac{\sum_{i=1}^{k} s_{ii}}{\sum_{i=1}^{n} s_{ii}}$. (Note that the $s_{ii}$ are in decreasing order, i.e. $s_{11}$ accounts for the largest share of the variance, $s_{22}$ for the next largest, and so on.)

This way we only need to run [U, S, V] = svd(Sigma) once, then try $k = 1, 2, \dots$ until $\frac{\sum_{i=1}^{k} s_{ii}}{\sum_{i=1}^{n} s_{ii}} \ge 0.99$, instead of recomputing $\frac{\text{ASPE}}{\text{TV}}$ from its definition each time.
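
A sketch of this search in Octave, reading the $s_{ii}$ values off the diagonal of S (svd already returns them in decreasing order):

```octave
[U, S, V] = svd(Sigma);
s = diag(S);                        % s(1) >= s(2) >= ... >= s(n)
total = sum(s);

for k = 1:length(s)
  if sum(s(1:k)) / total >= 0.99    % at least 99% of the variance retained
    break;
  end
end
printf('smallest k retaining 99%% of variance: %d\n', k);
```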

6. Advice for Applying PCA

6.1 Good use of PCA

Application of PCA:

  • Compression
    • Reduce memory/disk needed to store data
    • Speed up learning algorithm
      • choose k by the desired % of variance retained (e.g. 99%)
  • Visualization
    • choose k=2 or k=3

PCA can be used to speed up a learning algorithm, most commonly a supervised learning algorithm.

Suppose we have $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$ with $n = 10000$ features. Extract the inputs to form an unlabeled dataset $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^{10000}$. If PCA is applied and we reduce to, say, 1000 features, we get $z^{(1)}, \dots, z^{(m)} \in \mathbb{R}^{1000}$. We then have a new training set $(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$, which is much cheaper to learn from computationally.

6.2 Bad use of PCA

  • To prevent overfitting
    • Using PCA to reduce the number of features, on the theory that fewer features make overfitting less likely.

This is a bad use because PCA is not a good way to address overfitting: it throws away information without ever looking at the labels $y^{(i)}$. Use regularization instead.

6.3 Implementation tips

  • Note 1: The mapping $x^{(i)} \mapsto z^{(i)}$ should be defined by running PCA only on the training set. This same mapping can then be applied to the examples $x_{\text{cv}}^{(i)}$ and $x_{\text{test}}^{(i)}$ in the cross-validation and test sets (see the sketch after this list).
  • Note 2: Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that does not do what you want should you implement PCA and consider using $z^{(i)}$.
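
A minimal sketch of Note 1 in Octave, assuming the data has already been split into X_train and X_test (hypothetical names) and some k has been chosen: mu and U_reduce are fit on the training set only and then reused as-is for the test (or cross-validation) examples.

```octave
% Fit the mapping on the training set only.
mu = mean(X_train);
X_train_norm = X_train - mu;
Sigma = (1 / size(X_train, 1)) * (X_train_norm' * X_train_norm);
[U, S, V] = svd(Sigma);
U_reduce = U(:, 1:k);
Z_train = X_train_norm * U_reduce;

% Apply the same mu and U_reduce to the test set -- do not refit PCA here.
Z_test = (X_test - mu) * U_reduce;
```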
