
Based on the paper Multi-label Linear Discriminant Analysis


Linear discriminant analysis (LDA) is a well-known method for dimensionality reduction.

Given a data set with $ n $ samples $ \lbrace x^{(i)}, y^{(i)} \rbrace^n_{i=1} $ and $ K $ classes, where $ x^{(i)} \in \mathbb{R}^p $ and $ y^{(i)} \in \lbrace0, 1\rbrace^K $ (a $ K $-dimensional 0-1 vector). $ y^{(i)}_k = 1 $ if $ x^{(i)} $ belongs to the $k^{th}$ class, and 0 otherwise.

Let the input data be partitioned into $ K $ groups as $ \lbrace \pi_k \rbrace^K_{k=1} $, where $ \pi_k $ denotes the group of the $k^{th}$ class with $ n_k $ data points. Classical LDA deals with single-label problems, where data partitions are mutually exclusive, i.e., $ \pi_i \cap \pi_j = \varnothing $ if $ i \neq j $, and $ \sum^{K}_{k=1} n_k = n $.

We write $ X = [ x^{(1)},\cdots,x^{(n)} ]^T $ and

\[Y = [y^{(1)},\cdots,y^{(n)}]^T = [y_{(1)},\cdots,y_{(K)}]\]

where $ y_{(k)} \in \lbrace 0, 1 \rbrace^n $ is the class-wise label indicator vector for the $k^{th}$ class.

A quick recap of the dimensions (see the sketch after this list):

  • # of features = $ p $
  • # of samples = $ n $
  • $ x^{(i)} $ is a $ p \times 1 $ vector
  • $ X $ is an $ n \times p $ matrix
  • $ y^{(i)} $ is a $ K \times 1 $ vector
  • $ y_{(k)} $ is an $ n \times 1 $ vector
  • $ Y $ is an $ n \times K $ matrix
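To make the bookkeeping concrete, here is a minimal NumPy sketch of these shapes; the sizes `n, p, K` and the random data are purely hypothetical:

```python
import numpy as np

n, p, K = 100, 5, 3                      # hypothetical sample, feature, and class counts
rng = np.random.default_rng(0)

X = rng.standard_normal((n, p))          # X is n x p: one sample x^(i) per row
labels = rng.integers(0, K, size=n)      # single-label case: one class index per sample

Y = np.zeros((n, K))                     # Y is n x K: rows are y^(i), columns are y_(k)
Y[np.arange(n), labels] = 1              # y^(i)_k = 1 iff x^(i) belongs to class k

n_k = Y.sum(axis=0)                      # class sizes n_k
assert X.shape == (n, p) and Y.shape == (n, K)
assert n_k.sum() == n                    # mutually exclusive partitions: the n_k sum to n
```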

Classical LDA seeks a linear transformation $ G \in \mathbb{R}^{p \times r} $ that maps $ x^{(i)} $ in the original $p$-dimensional space to $ q^{(i)} \in \mathbb{R}^{r} $ in a lower $r$-dimensional space ($ r < p $) by $ q^{(i)} = G^T x^{(i)} $. In classical LDA, the between-class, within-class, and total-class scatter matrices are defined as follows:

\[\begin{align} S_b &= \sum_{k=1}^{K}{n_k(m_k - m)(m_k - m)^T} \newline S_w &= \sum_{k=1}^{K}{\sum_{x^{(i)} \in \pi_k}{(x^{(i)} - m_k)(x^{(i)} - m_k)^T}} \newline S_t &= \sum_{i=1}^{n}{(x^{(i)} - m)(x^{(i)} - m)^T} \end{align}\]

where $ m_k = \frac{1}{n_k} \sum_{x^{(i)} \in \pi_k}{x^{(i)}} $ is the class mean (class centroid) of the $k^{th}$ class, $ m = \frac{1}{n} \sum_{i=1}^{n}{x^{(i)}} $ is the global mean (global centroid), and $ S_t = S_b + S_w $.
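The three scatter matrices can be computed directly from their definitions. Below is a self-contained sketch under the same hypothetical setup as above (random data, not the paper's code), including a check that $ S_t = S_b + S_w $:

```python
import numpy as np

n, p, K = 100, 5, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
labels = rng.integers(0, K, size=n)      # class index of each sample

m = X.mean(axis=0)                       # global centroid m
S_b = np.zeros((p, p))
S_w = np.zeros((p, p))
for k in range(K):
    X_k = X[labels == k]                 # samples in pi_k
    m_k = X_k.mean(axis=0)               # class centroid m_k
    d = (m_k - m)[:, None]               # p x 1 column vector
    S_b += X_k.shape[0] * (d @ d.T)      # n_k (m_k - m)(m_k - m)^T
    S_w += (X_k - m_k).T @ (X_k - m_k)   # sum over pi_k of (x - m_k)(x - m_k)^T

S_t = (X - m).T @ (X - m)                # total scatter
assert np.allclose(S_t, S_b + S_w)       # verifies S_t = S_b + S_w
```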

The optimal $ G $ is chosen such that the between-class distance is maximized whilst the within-class distance is minimized in the low-dimensional projected space, which leads to the standard LDA optimization objective as follows:

\[\begin{align} J &= \operatorname{tr} \left ( \left ( G^T S_w G \right )^{-1} \left ( G^T S_b G \right ) \right ) \newline G^* &= \underset{G}{\operatorname{argmax}} J \end{align}\]
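One standard way to solve this objective (a sketch, not necessarily the procedure used in the multi-label paper) is the generalized eigenvalue problem $ S_b g = \lambda S_w g $: the columns of $ G^* $ are the eigenvectors associated with the $r$ largest eigenvalues. Continuing from the scatter-matrix sketch above:

```python
from scipy.linalg import eigh              # assumes S_b, S_w, X, K and numpy as np from the sketch above

r = K - 1                                  # classical LDA gives at most K - 1 useful directions
evals, evecs = eigh(S_b, S_w)              # solves S_b g = lambda S_w g (S_w must be positive definite)
G = evecs[:, np.argsort(evals)[::-1][:r]]  # keep eigenvectors of the r largest eigenvalues

Q = X @ G                                  # projected data, n x r; row i is q^(i) = G^T x^(i)
```

When $ S_w $ is singular (e.g. $ p > n $), a regularized $ S_w + \epsilon I $ or pseudo-inverse variant is commonly used instead.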

In linear algebra, the trace of an $ n \times n $ square matrix $ A $ is defined to be the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of $ A $, i.e.,

\[\operatorname{tr}(A) = a_{11} + a_{22} + \dots + a_{nn} = \sum_{i=1}^{n} a_{ii}\]
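As a quick sanity check of the definition, on a hypothetical $ 2 \times 2 $ matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.trace(A))                       # 5.0: the sum of the main-diagonal entries 1 + 4
```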
