
I never quite sorted this out before, so let me organize the ideas here.

Summarized from Support Vector Machines and Kernels for Computational Biology, CS229 Lecture note 3, and ESL.


1. Intro

To start with, we will be considering a linear classifier for a binary classification problem with labels $y \in \{-1, 1\}$ and features $x$. We will use parameters $w$, $b$ instead of $\theta$, and write our classifier as

$$h(x) = g(w^T x + b) \tag{1}$$

Here, $g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise.

We call

$$f(x) = w^T x + b \tag{2}$$

the discriminant function.

$f(x) = 0$ then defines our hyperplane; I won't go over the intuition here.

When it comes to the margin, there is one thing to be careful about: how the margin is measured. According to CS229 Lecture note 3, we actually have:

$$\text{functional margin} = \text{geometric margin} \times \|w\| \tag{3}$$

This is a bit hard to remember. Another way to put it: if $x$ is a support vector, then $f(x) = w^T x + b = \pm 1$ (this is in fact the functional margin), and the (geometric) margin equals $\frac{1}{\|w\|}$.
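A quick sanity check with made-up numbers: take $w = (3, 4)$ and a support vector $x$ with $w^T x + b = 1$. The functional margin is $1$, $\|w\| = 5$, and the geometric margin is $1/\|w\| = 0.2$, consistent with equation (3).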

A quick note on notation:

$$w^T x = \langle w, x \rangle, \qquad \|w\|^2 = \langle w, w \rangle$$

  • $w^T x = \langle w, x \rangle$ is called the inner product or dot product
  • $\|w\|$ is called the norm of the vector
  • $w / \|w\|$ is called the unit-length vector (its norm is 1)
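A tiny NumPy sketch of these quantities (my own illustration, not from the notes):

```python
import numpy as np

w = np.array([3.0, 4.0])
x = np.array([1.0, 2.0])

inner = w @ x              # inner product <w, x> = w^T x = 11.0
norm = np.linalg.norm(w)   # norm of w: sqrt(<w, w>) = 5.0
unit = w / norm            # unit-length vector; np.linalg.norm(unit) == 1.0

print(inner, norm, unit)
```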

2. The Non-Linear Case

There is a straightforward way of turning a linear classifier non-linear, or making it applicable to non-vectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function ϕ. The discriminant function then is

$$f(x) = w^T \phi(x) + b \tag{4}$$

Note that $f(x)$ is linear in the feature space defined by the mapping $\phi$; but viewed in the original input space it is a nonlinear function of $x$ if $\phi(x)$ is a nonlinear function.

This mapping $\phi$ can be quite complicated (e.g. $X = [x_1, x_2, x_3]$ and $\phi(X) = [x_1^2, \ldots, x_3^2]$), which makes computing $f(x)$ inconvenient.
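To make the inconvenience concrete, here is a small sketch (my own example, not from the sources) of an explicit degree-2 monomial map $\phi$; the number of mapped features grows quadratically with the input dimension, so forming $\phi(x)$ explicitly gets expensive:

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    # Explicit degree-2 monomial map: all products x_i * x_j with i <= j.
    return np.array([x[i] * x[j]
                     for i, j in combinations_with_replacement(range(len(x)), 2)])

x = np.array([1.0, 2.0, 3.0])
print(phi(x))  # 6 features for a 3-dim input: x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
# A d-dimensional input maps to d*(d+1)/2 features, so computing
# f(x) = w^T phi(x) + b requires materializing all of them.
```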

Kernels, on the other hand, claim that:

Kernel methods avoid this complexity by avoiding the step of explicitly mapping the data to a high dimensional feature-space.

Next, let's see how kernels manage to do this.

3. Enter Lagrange Duality

Let's set $\phi$ aside for now.

We want the maximum margin, which means maximizing $\frac{1}{\|w\|}$, or equivalently minimizing $\|w\|$. So the problem can be written as:

$$\begin{aligned} \min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\ \text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m \end{aligned} \tag{5}$$

Rewriting it slightly:

$$\begin{aligned} \min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\ \text{subject to} \quad & -y^{(i)}(w^T x^{(i)} + b) + 1 \le 0, \quad i = 1, \ldots, m \end{aligned} \tag{OPT}$$

Yes! There it is, (OPT)! Here $g_i(w) = -y^{(i)}(w^T x^{(i)} + b) + 1$ and there are no $h_i(w)$ equality constraints, so in the standard Lagrangian $\mathcal{L}(x, \alpha, \beta)$, $x$ is replaced by $(w, b)$ and $\beta$ is not needed. This gives:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right] \tag{6}$$

Let's find the dual form of the problem. To do so, we need to first minimize $\mathcal{L}(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$), to get $\theta_D$, which we'll do by setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero. We have:

$$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
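For completeness, the derivative with respect to $b$ (the other derivative mentioned above; I'm spelling it out here) yields the constraint that appears in the dual problem:

$$\frac{\partial}{\partial b}\, \mathcal{L}(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$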

I won't derive the rest of the dual problem, the KKT conditions, and so on. Substituting the expression for $w$ above into the discriminant function gives:

$$\begin{aligned} f(x) &= w^T x + b \\ &= \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b \\ &= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \end{aligned} \tag{7}$$

Hence, if we've found the $\alpha_i$'s, in order to make a prediction, we only have to calculate a quantity that depends on the inner product between $x$ and the points in the training set. Moreover, we saw earlier that the $\alpha_i$'s will all be 0 except for the support vectors. Thus, many of the terms in the sum above will be 0, and we really need to find only the inner products between $x$ and the support vectors (of which there is often only a small number) in order to calculate (7) and make our prediction.
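A minimal sketch of making a prediction with equation (7), assuming the $\alpha_i$'s, $b$, and the training data are already available (variable names are my own):

```python
import numpy as np

def predict(x, alphas, ys, X_train, b):
    # Dual-form SVM prediction: sign( sum_i alpha_i * y_i * <x_i, x> + b ).
    # Terms with alpha_i == 0 (non-support vectors) contribute nothing,
    # so in practice only the support vectors need to be kept around.
    f = sum(a * y * np.dot(x_i, x)
            for a, y, x_i in zip(alphas, ys, X_train) if a > 0) + b
    return np.sign(f)
```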

4. Kernels

Now let's bring $\phi$ back into the picture. Analogously, when $f(x) = w^T \phi(x) + b$, following the same steps as above we get:

$$\begin{aligned} f(x) &= w^T \phi(x) + b \\ &= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b \end{aligned} \tag{8}$$

This allows us to define the kernel function as:

$$K(x, x') = \langle \phi(x), \phi(x') \rangle \tag{9}$$

Besides the computational convenience mentioned at the end of the previous section, a kernel has another benefit: I no longer need to care what $\phi$ actually is; I only need to design $\langle \phi(x^{(i)}), \phi(x) \rangle$ itself. It's a bit like hiding the low-level implementation details.
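A small numerical check (my own example) that a kernel gives the same number as the explicit feature map, here for the homogeneous degree-2 polynomial kernel $K(x, x') = \langle x, x' \rangle^2$, whose implicit $\phi$ consists of all ordered products $x_i x_j$:

```python
import numpy as np

def phi(x):
    # Implicit feature map of the degree-2 homogeneous polynomial kernel:
    # all ordered products x_i * x_j (d^2 features for a d-dim input).
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

k_implicit = np.dot(x, z) ** 2       # kernel: O(d) work
k_explicit = np.dot(phi(x), phi(z))  # explicit mapping: O(d^2) work

print(k_implicit, k_explicit)        # both are 1024.0
```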

Finally, I think the name "kernel" should be read as "kernel of the discriminant function".

5. Kernel Examples

Popular choices for $K$ in the SVM literature are listed below (a small code sketch follows the list):

  • linear kernel: $K(x, x') = \langle x, x' \rangle$
    • equivalent to using no $\phi$ at all, i.e. $\phi(x) = x$
  • $d$th-degree polynomial kernel:
    • homogeneous: $K(x, x') = \langle x, x' \rangle^d$
    • inhomogeneous: $K(x, x') = (1 + \langle x, x' \rangle)^d$
  • Gaussian kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
  • Radial basis kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
  • Neural network kernel: $K(x, x') = \tanh(k_1 \langle x, x' \rangle + k_2)$
    • tanh is the hyperbolic tangent:
    • $\sinh(x) = \frac{e^x - e^{-x}}{2}$
    • $\cosh(x) = \frac{e^x + e^{-x}}{2}$
    • $\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
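Here is a short NumPy sketch of a few of these kernels (parameter defaults are my own placeholders):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=3, inhomogeneous=True):
    base = np.dot(x, z) + (1.0 if inhomogeneous else 0.0)
    return base ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def neural_network_kernel(x, z, k1=1.0, k2=0.0):
    return np.tanh(k1 * np.dot(x, z) + k2)
```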

5.2 Kernels for Sequences

Support Vector Machines and Kernels for Computational Biology covers this on p. 12; here is a brief write-up.

5.2.1 Kernels Describing $\ell$-mer Content

What we need to do is map a sequence to a vector in feature space. One way to design the feature coding:

  • Consider all dimers, ordered as in ACGT: $x_1$ is the count of AA, $x_2$ the count of AC, …, $x_{16}$ the count of TT
  • If we want to distinguish introns from exons, the coding can be: $x_1$ = count of intronic AA, …, $x_{16}$ = count of intronic TT, $x_{17}$ = count of exonic AA, …, $x_{32}$ = count of exonic TT
  • For example, for an intronic sequence ACT, only the intronic AC and intronic CT entries are 1 and everything else is 0. Such a vector is called the spectrum of the sequence

Call the function that maps a sequence to its $\ell$-mer spectrum $\Phi_{\text{spectrum}}(x)$; this gives us a spectrum kernel:

$$K_{\text{spectrum}}(x, x') = \langle \Phi_{\text{spectrum}}(x), \Phi_{\text{spectrum}}(x') \rangle \tag{10}$$
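A minimal sketch (my own implementation, ignoring the intron/exon split) of the dimer ($\ell = 2$) spectrum and the resulting spectrum kernel:

```python
from itertools import product

def spectrum(seq, ell=2):
    # Map a sequence to its ell-mer spectrum: a count vector over all 4^ell ell-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=ell)]
    counts = {k: 0 for k in kmers}
    for i in range(len(seq) - ell + 1):
        counts[seq[i:i + ell]] += 1
    return [counts[k] for k in kmers]

def spectrum_kernel(x, x_prime, ell=2):
    # K_spectrum(x, x') = <Phi_spectrum(x), Phi_spectrum(x')>
    return sum(a * b for a, b in zip(spectrum(x, ell), spectrum(x_prime, ell)))

print(spectrum("ACT"))                  # 1 at the AC and CT entries, 0 elsewhere
print(spectrum_kernel("ACGT", "ACGA"))  # shared dimers AC and CG -> 2
```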

Since the spectrum kernel allows no mismatches, when $\ell$ is sufficiently long the chance of observing common occurrences becomes small and the kernel will no longer perform well. This problem is alleviated if we use the mixed spectrum kernel:

$$K_{\text{mixed spectrum}}(x, x') = \sum_{d=1}^{\ell} \beta_d \, K_{d\text{-spectrum}}(x, x') \tag{11}$$

where $\beta_d$ is a weighting for the different substring lengths.

5.2.2 Kernels Using Positional Information

Analogous to Position Weight Matrices (PWMs), the idea is to analyze sequences of fixed length $L$ and consider substrings starting at each position $l = 1, \ldots, L$ separately, as implemented by the so-called weighted degree (WD) kernel:

$$K_{\text{weighted degree}}(x, x') = \sum_{l=1}^{L} \sum_{d=1}^{\ell} \beta_d \, K_{d\text{-spectrum}}(x[l:l+d], x'[l:l+d]) \tag{12}$$

where $x[l:l+d]$ is the substring of $x$ of length $d$ starting at position $l$. A suggested setting for $\beta_d$ is $\beta_d = 2\,\frac{\ell - d + 1}{\ell^2 + \ell}$.
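A rough sketch of how I read equation (12); note that for a substring of length exactly $d$, the $d$-spectrum kernel reduces to an indicator of whether the two substrings are equal. The boundary handling (stopping at $L - d + 1$) is my own choice:

```python
def wd_kernel(x, x_prime, ell=3):
    # Weighted degree kernel: compare substrings of length d = 1..ell at every
    # position l, weighted by beta_d = 2 * (ell - d + 1) / (ell**2 + ell).
    assert len(x) == len(x_prime)
    L = len(x)
    k = 0.0
    for d in range(1, ell + 1):
        beta = 2.0 * (ell - d + 1) / (ell ** 2 + ell)
        for l in range(L - d + 1):
            if x[l:l + d] == x_prime[l:l + d]:
                k += beta
    return k

print(wd_kernel("ACGT", "ACGA"))  # ~2.33 for these two toy sequences
```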
