
I never quite sorted this out before, so let me organize the ideas here.

Summarized from Support Vector Machines and Kernels for Computational Biology, CS229 Lecture note 3, and ESL.


1. Intro

To start with, we will be considering a linear classifier for a binary classification problem with labels $y \in \{-1, 1\}$ and features $x$. We will use parameters $w$, $b$ instead of $\theta$, and write our classifier as

$$h(x) = g(w^T x + b) \tag{1}$$

Here, $g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise.

We call

$$f(x) = w^T x + b \tag{2}$$

the discriminant function.

$f(x) = 0$ then defines our hyperplane; I won't go over the intuition here.

When it comes to the margin, there is one thing to be careful about: how the margin is measured. According to CS229 Lecture note 3, we actually have:

$$\text{functional margin} = \text{geometric margin} \times \|w\| \tag{3}$$

This is a bit hard to remember. Another way to put it: if $x$ is a support vector, then $f(x) = w^T x + b = \pm 1$ (this is in fact the functional margin), and the (geometric) margin equals $\frac{1}{\|w\|}$.
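A quick sanity check with made-up numbers: take $w = (3, 4)$ and a support vector $x$ with $w^T x + b = 1$. The functional margin is $1$, $\|w\| = 5$, and the geometric margin is $1/\|w\| = 0.2$, consistent with equation (3).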

A quick note on notation:

$$w^T x = \langle w, x \rangle, \qquad \|w\|^2 = \langle w, w \rangle$$

  • $w^T x = \langle w, x \rangle$ is called the inner product or dot product
  • $\|w\|$ is called the norm of the vector
  • $w / \|w\|$ is called the unit-length vector (its norm is 1)
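A tiny NumPy sketch of these quantities (my own illustration, not from the notes):

```python
import numpy as np

w = np.array([3.0, 4.0])
x = np.array([1.0, 2.0])

inner = w @ x              # inner product <w, x> = w^T x = 11.0
norm = np.linalg.norm(w)   # norm of w: sqrt(<w, w>) = 5.0
unit = w / norm            # unit-length vector; np.linalg.norm(unit) == 1.0

print(inner, norm, unit)
```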

2. The Non-Linear Case

There is a straightforward way of turning a linear classifier non-linear, or making it applicable to non-vectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function ϕ. The discriminant function then is

$$f(x) = w^T \phi(x) + b \tag{4}$$

Note that $f(x)$ is linear in the feature space defined by the mapping $\phi$; but viewed in the original input space it is a nonlinear function of $x$ if $\phi(x)$ is a nonlinear function.

This mapping $\phi$ can be quite complicated (e.g. $X = [x_1, x_2, x_3]$ and $\phi(X) = [x_1^2, \ldots, x_3^2]$), which makes computing $f(x)$ inconvenient.
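To make the inconvenience concrete, here is a small sketch (my own example, not from the sources) of an explicit degree-2 monomial map $\phi$; the number of mapped features grows quadratically with the input dimension, so forming $\phi(x)$ explicitly gets expensive:

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    # Explicit degree-2 monomial map: all products x_i * x_j with i <= j.
    return np.array([x[i] * x[j]
                     for i, j in combinations_with_replacement(range(len(x)), 2)])

x = np.array([1.0, 2.0, 3.0])
print(phi(x))  # 6 features for a 3-dim input: x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
# A d-dimensional input maps to d*(d+1)/2 features, so computing
# f(x) = w^T phi(x) + b requires materializing all of them.
```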

Kernels, on the other hand, claim that:

Kernel methods avoid this complexity by avoiding the step of explicitly mapping the data to a high dimensional feature-space.

Next, let's see how kernels manage to do this.

3. Enter Lagrange Duality

Let's set $\phi$ aside for now.

We want the maximum margin, which means maximizing $\frac{1}{\|w\|}$, or equivalently minimizing $\|w\|$. So the problem can be written as:

$$\begin{aligned} \min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\ \text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m \end{aligned} \tag{5}$$

Rewriting it slightly:

$$\begin{aligned} \min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\ \text{subject to} \quad & -y^{(i)}(w^T x^{(i)} + b) + 1 \le 0, \quad i = 1, \ldots, m \end{aligned} \tag{OPT}$$

Yes! There it is, (OPT)! Here $g_i(w) = -y^{(i)}(w^T x^{(i)} + b) + 1$ and there are no $h_i(w)$ equality constraints, so in the standard Lagrangian $\mathcal{L}(x, \alpha, \beta)$, $x$ is replaced by $(w, b)$ and $\beta$ is not needed. This gives:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right] \tag{6}$$

Let's find the dual form of the problem. To do so, we need to first minimize $\mathcal{L}(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$), to get $\theta_D$, which we'll do by setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero. We have:

$$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
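For completeness, the derivative with respect to $b$ (the other derivative mentioned above; I'm spelling it out here) yields the constraint that appears in the dual problem:

$$\frac{\partial}{\partial b}\, \mathcal{L}(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$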

I won't derive the rest of the dual problem, the KKT conditions, and so on. Substituting the expression for $w$ above into the discriminant function gives:

$$\begin{aligned} f(x) &= w^T x + b \\ &= \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b \\ &= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \end{aligned} \tag{7}$$

Hence, if we've found the $\alpha_i$'s, in order to make a prediction, we only have to calculate a quantity that depends on the inner product between $x$ and the points in the training set. Moreover, we saw earlier that the $\alpha_i$'s will all be 0 except for the support vectors. Thus, many of the terms in the sum above will be 0, and we really need to find only the inner products between $x$ and the support vectors (of which there is often only a small number) in order to calculate (7) and make our prediction.
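A minimal sketch of making a prediction with equation (7), assuming the $\alpha_i$'s, $b$, and the training data are already available (variable names are my own):

```python
import numpy as np

def predict(x, alphas, ys, X_train, b):
    # Dual-form SVM prediction: sign( sum_i alpha_i * y_i * <x_i, x> + b ).
    # Terms with alpha_i == 0 (non-support vectors) contribute nothing,
    # so in practice only the support vectors need to be kept around.
    f = sum(a * y * np.dot(x_i, x)
            for a, y, x_i in zip(alphas, ys, X_train) if a > 0) + b
    return np.sign(f)
```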

4. Kernels

Now let's bring $\phi$ back into the picture. Analogously, when $f(x) = w^T \phi(x) + b$, following the same steps as above we get:

$$\begin{aligned} f(x) &= w^T \phi(x) + b \\ &= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b \end{aligned} \tag{8}$$

This allows us to define the kernel function as:

$$K(x, x') = \langle \phi(x), \phi(x') \rangle \tag{9}$$

Besides the computational convenience mentioned at the end of the previous section, a kernel has another benefit: I no longer need to care what $\phi$ actually is; I only need to design $\langle \phi(x^{(i)}), \phi(x) \rangle$ itself. It's a bit like hiding the low-level implementation details.
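A small numerical check (my own example) that a kernel gives the same number as the explicit feature map, here for the homogeneous degree-2 polynomial kernel $K(x, x') = \langle x, x' \rangle^2$, whose implicit $\phi$ consists of all ordered products $x_i x_j$:

```python
import numpy as np

def phi(x):
    # Implicit feature map of the degree-2 homogeneous polynomial kernel:
    # all ordered products x_i * x_j (d^2 features for a d-dim input).
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

k_implicit = np.dot(x, z) ** 2       # kernel: O(d) work
k_explicit = np.dot(phi(x), phi(z))  # explicit mapping: O(d^2) work

print(k_implicit, k_explicit)        # both are 1024.0
```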

Finally, I think the name "kernel" should be read as "kernel of the discriminant function".

5. Kernel Examples

Popular choices for $K$ in the SVM literature are listed below (a small code sketch follows the list):

  • linear kernel: $K(x, x') = \langle x, x' \rangle$
    • equivalent to using no $\phi$ at all, i.e. $\phi(x) = x$
  • $d$th-degree polynomial kernel:
    • homogeneous: $K(x, x') = \langle x, x' \rangle^d$
    • inhomogeneous: $K(x, x') = (1 + \langle x, x' \rangle)^d$
  • Gaussian kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
  • Radial basis kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
  • Neural network kernel: $K(x, x') = \tanh(k_1 \langle x, x' \rangle + k_2)$
    • tanh is the hyperbolic tangent:
    • $\sinh(x) = \frac{e^x - e^{-x}}{2}$
    • $\cosh(x) = \frac{e^x + e^{-x}}{2}$
    • $\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
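Here is a short NumPy sketch of a few of these kernels (parameter defaults are my own placeholders):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=3, inhomogeneous=True):
    base = np.dot(x, z) + (1.0 if inhomogeneous else 0.0)
    return base ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def neural_network_kernel(x, z, k1=1.0, k2=0.0):
    return np.tanh(k1 * np.dot(x, z) + k2)
```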

5.2 Kernels for Sequences

Support Vector Machines and Kernels for Computational Biology covers this on p. 12; here is a brief write-up.

5.2.1 Kernels Describing $\ell$-mer Content

What we need to do is map a sequence to a vector in feature space. One way to design the feature coding:

  • Consider all dimers, ordered as in ACGT: $x_1$ is the count of AA, $x_2$ the count of AC, …, $x_{16}$ the count of TT
  • If we want to distinguish introns from exons, the coding can be: $x_1$ = count of intronic AA, …, $x_{16}$ = count of intronic TT, $x_{17}$ = count of exonic AA, …, $x_{32}$ = count of exonic TT
  • For example, for an intronic sequence ACT, only the intronic AC and intronic CT entries are 1 and everything else is 0. Such a vector is called the spectrum of the sequence

Call the function that maps a sequence to its $\ell$-mer spectrum $\Phi_{\text{spectrum}}(x)$; this gives us a spectrum kernel:

$$K_{\text{spectrum}}(x, x') = \langle \Phi_{\text{spectrum}}(x), \Phi_{\text{spectrum}}(x') \rangle \tag{10}$$
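A minimal sketch (my own implementation, ignoring the intron/exon split) of the dimer ($\ell = 2$) spectrum and the resulting spectrum kernel:

```python
from itertools import product

def spectrum(seq, ell=2):
    # Map a sequence to its ell-mer spectrum: a count vector over all 4^ell ell-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=ell)]
    counts = {k: 0 for k in kmers}
    for i in range(len(seq) - ell + 1):
        counts[seq[i:i + ell]] += 1
    return [counts[k] for k in kmers]

def spectrum_kernel(x, x_prime, ell=2):
    # K_spectrum(x, x') = <Phi_spectrum(x), Phi_spectrum(x')>
    return sum(a * b for a, b in zip(spectrum(x, ell), spectrum(x_prime, ell)))

print(spectrum("ACT"))                  # 1 at the AC and CT entries, 0 elsewhere
print(spectrum_kernel("ACGT", "ACGA"))  # shared dimers AC and CG -> 2
```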

Since the spectrum kernel allows no mismatches, when $\ell$ is sufficiently long the chance of observing common occurrences becomes small and the kernel will no longer perform well. This problem is alleviated if we use the mixed spectrum kernel:

$$K_{\text{mixed spectrum}}(x, x') = \sum_{d=1}^{\ell} \beta_d \, K_{d\text{-spectrum}}(x, x') \tag{11}$$

where $\beta_d$ is a weighting for the different substring lengths.

5.2.2 Kernels Using Positional Information

Analogous to Position Weight Matrices (PWMs), the idea is to analyze sequences of fixed length $L$ and consider substrings starting at each position $l = 1, \ldots, L$ separately, as implemented by the so-called weighted degree (WD) kernel:

$$K_{\text{weighted degree}}(x, x') = \sum_{l=1}^{L} \sum_{d=1}^{\ell} \beta_d \, K_{d\text{-spectrum}}(x[l:l+d], x'[l:l+d]) \tag{12}$$

where $x[l:l+d]$ is the substring of $x$ of length $d$ starting at position $l$. A suggested setting for $\beta_d$ is $\beta_d = 2\,\frac{\ell - d + 1}{\ell^2 + \ell}$.
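A rough sketch of how I read equation (12); note that for a substring of length exactly $d$, the $d$-spectrum kernel reduces to an indicator of whether the two substrings are equal. The boundary handling (stopping at $L - d + 1$) is my own choice:

```python
def wd_kernel(x, x_prime, ell=3):
    # Weighted degree kernel: compare substrings of length d = 1..ell at every
    # position l, weighted by beta_d = 2 * (ell - d + 1) / (ell**2 + ell).
    assert len(x) == len(x_prime)
    L = len(x)
    k = 0.0
    for d in range(1, ell + 1):
        beta = 2.0 * (ell - d + 1) / (ell ** 2 + ell)
        for l in range(L - d + 1):
            if x[l:l + d] == x_prime[l:l + d]:
                k += beta
    return k

print(wd_kernel("ACGT", "ACGA"))  # ~2.33 for these two toy sequences
```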
