
A summary of Flexible discriminative learning with structured output support vector machines, a very good tutorial.


1. Intro

Structured output SVMs extend SVMs to handle arbitrary output spaces, particularly ones with non-trivial structure (e.g. the space of poses, textual translations, sentences in a grammar, etc.).

Notation used in this tutorial:

  • $w^T x$ is the score
  • $\hat{y}(x;w)$ is essentially $h(x)$
  • $\operatorname{sign}(x)$ is $g(x)$; writing it as sign is the more standard choice
  • The bias $b$ is not used; simply $\hat{y}(x;w) = w^T x$
  • $\phi$ is called the feature map
    • With a feature map, the nature of the input $x$ is irrelevant (image, video, audio, …).
  • The optimisation problem uses the hinge loss $L$
    • For 0-1 classification, $L^{(i)}(w) = \max\{0,\ 1 - y^{(i)} (w^T x^{(i)})\}$
    • For Support Vector Regression (predicting the value of $y$ directly with $w^T x$), $L^{(i)}(w) = |y^{(i)} - w^T x^{(i)}|$; being first order, this is also called the L1 error
    • The hinge loss is a convex function
  • The optimisation problem is then to minimise $E(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L^{(i)}(w)$ (a minimal sketch follows this list)
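
As a quick illustration of this objective (my own minimal NumPy sketch, not code from the tutorial; the names `X`, `Y`, `lam` are assumptions), for the binary classification case:

```python
import numpy as np

def svm_objective(w, X, Y, lam):
    """E(w) = lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * w^T x_i).

    X: (n, d) inputs, Y: (n,) labels in {-1, +1}, lam: regularisation weight.
    """
    margins = Y * (X @ w)                    # y^(i) * w^T x^(i) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)   # per-sample hinge loss L^(i)(w)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

Each $L^{(i)}$ is convex in $w$, so $E(w)$ is convex and can be minimised with, e.g., (sub)gradient descent.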

2. SSVM

As stated earlier: In structured output learning, the discriminant function becomes a function $f(x,y)$ of both inputs and labels, and can be thought of as measuring the compatibility of the input $x$ with the output $y$. Here we define:

$$f(x,y) = w^T \Psi(x,y) \tag{1}$$

$\Psi$ is called the joint feature map.

Prediction with an SSVM then becomes a search problem:

$$\hat{y}(x;w) = \operatorname*{argmax}_{y \in \mathcal{Y}} w^T \Psi(x,y) \tag{2}$$

Doesn't this have a bit of a maximum-likelihood flavour? The expression has no probabilistic meaning, though.

The argmax on the right-hand side of the equation is called the Inference Problem. The efficiency of using a structured SVM (after learning) depends on how quickly the inference problem can be solved.
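
To make the inference problem concrete, here is a toy sketch of my own (not from the tutorial): a hypothetical multi-class setup where $\Psi(x,y)$ copies $x$ into the block indexed by the label $y$, and inference is a brute-force argmax over a small label set.

```python
import numpy as np

def joint_feature_map(x, y, n_classes):
    """Toy multi-class Psi(x, y): copy x into the block of the vector indexed by y."""
    psi = np.zeros(n_classes * x.size)
    psi[y * x.size:(y + 1) * x.size] = x
    return psi

def infer(x, w, n_classes):
    """Inference problem: argmax_y w^T Psi(x, y), solved by brute force over Y."""
    scores = [w @ joint_feature_map(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

w = np.zeros(3 * 2); w[2:4] = [1.0, 1.0]              # weights that favour class 1
print(infer(np.array([3.0, -1.0]), w, n_classes=3))   # -> 1
```

For genuinely structured outputs (poses, translations, parse trees, ...) the brute-force loop has to be replaced by a problem-specific search such as dynamic programming, and that choice is exactly what determines how fast prediction is.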

A standard SVM can easily be interpreted as a structured SVM:

  • output space: $y \in \mathcal{Y} = \{-1, +1\}$
  • joint feature map: $\Psi(x,y) = \frac{y}{2} x$
  • inference: $\hat{y}(x;w) = \operatorname*{argmax}_{y \in \{-1,+1\}} \frac{y}{2} w^T x = \operatorname{sign}(w^T x)$ (see the sketch below)
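
A small sanity check of this reduction (again my own sketch, not code from the tutorial): with $\Psi(x,y) = \frac{y}{2} x$ and $\mathcal{Y} = \{-1,+1\}$, brute-force inference recovers $\operatorname{sign}(w^T x)$.

```python
import numpy as np

def psi_binary(x, y):
    """Joint feature map of the standard binary SVM: Psi(x, y) = (y / 2) * x."""
    return 0.5 * y * x

def infer_binary(x, w):
    """argmax over y in {-1, +1} of w^T Psi(x, y)."""
    return max((-1, 1), key=lambda y: w @ psi_binary(x, y))

rng = np.random.default_rng(0)
w, x = rng.normal(size=5), rng.normal(size=5)
assert infer_binary(x, w) == np.sign(w @ x)  # agrees whenever w^T x != 0
```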

3. The surrogate loss

Analogous to the hinge loss, the SSVM also has a loss function $\Delta$, and the optimisation problem is written as minimising $E(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \Delta(y^{(i)}, \hat{y}(x^{(i)};w))$, but this is a non-convex problem. The usual approach is to find a convex surrogate loss (still written as $L(w)$) that approximates $\Delta(y, \hat{y})$.

The key to the success of structured SVMs is the existence of good surrogates. The aim is to make minimising $L^{(i)}(w)$ have the same effect as minimising $\Delta(y^{(i)}, \hat{y}(x^{(i)};w))$; in more precise terms, $L^{(i)}(w) \geq \Delta(y^{(i)}, \hat{y}(x^{(i)};w))$ should hold as a tight bound. Tightness is a demanding requirement, though, so in practice a workable surrogate is usually enough. Commonly used surrogates are:

  • Margin rescaling surrogate: $L^{(i)}(w) = \sup_{y \in \mathcal{Y}} \Delta(y^{(i)}, y) + \left[ w^T \Psi(x^{(i)}, y) - w^T \Psi(x^{(i)}, y^{(i)}) \right]$ (see the sketch after this list)
    • $\sup$ denotes the supremum (least upper bound)
  • Slack rescaling surrogate: $L^{(i)}(w) = \sup_{y \in \mathcal{Y}} \Delta(y^{(i)}, y) \left[ 1 + w^T \Psi(x^{(i)}, y) - w^T \Psi(x^{(i)}, y^{(i)}) \right]$
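
As an illustration of how the margin rescaling surrogate is evaluated (a sketch under my own assumptions, for a small finite $\mathcal{Y}$ where the sup becomes a max over candidates; `psi` and `delta` stand for whatever joint feature map and task loss the problem defines):

```python
import numpy as np

def margin_rescaling_loss(w, x_i, y_i, psi, delta, label_set):
    """L^(i)(w) = max_y [Delta(y_i, y) + w^T Psi(x_i, y)] - w^T Psi(x_i, y_i)."""
    loss_augmented = [delta(y_i, y) + w @ psi(x_i, y) for y in label_set]
    return max(loss_augmented) - w @ psi(x_i, y_i)

# With the 0-1 loss and the binary Psi(x, y) = (y / 2) * x from Section 2,
# this reduces to the ordinary hinge loss max(0, 1 - y_i * w^T x_i).
w, x_i, y_i = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1
psi = lambda x, y: 0.5 * y * x
delta = lambda a, b: float(a != b)
print(margin_rescaling_loss(w, x_i, y_i, psi, delta, (-1, 1)))  # 2.5 = 1 - (-1.5)
```

The inner maximisation is the so-called loss-augmented inference problem, which has to be solved once per training example during learning.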

The tutorial goes on to cover the Cutting plane algorithm, BMRM (cutting planes with a regulariser), and a Matlab implementation, which I won't expand on here.
