
A summary of Flexible discriminative learning with structured output support vector machines, a very good tutorial.


1. Intro

Structured output SVMs extend SVMs to handle arbitrary output spaces, particularly ones with non-trivial structure (e.g. the space of poses, textual translations, sentences in a grammar, etc.).

Notation used in this tutorial:

  • $w^T x$ is the score
  • $\hat{y}(x;w)$ is essentially $h(x)$
  • $\operatorname{sign}(x)$ is $g(x)$; writing it as sign is the more standard choice
  • The bias $b$ is not used; simply $\hat{y}(x;w) = w^T x$
  • $\phi$ is called the feature map
    • With a feature map, the nature of the input $x$ is irrelevant (image, video, audio, …).
  • The optimisation problem uses the hinge loss $L$
    • For 0-1 classification, $L^{(i)}(w) = \max\{0,\ 1 - y^{(i)} (w^T x^{(i)})\}$
    • For Support Vector Regression (predicting the value of $y$ directly with $w^T x$), $L^{(i)}(w) = |y^{(i)} - w^T x^{(i)}|$; being first order, this is also called the L1 error
    • The hinge loss is a convex function
  • The optimisation problem is then to minimise $E(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L^{(i)}(w)$ (a minimal sketch follows this list)
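
As a quick illustration of this objective (my own minimal NumPy sketch, not code from the tutorial; the names `X`, `Y`, `lam` are assumptions), for the binary classification case:

```python
import numpy as np

def svm_objective(w, X, Y, lam):
    """E(w) = lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * w^T x_i).

    X: (n, d) inputs, Y: (n,) labels in {-1, +1}, lam: regularisation weight.
    """
    margins = Y * (X @ w)                    # y^(i) * w^T x^(i) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)   # per-sample hinge loss L^(i)(w)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

Each $L^{(i)}$ is convex in $w$, so $E(w)$ is convex and can be minimised with, e.g., (sub)gradient descent.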

2. SSVM

As stated earlier: In structured output learning, the discriminant function becomes a function $f(x,y)$ of both inputs and labels, and can be thought of as measuring the compatibility of the input $x$ with the output $y$. Here we define:

$$f(x,y) = w^T \Psi(x,y) \tag{1}$$

$\Psi$ is called the joint feature map.

Prediction with an SSVM then becomes a search problem:

$$\hat{y}(x;w) = \operatorname*{argmax}_{y \in \mathcal{Y}} w^T \Psi(x,y) \tag{2}$$

Doesn't this have a bit of a maximum-likelihood flavour? The expression has no probabilistic meaning, though.

The argmax on the right-hand side of the equation is called the Inference Problem. The efficiency of using a structured SVM (after learning) depends on how quickly the inference problem can be solved.
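
To make the inference problem concrete, here is a toy sketch of my own (not from the tutorial): a hypothetical multi-class setup where $\Psi(x,y)$ copies $x$ into the block indexed by the label $y$, and inference is a brute-force argmax over a small label set.

```python
import numpy as np

def joint_feature_map(x, y, n_classes):
    """Toy multi-class Psi(x, y): copy x into the block of the vector indexed by y."""
    psi = np.zeros(n_classes * x.size)
    psi[y * x.size:(y + 1) * x.size] = x
    return psi

def infer(x, w, n_classes):
    """Inference problem: argmax_y w^T Psi(x, y), solved by brute force over Y."""
    scores = [w @ joint_feature_map(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

w = np.zeros(3 * 2); w[2:4] = [1.0, 1.0]              # weights that favour class 1
print(infer(np.array([3.0, -1.0]), w, n_classes=3))   # -> 1
```

For genuinely structured outputs (poses, translations, parse trees, ...) the brute-force loop has to be replaced by a problem-specific search such as dynamic programming, and that choice is exactly what determines how fast prediction is.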

A standard SVM can easily be interpreted as a structured SVM:

  • output space: $y \in \mathcal{Y} = \{-1, +1\}$
  • joint feature map: $\Psi(x,y) = \frac{y}{2} x$
  • inference: $\hat{y}(x;w) = \operatorname*{argmax}_{y \in \{-1,+1\}} \frac{y}{2} w^T x = \operatorname{sign}(w^T x)$ (see the sketch below)
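
A small sanity check of this reduction (again my own sketch, not code from the tutorial): with $\Psi(x,y) = \frac{y}{2} x$ and $\mathcal{Y} = \{-1,+1\}$, brute-force inference recovers $\operatorname{sign}(w^T x)$.

```python
import numpy as np

def psi_binary(x, y):
    """Joint feature map of the standard binary SVM: Psi(x, y) = (y / 2) * x."""
    return 0.5 * y * x

def infer_binary(x, w):
    """argmax over y in {-1, +1} of w^T Psi(x, y)."""
    return max((-1, 1), key=lambda y: w @ psi_binary(x, y))

rng = np.random.default_rng(0)
w, x = rng.normal(size=5), rng.normal(size=5)
assert infer_binary(x, w) == np.sign(w @ x)  # agrees whenever w^T x != 0
```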

3. The surrogate loss

Analogous to the hinge loss, the SSVM also has a loss function $\Delta$, and the optimisation problem is written as minimising $E(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \Delta(y^{(i)}, \hat{y}(x^{(i)};w))$, but this is a non-convex problem. The usual approach is to find a convex surrogate loss (still written as $L(w)$) that approximates $\Delta(y, \hat{y})$.

The key to the success of structured SVMs is the existence of good surrogates. The aim is to make minimising $L^{(i)}(w)$ have the same effect as minimising $\Delta(y^{(i)}, \hat{y}(x^{(i)};w))$; in more precise terms, $L^{(i)}(w) \geq \Delta(y^{(i)}, \hat{y}(x^{(i)};w))$ should hold as a tight bound. Tightness is a demanding requirement, though, so in practice a workable surrogate is usually enough. Commonly used surrogates are:

  • Margin rescaling surrogate: $L^{(i)}(w) = \sup_{y \in \mathcal{Y}} \Delta(y^{(i)}, y) + \left[ w^T \Psi(x^{(i)}, y) - w^T \Psi(x^{(i)}, y^{(i)}) \right]$ (see the sketch after this list)
    • $\sup$ denotes the supremum (least upper bound)
  • Slack rescaling surrogate: $L^{(i)}(w) = \sup_{y \in \mathcal{Y}} \Delta(y^{(i)}, y) \left[ 1 + w^T \Psi(x^{(i)}, y) - w^T \Psi(x^{(i)}, y^{(i)}) \right]$
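
As an illustration of how the margin rescaling surrogate is evaluated (a sketch under my own assumptions, for a small finite $\mathcal{Y}$ where the sup becomes a max over candidates; `psi` and `delta` stand for whatever joint feature map and task loss the problem defines):

```python
import numpy as np

def margin_rescaling_loss(w, x_i, y_i, psi, delta, label_set):
    """L^(i)(w) = max_y [Delta(y_i, y) + w^T Psi(x_i, y)] - w^T Psi(x_i, y_i)."""
    loss_augmented = [delta(y_i, y) + w @ psi(x_i, y) for y in label_set]
    return max(loss_augmented) - w @ psi(x_i, y_i)

# With the 0-1 loss and the binary Psi(x, y) = (y / 2) * x from Section 2,
# this reduces to the ordinary hinge loss max(0, 1 - y_i * w^T x_i).
w, x_i, y_i = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1
psi = lambda x, y: 0.5 * y * x
delta = lambda a, b: float(a != b)
print(margin_rescaling_loss(w, x_i, y_i, psi, delta, (-1, 1)))  # 2.5 = 1 - (-1.5)
```

The inner maximisation is the so-called loss-augmented inference problem, which has to be solved once per training example during learning.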

The tutorial goes on to cover the Cutting plane algorithm, BMRM (cutting planes with a regulariser), and a Matlab implementation, which I won't expand on here.
