
First, let's be clear about two points:

  1. Jensen’s Inequality is a property of convex functions. Convex functions are themselves defined via Jensen’s Inequality, so the two are essentially the same thing.
  2. In different fields, expanding Jensen’s Inequality in different ways yields new inequalities specific to those fields; in this sense Jensen’s Inequality can be viewed as a master template.
    • The basic recipe: the field has a convex function; expand it with Jensen’s Inequality to obtain an inequality of the form "quantity A ≤ quantity B" within that field.

1. Convex Function / Jensen’s Inequality

Let $X$ be a convex set in a real vector space and let $f: X \to \mathbb{R}$ be a function.

  • Definition of convex functions
    • $f$ is called convex if $f$ satisfies Jensen’s Inequality (a quick numerical check follows this list).
  • Jensen’s Inequality
    • $\forall x_1, x_2 \in X, \forall t \in [0,1]: f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2)$
  • Definition of concave functions
    • $f$ is said to be concave if $-f$ is convex.
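
A minimal Python sketch of the definition above, assuming NumPy and using $f(x) = x^2$ as a known convex function with arbitrarily chosen points (all choices here are for illustration only):

```python
import numpy as np

def f(x):
    return x ** 2  # a known convex function

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-10, 10, size=2)  # two arbitrary points in X = R

for t in np.linspace(0.0, 1.0, 11):
    lhs = f(t * x1 + (1 - t) * x2)
    rhs = t * f(x1) + (1 - t) * f(x2)
    # Jensen's Inequality: f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
    assert lhs <= rhs + 1e-12
```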

2. Jensen’s Inequality on Expectations

If $X$ is a discrete random variable taking value $x_i$ with probability $p_i$, and $f$ is a convex function:

$$f(p_1 x_1 + p_2 x_2 + \cdots + p_n x_n) \leq p_1 f(x_1) + p_2 f(x_2) + \cdots + p_n f(x_n)$$

The LHS is essentially $f(E(X))$ and the RHS is $E(f(X))$, which together give

$$f(E(X)) \leq E(f(X))$$
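
The same kind of numerical sanity check works here, assuming a small made-up discrete distribution and again $f(x) = x^2$:

```python
import numpy as np

f = lambda x: x ** 2                 # convex function (arbitrary choice)
x = np.array([1.0, 2.0, 3.0, 4.0])   # values x_i of the discrete random variable X
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities p_i (sum to 1)

lhs = f(np.dot(p, x))   # f(E(X)) = f(sum_i p_i x_i)
rhs = np.dot(p, f(x))   # E(f(X)) = sum_i p_i f(x_i)
assert lhs <= rhs       # here 9.0 <= 10.0
```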

3. Gibbs’ Inequality

Let $p = \{p_1, p_2, \dots, p_n\}$ be the true probability distribution of $X$ and $q = \{q_1, q_2, \dots, q_n\}$ be another probability distribution (you can think of it as a hypothesized distribution of $X$). Construct a random variable $Y$ with $Y(x) = \frac{q(x)}{p(x)}$. Given that $f(y) = -\log(y)$ is a convex function, we have:

$$f(E(Y)) \leq E(f(Y))$$

Therefore:

$$\begin{aligned}
-\log \sum_i \left( p_i \frac{q_i}{p_i} \right) &\leq \sum_i p_i \left( -\log \frac{q_i}{p_i} \right) \\
-\log 1 &\leq \sum_i p_i \log \frac{p_i}{q_i} \\
0 &\leq \sum_i p_i \log \frac{p_i}{q_i}
\end{aligned}$$

If we use $\log_2$, the RHS can be called the Kullback–Leibler divergence, or the relative entropy of $p$ with respect to $q$:

$$D_{\text{KL}}(p \| q) \equiv \sum_i p_i \log_2 \frac{p_i}{q_i} \geq 0$$

This inequality is known as Gibbs’ Inequality.
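
The derivation can be checked numerically as well. The sketch below uses two made-up distributions and mirrors the steps above: $E(Y) = \sum_i p_i \frac{q_i}{p_i} = 1$, so the LHS is $-\log_2 1 = 0$:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution of X (made-up example)
q = np.array([0.2, 0.3, 0.5])   # another distribution over the same outcomes

y = q / p                       # Y(x) = q(x) / p(x)
lhs = -np.log2(np.dot(p, y))    # -log2(E(Y)) = -log2(1) = 0
rhs = np.dot(p, -np.log2(y))    # E(-log2(Y)) = sum_i p_i log2(p_i / q_i)
assert lhs <= rhs               # Gibbs' Inequality: D_KL(p || q) >= 0
```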

We can rearrange further:

$$D_{\text{KL}}(p \| q) \equiv \sum_i p_i \log_2 \frac{p_i}{q_i} = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2 q_i = -H(p) + H(p,q) \geq 0$$
  • $H(p)$ is the entropy of distribution $p$
  • $H(p,q)$ is the cross entropy of distributions $p$ and $q$
  • $H(p) \leq H(p,q)$: the information entropy of a distribution $p$ is less than or equal to its cross entropy with any other distribution $q$ (see the sketch after this list)
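
A minimal sketch, reusing the made-up `p` and `q` from above, to confirm the decomposition $D_{\text{KL}}(p \| q) = -H(p) + H(p,q)$ numerically:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # same made-up distributions as before
q = np.array([0.2, 0.3, 0.5])

H_p  = -np.dot(p, np.log2(p))      # entropy H(p)
H_pq = -np.dot(p, np.log2(q))      # cross entropy H(p, q)
D_kl =  np.dot(p, np.log2(p / q))  # D_KL(p || q)

assert np.isclose(D_kl, -H_p + H_pq)   # D_KL(p || q) = -H(p) + H(p, q)
assert H_p <= H_pq                     # hence H(p) <= H(p, q)
```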

Interpretations of $D_{\text{KL}}(p \| q)$

  • In the context of machine learning, $D_{\text{KL}}(p \| q)$ is often called the information gain achieved if $p$ is used instead of $q$ (this is why it’s also called the relative entropy of $p$ with respect to $q$).
  • In the context of coding theory, $D_{\text{KL}}(p \| q)$ can be construed as the expected number of extra bits required to code samples from $p$ using a code optimized for $q$ rather than the code optimized for $p$.
  • In the context of Bayesian inference, $D_{\text{KL}}(p \| q)$ is the amount of information lost when $q$ is used to approximate $p$.
  • Simply put, $D_{\text{KL}}(p \| q)$ measures how "close" two distributions $p$ and $q$ are (see the sketch after this list):
    • If $p = q$, then $D_{\text{KL}}(p \| q) = 0$
    • The more $p$ and $q$ differ, the larger $D_{\text{KL}}(p \| q)$ becomes
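
A small sketch to illustrate this behavior, with a hypothetical helper `kl` (not from any library) and made-up distributions:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) in bits; assumes every entry of p and q is strictly positive
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.5, 0.3, 0.2]                # made-up "true" distribution
print(kl(p, p))                    # 0.0          -- identical distributions
print(kl(p, [0.4, 0.3, 0.3]))      # small value  -- q close to p
print(kl(p, [0.05, 0.05, 0.9]))    # larger value -- q far from p
```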
