Digest of Information Theory, Inference, and Learning Algorithms
1. Introduction to Information Theory
1.1 How can we achieve perfect communication over an imperfect, noisy communication channel?
In some cases, if we transmit data, e.g., a string of bits, over the channel, there is some probability that the received message will not be identical to the transmitted message.
E.g.
- modem → phone line → modem
- Galileo → radio waves → Earth
- parent cell → DNA replication → daughter cell
- RAM → hard drive → RAM
N.B. Of course, there are also cases where noise is introduced deliberately, e.g., converting a 720p video to 480p.
Consider the 1-bit case: denote the bit sent as $x$ and the bit received as $y$. With flip probability $f$:

- √: $P(y = 0 \mid x = 0) = 1 - f$
- √: $P(y = 1 \mid x = 1) = 1 - f$
- ×: $P(y = 1 \mid x = 0) = f$
- ×: $P(y = 0 \mid x = 1) = f$
N.B. This is a Binary Symmetric Channel, a.k.a. BSC. In the BSC model, a transmitter wishes to send a bit (a zero or a one), and the receiver receives a bit, which may have been flipped by the channel.
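To make the channel model concrete, here is a minimal Python sketch of a BSC (the function name `bsc` and the chosen noise level are mine, purely for illustration):

```python
import random

random.seed(0)  # reproducible sketch

def bsc(bits, f):
    """Pass a sequence of 0/1 bits through a binary symmetric channel:
    each bit is flipped independently with probability f."""
    return [b ^ (random.random() < f) for b in bits]

# Send 10,000 zero bits through a BSC with noise level f = 0.1;
# roughly 10% of them should come back flipped.
received = bsc([0] * 10_000, f=0.1)
print(sum(received) / len(received))  # ~0.1
```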
A useful disk drive would flip no bits at all in its entire lifetime. If we expect to read and write a gigabyte per day for ten years, we require a bit error probability of the order of $10^{-15}$, or smaller. There are two approaches to this goal.
- The physical solution: to improve the physical characteristics of the communication channel to reduce its error probability.
  - In short: improve the manufacturing quality of the channel itself.
  - The drawback is that this increases the physical cost of the channel.
- The ‘system’ solution: we accept the given noisy channel as it is and add communication systems to it so that we can detect and correct the errors introduced by the channel.
  - We add an encoder before the channel and a decoder after it.
    - The encoder encodes the source message $\mathbf{s}$ into a transmitted message $\mathbf{t}$, adding redundancy to $\mathbf{s}$ in some way.
    - The channel adds noise to $\mathbf{t}$, yielding a received message $\mathbf{r}$.
    - The decoder uses the known redundancy introduced by the encoder to infer both the original signal $\mathbf{s}$ and the added noise.
  - The drawback is the added computational cost of the encoding/decoding system.
Information theory is concerned with the theoretical limitations and potentials of such systems. ‘What is the best error-correcting performance we could achieve?’
Coding theory is concerned with the creation of practical encoding and decoding systems.
1.2 Error-correcting codes for the BSC
Repetition codes
A straightforward idea is to repeat every bit of the message a prearranged number of times – for example, three times:
| Source sequence $s$ | Transmitted sequence $t$ |
| --- | --- |
| 0 | 000 |
| 1 | 111 |

We call this repetition code $R_3$.
Now the question: how do we decode?

Suppose the received message is $\mathbf{r} = r_1 r_2 r_3$. We can spell out the posterior probability of the two alternatives thus:

$$P(s \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s)\, P(s)}{P(r_1 r_2 r_3)}$$

- $P(s)$ is the prior probability of $s$.
- $P(r_1 r_2 r_3 \mid s)$ is the likelihood of $s$.
- We assume that the prior probabilities are equal: $P(s = 0) = P(s = 1) = 0.5$; then maximizing the posterior probability $P(s \mid \mathbf{r})$ is equivalent to maximizing the likelihood $P(\mathbf{r} \mid s)$.
- And we assume that this BSC has noise level $f < 0.5$.

So the likelihood is

$$P(\mathbf{r} \mid s) = P(\mathbf{r} \mid \mathbf{t}(s)) = \prod_{n=1}^{N} P(r_n \mid t_n(s)),$$

where $N = 3$ is the number of transmitted bits per source bit, $\mathbf{t}(s)$ is the codeword for $s$, and

$$P(r_n \mid t_n) = \begin{cases} 1 - f & \text{if } r_n = t_n \\ f & \text{if } r_n \neq t_n \end{cases}$$
Thus the likelihood ratio for the two hypotheses is

$$\frac{P(\mathbf{r} \mid s = 1)}{P(\mathbf{r} \mid s = 0)} = \prod_{n=1}^{N} \frac{P(r_n \mid t_n(1))}{P(r_n \mid t_n(0))},$$

where each factor $\frac{P(r_n \mid t_n(1))}{P(r_n \mid t_n(0))}$ equals $\frac{1-f}{f}$ if $r_n = 1$ and $\frac{f}{1-f}$ if $r_n = 0$.

We guess $\hat{s} = 1$ if the likelihood ratio is greater than 1, and $\hat{s} = 0$ otherwise.

Let's define $\gamma \equiv \frac{1-f}{f}$. Since $f < 0.5$, we have $\gamma > 1$, so every received 1 multiplies the likelihood ratio by $\gamma$ and every received 0 divides it by $\gamma$.

So the decoding strategy is simply: count the 1s and 0s in $\mathbf{r}$ and go with the majority vote.

Moreover, this decoder also locates the noise: if a received bit $r_n$ differs from the decoded $\hat{s}$, then bit $n$ was presumably flipped by the channel.
Using this majority-vote decoder, how much is the error probability reduced? See Exercise 1.2.
Exercise 1.2. Show that the error probability is reduced by the use of $R_3$.

Assume the decoded bit-error probability is $p_b$; the previous (unprotected) noise level is $f$.

Proof: a decoding error occurs only when two or three of the three received bits are flipped, so

$$p_b = 3f^2(1 - f) + f^3 = 3f^2 - 2f^3,$$

which is smaller than $f$ for any $0 < f < 0.5$. For $f = 0.1$, $p_b \approx 0.03$.
Exercise 1.3. Derive the probability of error of $R_N$, the repetition code with $N$ repetitions (for odd $N$).
(to be added)
The repetition code $R_3$ therefore reduces the probability of error, as desired, but at a cost: the rate of information transfer drops by a factor of three, to $R = 1/3$.
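As a quick sanity check of Exercise 1.2, here is a small Monte Carlo sketch (the helper names `encode_r3` and `decode_r3` are mine): it encodes random source bits with $R_3$, passes them through a BSC with $f = 0.1$, decodes by majority vote, and compares the empirical bit-error rate with $3f^2 - 2f^3 \approx 0.028$:

```python
import random

random.seed(0)
f, n_bits = 0.1, 100_000

def encode_r3(bit):
    return [bit] * 3                    # 0 -> 000, 1 -> 111

def decode_r3(block):
    return int(sum(block) >= 2)         # majority vote

errors = 0
for _ in range(n_bits):
    s = random.randint(0, 1)
    t = encode_r3(s)
    r = [b ^ (random.random() < f) for b in t]   # BSC flips each bit with prob. f
    errors += (decode_r3(r) != s)

print("empirical p_b:", errors / n_bits)         # ~0.028
print("theory 3f^2 - 2f^3:", 3 * f**2 - 2 * f**3)
```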
Block codes – the Hamming code
We would like to communicate with tiny probability of error and at a substantial rate. Can we improve on repetition codes? What if we add redundancy to blocks of data instead of encoding one bit at a time? We now study a simple block code.
A block code converts a sequence of $K$ source bits $\mathbf{s}$ into a transmitted sequence $\mathbf{t}$ of $N$ bits, with $N > K$ so that redundancy is added. In a linear block code, the extra $N - K$ bits are linear functions of the original $K$ bits; these extra bits are called parity-check bits.
- parity: (of a number) the fact of being even or odd.
Let's define the $(7,4)$ Hamming code: it transmits $N = 7$ bits for every $K = 4$ source bits. The first four transmitted bits $t_1 t_2 t_3 t_4$ are set equal to the source bits $s_1 s_2 s_3 s_4$, and the three parity-check bits are $t_5 = s_1 \oplus s_2 \oplus s_3$, $t_6 = s_2 \oplus s_3 \oplus s_4$, $t_7 = s_1 \oplus s_3 \oplus s_4$.
N.B. Here $\oplus$ denotes modulo-2 addition (XOR).

You can also understand it from another angle: place $s_1, \dots, s_4$ in the overlapping regions of three intersecting circles; each parity bit is then chosen so that the number of 1s inside its circle is even.
Because the Hamming code is a linear code, it can be written compactly in terms of matrices as follows. The transmitted codeword $\mathbf{t}$ is obtained from the source sequence $\mathbf{s}$ by a linear operation,

$$\mathbf{t} = \mathbf{G}^{\mathsf{T}} \mathbf{s},$$

where $\mathbf{G}$ is the generator matrix of the code,

$$\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{bmatrix},$$

and the encoding operation uses modulo-2 arithmetic ($1 + 1 = 0$, $0 + 1 = 1$, etc.).
N.B. Here $\mathbf{s}$ and $\mathbf{t}$ are column vectors.

I think you can view $\mathbf{G}^{\mathsf{T}}$ as $\begin{bmatrix} \mathbf{I}_4 \\ \mathbf{P} \end{bmatrix}$: the top $4 \times 4$ identity block copies the source bits, and the bottom $3 \times 4$ block $\mathbf{P}$ produces the parity-check bits.
Note that some textbooks treat $\mathbf{s}$ and $\mathbf{t}$ as row vectors and write the encoding as $\mathbf{t} = \mathbf{s} \mathbf{G}$ instead.
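To see the matrix form in action, here is a minimal NumPy sketch of the $(7,4)$ encoder (the names `G_T` and `hamming74_encode` are mine); it reproduces the parity equations above, e.g., the source block 1000 encodes to 1000101:

```python
import numpy as np

# Generator matrix G^T of the (7,4) Hamming code (column-vector convention):
# the top 4x4 identity copies the source bits, the bottom rows add parity checks.
G_T = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
])

def hamming74_encode(s):
    """Encode 4 source bits into a 7-bit codeword using modulo-2 arithmetic."""
    return (G_T @ np.asarray(s)) % 2

print(hamming74_encode([1, 0, 0, 0]))  # -> [1 0 0 0 1 0 1]
```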
Syndrome decoding for the Hamming code
(to be added)
General view of decoding for linear codes: syndrome decoding
(to be added)
Summary of the Hamming code’s properties
(to be added)
Summary of codes’ performances
(to be added)
1.3 What performance can the best codes achieve?
There seems to be a trade-off between the decoded bit-error probability $p_b$ (which we would like to reduce) and the rate $R$ (which we would like to keep large).

Noisy-channel coding theorem: For any channel, there exist codes that make it possible to communicate with arbitrarily small probability of error $p_b$ at non-zero rates.

The maximum rate at which communication is possible with arbitrarily small $p_b$ is called the capacity of the channel.

A binary symmetric channel with noise level $f$ has capacity $C(f) = 1 - H_2(f)$, where $H_2(f) = f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f}$ is the binary entropy function; for example, the channel with $f = 0.1$ has capacity $C \approx 0.53$.
What performance are you trying to achieve? $10^{-15}$? You don’t need sixty disk drives – you can get that performance with just two disk drives (since a rate of $1/2$ is less than the capacity $C \approx 0.53$). And if you want $p_b = 10^{-18}$ or $10^{-24}$ or anything, you can get there with two disk drives too!
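Just to check the numbers quoted above, here is a tiny Python computation of the binary entropy $H_2(f)$ and the capacity $C(f) = 1 - H_2(f)$ for $f = 0.1$ (the function name `H2` is mine):

```python
from math import log2

def H2(f):
    """Binary entropy function, in bits."""
    return f * log2(1 / f) + (1 - f) * log2(1 / (1 - f))

f = 0.1
print(H2(f))      # ~0.469
print(1 - H2(f))  # capacity C ~ 0.531 > 1/2, so a rate-1/2 code suffices
```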
1.4 Summary
(to be added)
2. Probability, Entropy, and Inference
This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between
- the song,
- the name of the song, and
- what the name of the song was called (Carroll, 1998),
we will sometimes need to be careful to distinguish between
- a random variable,
- the value of the random variable, and
- the proposition that asserts that the random variable has a particular value.
2.1 Probabilities and ensembles
An ensemble $X$ is a triple $(x, \mathcal{A}_X, \mathcal{P}_X)$, where the outcome $x$ is the value of a random variable, which takes on one of a set of possible values $\mathcal{A}_X = \{a_1, a_2, \dots, a_I\}$, having probabilities $\mathcal{P}_X = \{p_1, p_2, \dots, p_I\}$, with $P(x = a_i) = p_i$, $p_i \geq 0$ and $\sum_{a_i \in \mathcal{A}_X} P(x = a_i) = 1$.

The name $\mathcal{A}$ means ‘alphabet’.
A joint ensemble $XY$ is an ensemble in which each outcome is an ordered pair $x, y$ with $x \in \mathcal{A}_X = \{a_1, \dots, a_I\}$ and $y \in \mathcal{A}_Y = \{b_1, \dots, b_J\}$.

We call $P(x, y)$ the joint probability of $x$ and $y$.

N.B. In a joint ensemble $XY$ the two variables are not necessarily independent.
We can obtain the marginal probability $P(x)$ from the joint probability $P(x, y)$ by summation:

$$P(x = a_i) \equiv \sum_{y \in \mathcal{A}_Y} P(x = a_i, y)$$

Similarly, using briefer notation, the marginal probability of $y$ is:

$$P(y) \equiv \sum_{x \in \mathcal{A}_X} P(x, y)$$
N.B. Whenever we talk about a marginal, remember that the joint comes first: the marginal is derived from the joint distribution.
Consider the set of bigrams $xy$ occurring in an English-language document.

- The probability $P(y \mid x = \mathtt{q})$ is the probability distribution of the second letter given that the first letter is a $\mathtt{q}$.
- The probability $P(x \mid y = \mathtt{u})$ is the probability distribution of the first letter given that the second letter is a $\mathtt{u}$.

N.B. Pay attention to this way of phrasing it!
Product rule or chain rule:

$$P(x, y \mid \mathcal{H}) = P(x \mid y, \mathcal{H})\, P(y \mid \mathcal{H}) = P(y \mid x, \mathcal{H})\, P(x \mid \mathcal{H})$$

N.B. $\mathcal{H}$ denotes the assumptions on which the probabilities are based.
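As a tiny numerical illustration of the definitions in this section (the joint table below is made up, not from the book), the sketch builds a $2 \times 2$ joint distribution, computes both marginals by summation, and checks the product rule $P(x, y) = P(x \mid y)\,P(y)$:

```python
# A made-up joint distribution P(x, y) over x in {0,1}, y in {0,1}.
P = {(0, 0): 0.30, (0, 1): 0.10,
     (1, 0): 0.20, (1, 1): 0.40}

# Marginals by summation over the other variable.
P_x = {x: sum(P[(x, y)] for y in (0, 1)) for x in (0, 1)}
P_y = {y: sum(P[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Conditional P(x | y) from the definition P(x, y) / P(y).
P_x_given_y = {(x, y): P[(x, y)] / P_y[y] for x in (0, 1) for y in (0, 1)}

# Product rule check: P(x | y) * P(y) recovers the joint.
for (x, y), p in P.items():
    assert abs(P_x_given_y[(x, y)] * P_y[y] - p) < 1e-12

print(P_x, P_y)  # {0: 0.4, 1: 0.6} {0: 0.5, 1: 0.5}
```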
2.2 The meaning of probability
Probabilities can be used in two ways.
- Probabilities can describe frequencies of outcomes in random experiments
- but giving noncircular definitions of the terms ‘frequency’ and ‘random’ is a challenge
- Probabilities can also be used, more generally, to describe degrees of belief in propositions that do not involve random variables, for example
- ‘the probability that Mr. S. was the murderer of Mrs. S., given the evidence’
- ‘the probability that Shakespeare’s plays were written by Francis Bacon’
Thus probabilities can be used to describe assumptions, and to describe inferences given those assumptions. This more general use of probability to quantify beliefs is known as the Bayesian viewpoint. It is also known as the subjective interpretation of probability, since the probabilities depend on assumptions. Advocates of a Bayesian approach to data modelling and pattern recognition do not view this subjectivity as a defect, since in their view,
you cannot do inference without making assumptions.
N.B. Readers should be warned that this is not yet a globally held view.
In non-Bayesian methods probabilities are allowed to describe only random variables, while Bayesians also use probabilities to describe inferences.
2.3 Forward probabilities and inverse probabilities
Before diving in, let's first take a look at generative models and discriminative models.
Digress: Generative Models
In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters.
This is a bit hard to grasp: what exactly does "randomly generating" mean? Let's look at Quora: What is a generative model? - Jack Rae:
A generative model describes how data is generated, in terms of a probabilistic model.
Also, in the ML context:
In the scenario of supervised learning, a generative model estimates the joint probability distribution $P(x, y)$ between the observed data $x$ and the corresponding labels $y$.
This description is much easier to understand. In other words, a generative model describes the rules by which the examples are generated, e.g., the joint distribution $P(x, y)$.
As Quora: What is a generative model? - Eren Golge also puts it:
Generative model assumes data is created by a particular distribution which is defined by a couple of parameters (Gaussian Distribution) or non-parametric variants…
Common generative models include:
- Gaussians
- Mixtures of multinomials
- Naive Bayes
- Hidden Markov Models
- Latent Dirichlet Allocation
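To make "randomly generating observable data given some hidden parameters" concrete, here is a minimal sketch of the simplest kind of generative model listed above (all parameter values are made up): sample a label $y$ from a prior, then sample $x$ from a class-conditional Gaussian, i.e., sample from the joint $P(x, y) = P(y)\,P(x \mid y)$.

```python
import random

random.seed(0)

# Made-up hidden parameters of a toy generative model.
prior = {"cat": 0.3, "dog": 0.7}     # P(y)
mean  = {"cat": 0.0, "dog": 5.0}     # parameters of P(x | y)
std   = {"cat": 1.0, "dog": 2.0}

def sample():
    """Draw one (x, y) pair from the joint P(x, y) = P(y) P(x | y)."""
    y = random.choices(list(prior), weights=list(prior.values()))[0]
    x = random.gauss(mean[y], std[y])
    return x, y

print([sample() for _ in range(3)])
```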
In addition, Generative vs. Discriminative Models mentions:
- Pros:
  - We have the knowledge about the data distribution.
- Cons:
  - Very expensive to get (a lot of parameters);
  - Need lots of data.
Talking about parameters touches on another point, as Cross Validated: Generative vs. discriminative - raegtin puts it:
Another way of thinking about this is that generative algorithms make some kind of structure assumptions on your model, but discriminative algorithms make fewer assumptions.
For example, when we say the data follow a Gaussian distribution, that is exactly such a structural assumption.
Digress: Discriminative Model
Discriminative models have always been easier to understand than generative ones. Again quoting Generative vs. Discriminative Models:
Algorithms aim at learning $P(y \mid x)$ by using probabilistic approaches (e.g., logistic regression), or non-probabilistically by mapping classes from a set of points (e.g., perceptrons and SVMs).
- Pros:
  - Easy to model.
- Cons:
  - Can classify, but cannot generate the data.
Back to the main topic. Probability calculations often fall into one of two categories: forward probability and inverse probability.
- Forward probability problems
  - Involve a generative model of a process (here, 'process' means the process that generates the examples).
  - The task is to compute the probability distribution or expectation of some quantity that depends on the process.
  - Given the assumptions and the parameters, computing expectations, variances, etc. are all forward problems (see the sketch after this list).
- Inverse probability problems
  - Also involve a generative model of a process.
  - But instead of computing the distribution of a quantity produced by the process, we compute the conditional probability of one or more of the unobserved variables in the process, given the observed variables.
  - e.g., HMMs.
  - This invariably requires the use of Bayes' theorem.
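Here is a minimal sketch of a forward problem (the coin model and all numbers are made up for illustration): the generative model is fully specified, and we just compute the distribution and expectation of a quantity produced by the process. An inverse example appears under "Terminology of inverse probability" below.

```python
from math import comb

# Forward problem: the generative model is fully specified
# (a coin with known head-probability theta, flipped N times);
# we compute the distribution of a quantity produced by the process.
theta, N = 0.1, 3

dist = {k: comb(N, k) * theta**k * (1 - theta)**(N - k) for k in range(N + 1)}
mean = sum(k * p for k, p in dist.items())

print(dist)   # P(#heads = 0..3)
print(mean)   # expectation = N * theta = 0.3
```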
Digress: Marginal Probability
Everyone knows the basic form: $P(x) = \sum_{y} P(x, y)$.
A slightly more complex example: with three variables, $P(x, y) = \sum_{z} P(x, y, z)$, and likewise $P(x) = \sum_{y} \sum_{z} P(x, y, z)$.
Terminology of inverse probability
If $\theta$ denotes the unknown parameters, $D$ denotes the data, and $\mathcal{H}$ denotes the overall hypothesis space, Bayes' theorem can be written:

$$P(\theta \mid D, \mathcal{H}) = \frac{P(D \mid \theta, \mathcal{H})\, P(\theta \mid \mathcal{H})}{P(D \mid \mathcal{H})}$$

For fixed data $D$:

- The conditional probability $P(\theta \mid D, \mathcal{H})$ is called the posterior probability of $\theta$ given $D$.
- The conditional probability $P(D \mid \theta, \mathcal{H})$ is called the likelihood of the parameters $\theta$.
  - It is also the probability of $D$ given $\theta$.
- The marginal probability $P(\theta \mid \mathcal{H})$ is called the prior probability of the parameters $\theta$.
- The marginal probability $P(D \mid \mathcal{H})$ is known as the evidence or the marginal likelihood.

These terms are illustrated numerically in the sketch below.
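As a small worked example of this terminology (the coin setup, the grid of candidate $\theta$ values, and the flat prior are my own illustrative choices, not from the book), the sketch computes a discrete posterior over a coin's head-probability $\theta$ after observing 3 heads in 10 flips, labelling the prior, likelihood, evidence, and posterior explicitly:

```python
from math import comb

# Observed data D: 3 heads out of 10 flips of a coin with unknown head-probability theta.
n, k = 10, 3

thetas = [i / 10 for i in range(1, 10)]        # candidate parameter values 0.1 ... 0.9
prior = {t: 1 / len(thetas) for t in thetas}   # P(theta): flat prior

# Likelihood P(D | theta) for each candidate theta (binomial model).
likelihood = {t: comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas}

# Evidence (marginal likelihood): P(D) = sum_theta P(D | theta) P(theta).
evidence = sum(likelihood[t] * prior[t] for t in thetas)

# Posterior: P(theta | D) = P(D | theta) P(theta) / P(D).
posterior = {t: likelihood[t] * prior[t] / evidence for t in thetas}

print(max(posterior, key=posterior.get))  # 0.3 is the most probable value of theta
```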
The likelihood principle
The likelihood principle: given a generative model for data $d$ given parameters $\theta$, $P(d \mid \theta)$, and having observed a particular outcome $d_1$, all inferences and predictions should depend only on the function $P(d_1 \mid \theta)$.
In spite of the simplicity of this principle, many classical statistical methods violate it.