PWM (PSSM) / Sequence Logo
- PWM: Position Weight Matrix
- PSSM: Position-Specific Scoring Matrix
The two names refer to the same concept: a probabilistic representation of a motif. We start from the simpler PFM and PPM.
1. Motif Model / PFM: Position Frequency Matrix / PPM: Position Probability Matrix
Consider this example from RSA-tools: Introduction to cis-regulation:
The TRANSFAC (TRANScription FACtor) database contains 8 binding sites for the yeast transcription factor Pho4p
- 5/8 contain the core of high-affinity binding sites (CACGTG)
- 3/8 contain the core of medium-affinity binding sites (CACGTT)
We want to derive the corresponding TFBM (Transcription Factor Binding Motif) from these 8 TFBSs (transcription factor binding sites). The count matrix built from the aligned sites is the PFM (Position Frequency Matrix). Dividing every cell by 8, the number of sites, gives the PPM (Position Probability Matrix).
Both PPMs (and PWMs as well) assume statistical independence between positions in the pattern, as the probabilities for each position are calculated independently of other positions. Each column can therefore be regarded as an independent multinomial distribution.
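We call this PPM the motif model $M$, where $M_{k,j}$ is the probability of nucleotide $k$ at position $j$.
E.g. for a sequence $S = s_1 s_2 \dots s_n$ of the same length as the motif, $P(S \mid M) = \prod_{j=1}^{n} M_{s_j, j}$.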
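A minimal sketch of the PFM to PPM step and of the per-position product, in Python. It assumes that the two 6-bp cores quoted above stand in for the 8 aligned binding sites (the full TRANSFAC sites are longer); the function names are illustrative only.

```python
from collections import Counter

ALPHABET = "ACGT"

# Assumption: the 6-bp cores stand in for the 8 aligned binding sites.
sites = ["CACGTG"] * 5 + ["CACGTT"] * 3

def count_matrix(sites):
    """PFM: one Counter of nucleotide counts per position."""
    length = len(sites[0])
    return [Counter(site[j] for site in sites) for j in range(length)]

def probability_matrix(pfm, n_sites):
    """PPM: divide every count by the number of sites."""
    return [{k: col[k] / n_sites for k in ALPHABET} for col in pfm]

def motif_probability(seq, ppm):
    """P(S | M) as a product of independent per-position probabilities."""
    p = 1.0
    for j, nucleotide in enumerate(seq):
        p *= ppm[j][nucleotide]
    return p

pfm = count_matrix(sites)
ppm = probability_matrix(pfm, len(sites))
print(ppm[5])                            # last column: G = 5/8, T = 3/8
print(motif_probability("CACGTG", ppm))  # 0.625
```

Any sequence containing a nucleotide never seen at some position gets probability exactly 0 here, which is the problem pseudocounts address.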
Pseudocounts (or Laplace estimators) are often applied when calculating PPMs if based on a small dataset, in order to avoid matrix entries having a value of 0.
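- Originally: $M_{k,j} = \frac{n_{k,j}}{N}$, where $n_{k,j}$ is the count of nucleotide $k$ at position $j$ and $N$ is the number of sites
- 1st option: identically distributed pseudo-weight: $M_{k,j} = \frac{n_{k,j} + w/4}{N + w}$, where $w$ is the total pseudo-weight
- 2nd option: pseudo-weight distributed according to nucleotide priors: $M_{k,j} = \frac{n_{k,j} + w \, p_k}{N + w}$, where $p_k$ is the prior probability of nucleotide $k$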
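Both pseudocount options can be sketched with one function, since the first is the second with uniform priors. The parameter name `pseudo_weight`, its default of 1, and the AT-rich priors below are assumptions made for illustration, not values from the source.

```python
ALPHABET = "ACGT"

def smoothed_column(counts, pseudo_weight=1.0, priors=None):
    """Pseudocount-corrected probabilities for one matrix column.

    counts: dict nucleotide -> count at this position.
    priors: dict nucleotide -> prior probability; None means uniform (1st option).
    """
    if priors is None:
        priors = {k: 1.0 / len(ALPHABET) for k in ALPHABET}
    total = sum(counts.get(k, 0) for k in ALPHABET)
    return {
        k: (counts.get(k, 0) + pseudo_weight * priors[k]) / (total + pseudo_weight)
        for k in ALPHABET
    }

# Last position of the Pho4p core: 5 sites with G, 3 sites with T.
column = {"G": 5, "T": 3}

uniform = smoothed_column(column)  # 1st option
# 2nd option, with hypothetical AT-rich priors.
at_rich = smoothed_column(column, priors={"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
```

With either option no entry is 0 any more, and each column still sums to 1.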
2. Background Models (Genomic Context)
The following is excerpted from RSA-tools: Sequence models.
Why do we need a background model? Any motif discovery relies on an underlying model to estimate the random expectation. The choice of an inappropriate model can lead to false conclusions. In practice, a sequence model can be used to generate random sequences, which will serve to validate some theoretical assumptions.
What is the probability for a given sequence segment (oligonucleotide, “word”) to be found at a given position of a DNA sequence? Different models can be chosen. We denote the background model by $B$.
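2.1 Bernoulli Model
- Assumes independence between successive nucleotides.
- The probability of each nucleotide is fixed a priori
  - I.e. the model is fully specified by four values $b_A, b_C, b_G, b_T$ with $\sum_k b_k = 1$
- For any sequence $S = s_1 s_2 \dots s_L$, we assume “$S$ is a background sequence”, i.e. “$S$ is not a motif”, and then compute the probability (likelihood) of $S$ given the background model $B$
  - $P(S \mid B) = \prod_{i=1}^{L} b_{s_i}$
- Particular case: equiprobable nucleotides
  - I.e. $b_A = b_C = b_G = b_T = 0.25$, so $P(S \mid B) = 0.25^L$
  - Simple, but NOT realistic.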
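As a rough illustration of the Bernoulli likelihood, the sketch below sums log-probabilities (to avoid underflow on long sequences) and exponentiates at the end. The AT-rich priors are hypothetical values chosen for the example.

```python
import math

def bernoulli_log_prob(seq, priors):
    """log P(S | B) under a Bernoulli (order-0) background."""
    return sum(math.log(priors[n]) for n in seq)

# Hypothetical AT-rich priors, for illustration only.
priors = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
equiprobable = {n: 0.25 for n in "ACGT"}

p_biased = math.exp(bernoulli_log_prob("CACGTG", priors))       # 0.2**4 * 0.3**2
p_equal = math.exp(bernoulli_log_prob("CACGTG", equiprobable))  # 0.25**6
```

Under the equiprobable model every word of the same length has the same probability, which is why that model is simple but rarely realistic.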
2.2 Markov Model
- The probability of each nucleotide depends on the $m$ preceding nucleotides.
- The parameter $m$ is called the order of the Markov model
- N.B. a Markov model of order 0 is a Bernoulli model.
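2.2.1 Transition Matrix
- Each row specifies one prefix (the $m$ preceding nucleotides), each column one suffix (the current nucleotide).
- Each value is $P(\text{suffix} \mid \text{prefix})$, the probability of observing the suffix immediately after the prefix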
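2.2.2 Model Estimation (Training)
The transition probability $P(r \mid w)$, for a prefix $w$ of length $m$ and a suffix $r$, is estimated from word counts in a sequence collection: $P(r \mid w) = \frac{N(wr)}{\sum_{r'} N(wr')}$, where $N(wr)$ is the number of occurrences of the $(m+1)$-mer $wr$.
So counting all $(m+1)$-mers in the sequence collection is enough to fill in the transition matrix.
How, then, do we compute $P(S \mid B)$, the probability of a sequence $S$ under a Markov background model $B$?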
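- In general, by the chain rule, $P(S) = P(s_1) \, P(s_2 \mid s_1) \, P(s_3 \mid s_1 s_2) \cdots P(s_L \mid s_1 \dots s_{L-1})$, where $S = s_1 s_2 \dots s_L$
  - E.g. $P(\text{ACG}) = P(\text{A}) \, P(\text{C} \mid \text{A}) \, P(\text{G} \mid \text{AC})$
- But computing this directly is too expensive, so we use a trick: the Markov Assumption
  - Assume the probability of a character is only dependent on the previous character, not the entire prefix
  - This gives $P(S) = P(s_1) \prod_{i=2}^{L} P(s_i \mid s_{i-1})$
    - E.g. $P(\text{ACG}) = P(\text{A}) \, P(\text{C} \mid \text{A}) \, P(\text{G} \mid \text{C})$
  - A statistical process that uses the Markov Assumption is called a Markov Chain.
  - The assumption generalizes to order $m$: $P(S) = P(s_1 \dots s_m) \prod_{i=m+1}^{L} P(s_i \mid s_{i-m} \dots s_{i-1})$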
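Training and scoring with an order-$m$ Markov model can be sketched as follows. This is one possible implementation under stated assumptions, not the RSA-tools one: the function names, the add-one pseudocount, and the toy training sequences are all hypothetical, and the first $m$ letters of the scored sequence are conditioned on rather than scored.

```python
import math
from collections import defaultdict

def train_markov(sequences, order, alphabet="ACGT", pseudo=1.0):
    """Estimate P(suffix | prefix) from (order+1)-mer counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    transitions = {}
    for prefix, row in counts.items():
        total = sum(row[a] for a in alphabet) + pseudo * len(alphabet)
        transitions[prefix] = {a: (row[a] + pseudo) / total for a in alphabet}
    return transitions

def markov_log_prob(seq, transitions, order, alphabet="ACGT"):
    """log P(s_{m+1} ... s_L | s_1 ... s_m).

    Simplification: the first `order` letters are not scored, and a
    prefix never seen in training falls back to a uniform row.
    """
    uniform = {a: 1.0 / len(alphabet) for a in alphabet}
    logp = 0.0
    for i in range(order, len(seq)):
        row = transitions.get(seq[i - order:i], uniform)
        logp += math.log(row[seq[i]])
    return logp

# Toy training collection, for illustration only.
training = ["ACGTACGTAC", "CACGTGCACG"]
model = train_markov(training, order=1)
print(markov_log_prob("CACGTG", model, order=1))
```

With `order=0` the only prefix is the empty string, so the same code reduces to a Bernoulli model, matching the remark that a Markov model of order 0 is a Bernoulli model.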
2.2.3 How to choose the sequence collection?
Suppose we want to estimate the expected word frequencies for a set of length-$L$ yeast upstream sequences. Candidate collections for training the background model:
- whole yeast genome
  - But this will bias the estimates towards coding frequencies, especially in microbial organisms, where the majority of the genome is coding.
- whole set of yeast intergenic sequences
  - More accurate than whole-genome estimates, but still biased because intergenic sequences include both upstream and downstream sequences
- whole set of length-$L$ yeast upstream sequences
  - Requires a calibration for each sequence size
- whole set of upstream sequences, fixed size (default on the web site)
  - I.e. whatever your $L$ is, the site provides a collection of one fixed length
  - Reasonably good estimate for microbes, NOT for higher organisms.
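3. Position Weight Matrix
A PWM combines the motif model $M$ with the background model $B$ as a log-odds ratio: $W_{k,j} = \log_2 \frac{M_{k,j}}{b_k}$, where $M_{k,j}$ is the PPM probability of nucleotide $k$ at position $j$ and $b_k$ is the background probability of $k$.
Specifically:
- $W_{k,j} > 0$ when $M_{k,j} > b_k$, indicating that $k$ in position $j$ is favorable to the motif
  - In the TFBM setting, this means that a $k$ at position $j$ makes the site more likely to be bound by the TF
- $W_{k,j} < 0$ when $M_{k,j} < b_k$, indicating that $k$ in position $j$ is unfavorable to the motif
For a sequence $S = s_1 s_2 \dots s_n$, the score is the sum of the matching entries, $\sum_{j=1}^{n} W_{s_j, j} = \log_2 \frac{P(S \mid M)}{P(S \mid B)}$, i.e. the log-likelihood ratio of the motif model versus a Bernoulli background.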
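A small sketch of the log-odds conversion and of scoring a sequence, assuming a Bernoulli background and a PPM with no zero entries (i.e. pseudocounts already applied). The two-position PPM is a made-up example chosen so that the weights come out as whole numbers.

```python
import math

ALPHABET = "ACGT"

def log_odds_matrix(ppm, background):
    """W[j][k] = log2(M[j][k] / b[k]); the PPM must contain no zeros."""
    return [
        {k: math.log2(col[k] / background[k]) for k in ALPHABET}
        for col in ppm
    ]

def score(seq, pwm):
    """Sum of the weights selected by the sequence, one per position."""
    return sum(pwm[j][n] for j, n in enumerate(seq))

# Hypothetical 2-position PPM and equiprobable background.
ppm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {k: 0.25 for k in ALPHABET}

pwm = log_odds_matrix(ppm, background)
print(score("AA", pwm))  # 1.0: A is twice as likely as background at position 1
print(score("GA", pwm))  # -1.0: G is half as likely as background at position 1
```

The second column equals the background, so it contributes a weight of 0 whatever the nucleotide.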
4. Sequence Logo
4.1 Shannon Entropy
Shannon Entropy is a measure of the uncertainty of a model, in the sense of how unpredictable a sequence generated from such a model would be.
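For the single-nucleotide background model (i.e. Bernoulli model) with nucleotide probabilities $b_k$, the entropy is $H(B) = -\sum_{k \in \{A,C,G,T\}} b_k \log_2 b_k$.
Similarly, we can compute the entropy at each position $j$ of the motif model $M$: $H_j = -\sum_{k} M_{k,j} \log_2 M_{k,j}$, with the convention $0 \log_2 0 = 0$.
Back to the background entropy. The maximum entropy of $B$ is reached for equiprobable nucleotides: $H(B) = -4 \cdot \frac{1}{4} \log_2 \frac{1}{4} = 2$.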
When the logarithms are base 2, the unit of such a quantity is the “bit”, as with BLAST scores. With natural logs, the unit is the “nit”. We can think of this value of 2 bits as the information content associated with knowing a particular nucleotide. The number of bits can also be understood as the number of yes/no questions necessary to unambiguously determine an unknown nucleotide. You could ask, “Is it a purine?” If the answer is “no”, you could then ask, “Is it C?” The answer to the second question always determines the nucleotide’s identity, non-canonical nucleotides aside.
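The entropy values quoted in this section are easy to check numerically. A minimal sketch, using the usual convention that terms with probability 0 contribute nothing:

```python
import math

def shannon_entropy(probs):
    """Entropy in bits; terms with p == 0 are skipped (0 * log 0 := 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0, the maximum for 4 letters
print(shannon_entropy([0.5, 0.0, 0.0, 0.5]))      # 1.0
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```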
4.2 Information Content / Logo Height
Note that “Information Content” here always means the Information Content of the motif.
The Information Content of a motif at each position can be defined as the reduction in entropy. That is, the motif provides information inasmuch as it reduces the uncertainty compared to the background model.
The Information Content of position $j$ is given by:
- For amino acids, $R_j = \log_2 20 - (H_j + e_n)$
- For nucleic acids, $R_j = \log_2 4 - (H_j + e_n) = 2 - (H_j + e_n)$
  - Here $H_j$ is the Shannon entropy of position $j$ defined above, and $e_n$ is a small-sample correction
  - If $H_j = 0$, $R_j = 2$ (ignoring $e_n$, as in the two cases below):
    - No uncertainty at all: the nucleotide is completely specified (e.g. $p = \{1, 0, 0, 0\}$)
  - If $H_j = 1$, $R_j = 1$:
    - Uncertainty between two letters (e.g. $p = \{0.5, 0, 0, 0.5\}$)
    - Need 1 extra bit to determine which nucleotide it is.
  - If $H_j = 2$, $R_j = 0$:
    - Total uncertainty ($p = \{0.25, 0.25, 0.25, 0.25\}$)
    - 2 extra bits are required to specify a nucleotide in a 4-letter alphabet
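The approximation for the small-sample correction, $e_n$, is given by $e_n = \frac{1}{\ln 2} \cdot \frac{s - 1}{2n}$,
where $s$ is 4 for nucleotides and 20 for amino acids, and $n$ is the number of sequences in the alignment.
The Information Content $R_j$ is the total height of the letter stack at position $j$ of the logo; within the stack, nucleotide $k$ gets a height equal to its PPM probability $M_{k,j}$ times $R_j$.
- The taller the stack, the more the position differs from the background model and the more motif-like it is
- The shorter the stack, the less the position differs from the background model and the less likely it is part of a motif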
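The per-position information content and the resulting letter heights can be sketched as below. The function names are illustrative; the small-sample correction is applied only when the number of sites `n_sites` is given, and the example column reuses the 5/8 versus 3/8 split of the Pho4p cores.

```python
import math

def information_content(column, n_sites=None, alphabet_size=4):
    """R = log2(s) - (H + e_n) for one PPM column.

    column: dict letter -> probability (summing to 1).
    n_sites: number of aligned sequences; None skips the correction e_n.
    """
    entropy = sum(-p * math.log2(p) for p in column.values() if p > 0)
    correction = 0.0
    if n_sites is not None:
        correction = (alphabet_size - 1) / (2 * n_sites * math.log(2))
    return math.log2(alphabet_size) - (entropy + correction)

def letter_heights(column, n_sites=None, alphabet_size=4):
    """Height of each letter: its probability times the column's R."""
    r = information_content(column, n_sites, alphabet_size)
    return {k: p * r for k, p in column.items()}

conserved = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}
mixed = {"A": 0.0, "C": 0.0, "G": 0.625, "T": 0.375}

print(information_content(conserved))           # 2.0 bits without correction
print(information_content(conserved, n_sites=8))
print(letter_heights(mixed, n_sites=8))
```

With only 8 sites the correction is about 0.27 bits, so even a perfectly conserved position stays below the nominal 2 bits.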