BP-203 Foundations for Mathematical Biology

Statistics Lecture III

By Hao Li

Nov 8, 2001

Statistical Modeling and Inference

data collection

constructing probabilistic model

inference of model parameters

interpreting results

making new predictions

Maximum likelihood Approach

Example A: Toss a coin N times, observe m heads in a specific sequence

Model: binomial distribution

Inference: the parameter p

Prediction: e.g., how many heads will be observed in another L trials

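Prob. of observing a specific sequence with m heads

$P(m \mid p) = p^{m} (1-p)^{N-m}$

Find the p that maximizes the above prob.

$\left. \frac{\partial \log P(m \mid p)}{\partial p} \right|_{p=\hat{p}} = 0$

$\hat{p} = m / N$

$\log P(m \mid \hat{p}) = N [\hat{p} \log \hat{p} + (1-\hat{p}) \log(1-\hat{p})]$

The term in brackets is the negative entropy (-entropy) of a coin with heads probability $\hat{p}$.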
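As a quick numerical check of Example A, the sketch below computes the estimate and verifies that the maximized log-likelihood equals minus N times the entropy. The counts `m = 13`, `N = 30` and the helper names (`coin_mle`, `log_lik`, `entropy`, `predict_heads`) are illustrative assumptions, not part of the lecture; the prediction is a simple plug-in of the estimate.

```python
import math

def coin_mle(m, N):
    """Maximum likelihood estimate of the heads probability: p_hat = m / N."""
    return m / N

def log_lik(m, N, p):
    """log P(m | p) for one specific sequence with m heads in N tosses."""
    ll = 0.0
    if m > 0:                      # convention: 0 * log(0) = 0
        ll += m * math.log(p)
    if N - m > 0:
        ll += (N - m) * math.log(1.0 - p)
    return ll

def entropy(p):
    """Entropy (in nats) of a coin with heads probability p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def predict_heads(p_hat, L):
    """Plug-in prediction (an assumption): expected heads in L new tosses."""
    return L * p_hat

m, N = 13, 30                      # illustrative counts
p_hat = coin_mle(m, N)
max_ll = log_lik(m, N, p_hat)      # equals -N * entropy(p_hat)
print(p_hat, max_ll, -N * entropy(p_hat), predict_heads(p_hat, 100))
```

Any other value of p, for example `log_lik(m, N, 0.5)`, gives a smaller log-likelihood than `max_ll`.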

How good is the estimate?

Distribution of $\hat{p}$ under repeated sampling

Central limit theorem → distribution of m approaches normal for large N

$m \sim Np \pm \sqrt{Np(1-p)}$

$\hat{p} \sim p \pm \sqrt{p(1-p)/N}$

Thus the estimate converges to the true p, with an error that shrinks as $1/\sqrt{N}$ (square-root convergence)
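The spread of the estimate under repeated sampling can be checked by simulation. This is a minimal sketch; the true value `p = 0.3`, the sample size `N = 400`, the number of repeats and the seed are arbitrary assumptions chosen for illustration.

```python
import math
import random
import statistics

def simulate_p_hat(p, N, repeats, seed=0):
    """Repeat the N-toss experiment and return the list of estimates m / N."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(N)) / N
            for _ in range(repeats)]

p, N = 0.3, 400                              # assumed true p and sample size
estimates = simulate_p_hat(p, N, repeats=2000)

empirical_sd = statistics.pstdev(estimates)  # spread seen in the simulation
predicted_sd = math.sqrt(p * (1 - p) / N)    # sqrt(p(1-p)/N) from the CLT
print(empirical_sd, predicted_sd)
```

Quadrupling `N` should roughly halve both numbers, which is the square-root convergence stated above.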

Maximum likelihood Approach

Example B: $x_1, x_2, \ldots, x_N$

independent and identically distributed (i.i.d.) sample drawn from a normal distribution $N(\mu, \sigma^2)$

Estimate the mean and the variance

Maximizing the likelihood function gives (show this is true in the homework)

$\hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$

$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
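The two estimators of Example B are easy to evaluate numerically. The sketch below applies them to a synthetic sample; the true mean 2.0, standard deviation 3.0, sample size and seed are assumptions made only for this illustration, and `normal_mle` is a hypothetical helper name.

```python
import random
import statistics

def normal_mle(xs):
    """Maximum likelihood estimates (mu_hat, var_hat) for an i.i.d. normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # divides by N, not N - 1
    return mu_hat, var_hat

rng = random.Random(1)
xs = [rng.gauss(2.0, 3.0) for _ in range(5000)]        # synthetic sample

mu_hat, var_hat = normal_mle(xs)
print(mu_hat, var_hat)
print(statistics.pvariance(xs))   # population variance, same 1/N convention
```

Note that the maximum likelihood variance divides by N, so it matches `statistics.pvariance` rather than the N - 1 version in `statistics.variance`.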

General formulation of the maximum likelihood approach

D: observed data

M: the statistical model

θ: parameters of the model

$P(D \mid M, \theta)$: probability of observing the data given the model and parameters

$L(\theta; D) \equiv P(D \mid M, \theta)$: the likelihood of θ given the data

Maximum likelihood estimate of the parameters

$\hat{\theta} = \arg\max_{\theta} L(\theta; D)$

Theorem: $\hat{\theta}$ converges to the true $\theta_0$ in the large sample limit, with error $\sim 1/\sqrt{N}$
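When no closed-form solution is available, the arg max can be approximated numerically. The sketch below uses a plain grid search, which is an illustrative choice rather than anything prescribed in the lecture, and applies it to the coin model of Example A with assumed counts `m = 13`, `N = 30`. The function names are hypothetical.

```python
import math

def argmax_likelihood(log_lik, grid):
    """Return the grid point theta that maximizes log L(theta; D)."""
    return max(grid, key=log_lik)

def coin_log_lik(p, m=13, N=30):
    """log L(p; D) for a specific sequence with m heads in N tosses (0 < p < 1)."""
    return m * math.log(p) + (N - m) * math.log(1.0 - p)

grid = [i / 1000 for i in range(1, 1000)]    # candidate values 0.001 ... 0.999
theta_hat = argmax_likelihood(coin_log_lik, grid)
print(theta_hat, 13 / 30)                    # grid estimate vs. closed form m / N
```

Maximizing the log-likelihood instead of the likelihood gives the same arg max and avoids numerical underflow for long sequences.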

Example C: Segmentation

A sequence of heads (1) and tails (0) is generated by first using a coin with heads probability p1 and then changing to a coin with heads probability p2; the change point is unknown.

Data = (001010000000010111101111100010)

$P(\mathrm{seq}, x \mid p_1, p_2) = p_1^{m_1(x)} (1-p_1)^{x - m_1(x)} \, p_2^{m_2(x)} (1-p_2)^{N - x - m_2(x)}$

x: position right before the change

$m_1(x)$: number of 1's up to x

$m_2(x)$: number of 1's after x

N: total number of tosses

Example C continued

For fixed x, maximize $P(\mathrm{seq}, x \mid p_1, p_2)$ with respect to $p_1$ and $p_2$

$\hat{p}_1 = m_1(x) / x$

$\hat{p}_2 = m_2(x) / (N - x)$

$\log P(\mathrm{seq}, x \mid \hat{p}_1, \hat{p}_2) = x [\hat{p}_1 \log \hat{p}_1 + (1-\hat{p}_1) \log(1-\hat{p}_1)] + (N - x) [\hat{p}_2 \log \hat{p}_2 + (1-\hat{p}_2) \log(1-\hat{p}_2)]$

Then maximize $P(\mathrm{seq}, x \mid \hat{p}_1, \hat{p}_2)$ with respect to x

The above approach is sometimes referred to as "entropic segmentation", as it tries to minimize the total entropy.

A generalization of the above model to a 4-letter alphabet and an unknown number of break points can be used to segment DNA sequences into regions of different composition. Such a model is more naturally described by a hidden Markov model.
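The two-step maximization of Example C (first over p1 and p2 for each fixed x, then over x) can be sketched as a single scan over all candidate change points. This is a minimal illustration for the single change point case only; the function names are hypothetical, and ties are broken in favor of the smallest x, which is an arbitrary choice.

```python
import math

def bernoulli_max_log_lik(ones, n):
    """Maximized log-likelihood of n tosses with `ones` heads: -n * entropy(ones / n)."""
    ll = 0.0
    for k in (ones, n - ones):
        if k > 0:                         # convention: 0 * log(0) = 0
            ll += k * math.log(k / n)
    return ll

def segment(seq):
    """Return (x_hat, p1_hat, p2_hat, log_lik) for a 0/1 string with one change point."""
    bits = [int(c) for c in seq]
    N = len(bits)
    total = sum(bits)
    best = None
    m1 = 0                                # number of 1's up to x
    for x in range(1, N):                 # x = position right before the change
        m1 += bits[x - 1]
        m2 = total - m1                   # number of 1's after x
        ll = bernoulli_max_log_lik(m1, x) + bernoulli_max_log_lik(m2, N - x)
        if best is None or ll > best[3]:
            best = (x, m1 / x, m2 / (N - x), ll)
    return best

data = "001010000000010111101111100010"   # the sequence from Example C
x_hat, p1_hat, p2_hat, ll = segment(data)
print(x_hat, p1_hat, p2_hat, ll)
```

Because the running count `m1` is updated incrementally, the scan costs O(N) rather than recounting the 1's for every x.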

Example D: detecting weak common sequence patterns in a set of related sequences

e.g., local sequence motifs for functionally or structurally related proteins (no overall sequence similarity)

regulatory elements in the upstream regions of co-regulated genes (e.g., genes clustered together by microarray data)

the simplest situation: each sequence contains one realization of the motif with a given length, but the starting positions are unknown

Example: 22 genes identified as Pho4 targets by microarray (O'Shea lab)

YAR071W:600:-600 \catcaagatgagaaaataaagggattttttcgttcttttatcattttctctttctcacttccgactacttcttatatctactttcatcgtttcattcatcgtgggtgtctaataaagtttta atgacagagataaccttgataagctttttcttatacgctgtgtcacgtatttattaaattaccacgttttcgcataacattctgtagttcatgtgtactaaaaaaaaaaaaaaaaaa gaaataggaaggaaagagtaaaaagttaatagaaaacagaacacatccctaaacgaagccgcacaatcttggcgttcacacgtgggtttaaaaaggcaaattacacag aatttcagaccctgtttaccggagagattccatattccgcacgtcacattgccaaattggtcatctcaccagatatgttatacccgttttggaatgagcataaacagcgtcgaa ttgccaagtaaaacgtatataagctcttacatttcgatagattcaagctcagtttcgccttggttgtaaagtaggaagaagaagaagaagaagaggaacaacaacagcaaa gagagcaagaacatcatcagaaatacca\

YBR092C:600:-600 \aatcaatgacttctacgactatgctgaaaagagagtagccggtactgacttcctaaaggtctgtaacgtcagcagcgtcagtaactctactgaattgaccttctactgggac tggaacactactcattacaacgccagtctattgagacaatagttttgtataactaaataatattggaaactaaatacgaatacccaaattttttatctaaattttgccgaaagatta aaatctgcagagatatccgaaacaggtaaatggatgtttcaatccctgtagtcagtcaggaacccatattatattacagtattagtcgccgcttaggcacgcctttaattagca aaatcaaaccttaagtgcatatgccgtataagggaaactcaaagaactggcatcgcaaaaatgaaaaaaaggaagagtgaaaaaaaaaaaattcaaaagaaatttacta aataataccagtttgggaaatagtaaacagctttgagtagtcctatgcaacatatataagtgcttaaatttgctggatggaagtcaattatgccttgattatcataaaaaaaata ctacagtaaagaaagggccattccaaattacct\

YBR093C:600:-600 \cgctaatagcggcgtgtcgcacgctctctttacaggacgccggagaccggcattacaaggatccgaaagttgtattcaacaagaatgcgcaaatatgtcaacgtatttgg aagtcatcttatgtgcgctgctttaatgttttctcatgtaagcggacgtcgtctataaacttcaaacgaaggtaaaaggttcatagcgctttttctttgtctgcacaaagaaatata tattaaattagcacgttttcgcatagaacgcaactgcacaatgccaaaaaaagtaaaagtgattaaaagagttaattgaataggcaatctctaaatgaatcgatacaaccttg gcactcacacgtgggactagcacagactaaatttatgattctggtccctgttttcgaagagatcgcacatgccaaattatcaaattggtcaccttacttggcaaggcatatac ccatttgggataagggtaaacatctttgaattgtcgaaatgaaacgtatataagcgctgatgttttgctaagtcgaggttagtatggcttcatctctcatgagaataagaacaa caacaaatagagcaagcaaattcgagattacca\

YBR296C:600:-600 \gaaatctcggtttcacccgcaaaaaagtttaaatttcacagatcgcgccacaccgatcacaaaacggcttcaccacaagggtgtgtggctgtgcgatagaccttttttttctt tttctgctttttcgtcatccccacgttgtgccattaatttgttagtgggcccttaaatgtcgaaatattgctaaaaattggcccgagtcattgaaaggctttaagaatataccgtac aaaggagtttatgtaatcttaataaattgcatatgacaatgcagcacgtgggagacaaatagtaataatactaatctatcaatactagatgtcacagccactttggatccttcta ttatgtaaatcattagattaactcagtcaatagcagattttttttacaatgtctactgggtggacatctccaaacaattcatgtcactaagcccggttttcgatatgaagaaaattat atataaacctgctgaagatgatctttacattgaggttattttacatgaattgtcatagaatgagtgacatagatcaaaggtgagaatactggagcgtatctaatcgaatcaatat aaacaaagattaagcaaaaatg\