Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.) Computational biochemistry and biophysic.pdf
Скачиваний:
68
Добавлен:
15.08.2013
Размер:
5.59 Mб
Скачать

330

Dunbrack

proportion of each secondary structure type in the PDB. 100 would be the sample size for the prior. Because 100 is small compared to the data sample size from the PDB, the prior will have only a small influence on the posterior distribution. The likelihood is Multinom (yα(r), yβ(r), yγ(r): θα(r), θβ(r), θγ(r)). The posterior is Dirichlet(θα(r), θβ(r), θγ(r): yα(r) nα(r), yβ(r) nβ(r), yγ(r) nγ(r)).

III. APPLICATIONS IN MOLECULAR BIOLOGY

A. Dirichlet Mixture Priors for Sequence Profiles

One of the most important goals in bioinformatics is the identification and sequence alignment of proteins that are very distantly related by descent from a common ancestor. Such remote homolog detection methods can proceed in the absence of any structural information for the proteins involved, relying instead on multiple sequence alignments of protein families. Profile methods have a long history [49,50] and are based on the idea that even very distantly related proteins have particular patterns of hydrophobic and polar amino acids and some well-conserved amino acids at certain positions in the sequence. A profile consists of the probabilities of the 20 amino acids at each position in a multiple sequence alignment of a protein family. In some cases, a ‘‘gap’’ amino acid is also defined to indicate the probability of a gap at each position in the alignment. Physical information derived from the protein structure, such as hydrophobicity, can also be included [50].

One limiting factor in some cases is that there may be only a few sequences that are obviously related, and a profile built from their multiple alignment is not likely to be very accurate on statistical grounds of a small sample. For instance, for a buried residue, we might have only Val at a particular position in all of the known sequences. The question arises whether this position in an alignment of many more family members would be likely to have Ile and Leu as well as Val. If the conserved residue is Pro, what is the likelihood that in other related sequences this residue might be different? Bayesian statistics is ideally suited to treat the different situations we might encounter in sequence analysis within a single consistent framework. The key is to identify prior distributions for profiles that when combined with the data (the known relatives in the family) will produce the best estimate possible of the profile that might be derived from the full sequence family from a set of known sequences of any size.

A prior distribution for sequence profiles can be derived from mixtures of Dirichlet distributions [16,51–54]. The idea is simple: Each position in a multiple alignment represents one of a limited number of possible distributions that reflect the important physical forces that determine protein structure and function. In certain core positions, we expect to get a distribution restricted to Val, Ile, Met, and Leu. Other core positions may include these amino acids plus the large hydrophobic aromatic amino acids Phe and Trp. There will also be positions that are completely conserved, including catalytic residues (often Lys, Glu, Asp, Arg, Ser, and other polar amino acids) and Gly and Pro residues that are important in achieving certain backbone conformations in coil regions. Cys residues that form disulfide bonds or coordinate metal ions are also usually well conserved.

A prior distribution of the probabilities of the 20 amino acids at a particular position in a multiple alignment can be represented by a Dirichlet distribution, described in Section II.E. That is, it is an expression of the values of θr, the probabilities of each residue type r, where r ranges from 1 to 20, and 20r 1θr 1:

Bayesian Statistics

331

p(θ1, θ2, . . . , θ20 1, α2, . . . , α20) Dirichlet (θ1, θ2, . . . , θ20 1, α2, . . . , α20)

(35)

 

 

 

 

20

 

 

 

Γ(α0)

 

r 1

θrαr 1

 

20

 

r 1 Γ(αr )

 

 

α0 20r 1αr represents the total number of counts that the prior distribution represents, and the αr, the counts for each type of amino acid (not necessarily integers). Because different distributions will occur in multiple sequence alignments, the prior distribution for any position should be represented as a mixture of N Dirichlet distributions:

p(θ1, θ2, . . . , θ20 1,1, α1,2, . . . , αN,19, αN,20; q1, q2, . . . , qN)

N

qj Dirichlet (αj,1, αj ,2, . . . , αj ,20) (36)

j 1

where the qi represent the mixture coefficients and αij the count of the ith amino acid type in the jth component of the mixture. One can imagine components that represent predominantly a single amino acid type commonly conserved (Gly, Pro, Trp, Cys) and others that represent physical properties such as hydrophobicity and charge. We can establish such distributions from an understanding of what we expect to see in multiple sequence alignments, or the mixture can be derived from a set of sequence alignments and some optimization procedure.

Sjo¨lander et al. [16] describe the process assumed in their model of sequence alignments, which is how the counts for a particular position in a multiple sequence alignment would arise from the mixture Dirichlet prior:

1.A component j is chosen from among the N Dirichlet components in Eq. (36) according to their respective probabilities, qj.

2.A probability distribution (θ1, θ2, . . . , θ20) for the 20 amino acids is chosen from component j according to Dirichlet (αj,1, αj,2, . . . , αj,20).

3.A count vector (n1, n2, . . . , n20) for the position is drawn from the multinomial distribution given (θ1, θ2, . . . , θ20).

For a single Dirichlet component, Dirichlet (αj,1, αj,2, . . . , αj,20), the expected posterior probability for amino acid type i, given the counts for the 20 amino acids in a single position of the multiple alignment is

i

ni αj,i

(37)

n αj,0

 

 

where n is the total of the ni and αj,0 is the total of the αj,i.

With a mixture we have to factor in the posterior probability that component j is the correct component:

N

ni αj,i

 

i p(αj |n)

(38)

n αj,0

 

j 1

 

 

 

332 Dunbrack

where

 

 

 

 

p(αj |n)

p(nj)p(αj)

 

p(nj)qj

 

(39)

 

 

p(n)

 

 

p(n)

 

 

 

 

Γ(n)Γ(αj,0)

20 Γ(ni αj,i)

 

qj

 

i 1 Γ(ni)Γ(αj,i)

 

Γ(n αj,0)

 

The parameters {αj,i} and {qj} are determined from multiple sequence alignments, e.g., those available from the BLOCKS [55], FSSP [3], or HSSP [56] databases. For each position in each alignment, there is a vector of counts, nt, and the objective is to find the set of α and q that maximizes the probability of the counts arising from the Dirichlet mixture, Πt p(nt , nt). As an example, in Table 1 we list the vectors of α for a mixture of Dirichlet distributions calculated from BLOCKS alignments by Kevin Karplus using an expectation maximization algorithm [16]. A notation at the bottom of each column describes the type of position most likely to be represented by this mixture component (hydrophobic, charged residues, etc.). The last three components represent conserved cysteine, proline, and glycine residues. The other components represent various physical properties.

To demonstrate how the use of a Dirichlet mixture regularizes a profile derived from a small number of sequences, we used the mixture in Table 1 on a set of 10 nuclear hormone receptor (nhr) sequences from C. elegans and compared this with the actual data distribution from the 10 sequences and the data and regularized distributions derived from 100 nhr sequences from C. elegans [57]. The results are shown in Table 2 for a short segment of nhr proteins beginning with the second conserved cysteine in the first zinc finger. Cys 21 (numbering is based on the nhr-25 sequence) is very conserved, which is evident in the data profiles (ni/n) and the posterior modes of the regularized distributions (i) derived from Eq. (38). For 100 sequences, the data profile and regularized profiles are quite similar, which is what we would expect from a Bayesian analysis with a large amount of data. However, when there are only 10 sequences, the regularized profiles are different from the data profiles and are in fact more similar to the profiles derived from 100 sequences. This is the intent of the Bayesian analysis—to estimate parameter values from limited data that will be as close as possible to parameter values we would calculate if we had a large amount of data.

B. Bayesian Sequence Alignment

The primary tool of bioinformatics is sequence alignment, and numerous algorithms have been developed for this purpose, including variants of pairwise and multiple sequence alignments, profile alignments, local and global sequence alignments, and exact and heuristic methods. Sequence alignment programs usually depend on a number of fixed parameters, including the choice of sequence similarity matrix and gap penalty parameters. The correct choice of these parameters is a much debated topic, and the choice may depend on the purpose of the alignments: to determine whether two sequences are in fact related to each other or to determine the ‘‘correct’’ alignment between them, whatever that may be.

The Bayesian alternative to fixed parameters is to define a probability distribution for the parameters and simulate the joint posterior distribution of the sequence alignment and the parameters with a suitable prior distribution. How can varying the similarity matrix

Table 1

a

Dirichlet Mixture Components

a This 12-component mixture was database. The first line lists the qj each component.

derived by Kevin Karplus (UC Santa value for each component (see text).

Cruz), http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html, from

The second line lists the total prior counts, α

0

. The last line provides a rough

the BLOCKS description of

Table 2

a

Raw Data and Posterior Modes from Dirichlet Mixtures for a Six Amino Acid Segment of Nuclear Hormone Receptors

334

a Sequences are all nhr-25 sequence.

from the C. elegans genome. Sequence and sequence numbering in the second and third columns refer to the

Dunbrack

Bayesian Statistics

335

and gap penalties be justified, case by case, when these parameters in fact determine the alignment? Similarity matrices, such as the PAM series [58–60], for instance, are parametrized as a function of evolutionary distance between the sequences to be compared. When comparing two sequences, we do not yet know the distance between them, and it is hard to say that a PAM250 or PAM100 matrix should be used. But certainly after we look at a proposed alignment, we have a better idea of what the distance is and what the matrix should be, even if this alignment would change with a different choice of matrix. Similarly, gap penalties may be different for different sequence pairs, because some proteins may be tolerant to insertions whereas others may not, depending on the pressure of natural selection. For instance, if two proteins are orthologs, i.e., they perform the same function in different species, we might expect fewer and shorter insertions and deletions. If two proteins are paralogs, i.e., homologous proteins that perform different functions in one species (or potentially in different species), then there may be more insertions and deletions to account for the change in function and/or substrate. Thus the use of fixed matrix and gap penalties is not entirely justifiable, other than for convenience.

Zhu et al. [15] and Liu and Lawrence [61] formalized this argument with a Bayesian analysis. They are seeking a joint posterior probability for an alignment A, a choice of distance matrix , and a vector of gap parameters, Λ, given the data, i.e., the sequences to be aligned: p(A, , Λ|R1, R2). The Bayesian likelihood and prior for this posterior distribution is

p(A, , Λ|R1, R2) p(R1, R2 |A, , Λ)p(A, , Λ) p(R1, R2)

(40)

p(R1, R2 |A, )p(A)p(Λ)p( ) p(R1, R2)

The second line results from the first, because and Λ are independent a priori, whereas the score of the alignment A depends on Λ, and the likelihood of the sequences, given the alignment, depends only on the scoring matrix .

The usual alignment algorithm fixes and Λ as 0 and Λ0, so that the prior is 1 when 0 and Λ Λ0 and is 0 otherwise. Clearly, if experience justifies this choice or some other nonuniform choice, we can choose an informative prior that biases the calculation to certain values for the parameters or limits them to some likely range. The likelihood is well defined by the alignment model defined by using a similarity matrix and affine gap penalties, so that

ln p(R1, R2 |A, )

 

AijΨR1

(i),R2(j)

(41)

 

i,j

The effect of the gap penalties occurs through the factor p(A

λkg(A)λlg(A) kg(A)

p(A) p(A | λ , λ ) 0 e

0 e λk0g(A)λleg(A) kg(A)

A

), which is given by

(42)

where λ0 and λe are the gap opening and extension parameters, respectively, and kg(A) and lg(A) are the total number and length of gaps in the alignment A. To obtain marginal posterior distributions for the gap penalties and scoring matrix, Eq. (40) must be summed

336

Dunbrack

over all alignments. For the gap penalties alone, we need to sum over both the alignments and the scoring matrices:

p(Λ|R1, R2)

p(R1, R2 |A, Θ)p(A )p(Λ)p(Θ)

(43)

p(R1, R2)

 

A Θ

 

 

 

The sum in Eq. (43) can be obtained by a recursion algorithm used commonly in dynamic programming [62].

C. Sequence–Structure Alignment

Threading methods have been developed by many groups over a number of years [18,49,63–74] and consist of three steps: (1) selecting a library of candidate folds usually derived from the Protein Data Bank (PDB) of experimentally determined protein structures; (2) selecting the most likely fold from the fold library for a protein sequence of unknown structure (the target sequence); and (3) simultaneously aligning the sequence of the target sequence to the structure (and sequence) from the fold library. Many factors can be considered in choosing the appropriate fold and aligning the sequence to the structure, including sequence similarity (with some amino acid substitution matrix such as the BLOSUM [75–77] or PAM [60] matrices); burial of hydrophobic amino acids; exposure of hydrophilic, especially charged, amino acids; contacts between hydrophobic amino acids (as distinct from simple burial from solvent); secondary structure predictions compared to the parent structure secondary structure; and others. Although the methods are designed to be applicable in cases where the target sequence is not, in fact, related to the parent by descent from a common ancestor (i.e., homologous), in practice most successful threadings are due to homologous relationships.

Lathrop and Smith [18] and Lathrop et al. [19], in a series of papers, have given a Bayesian theory of sequence–structure alignment. Their goal is to derive a model for protein threading that is Bayes-optimal, that is, selecting both the fold and sequence alignment that have the highest joint probability over all possible folds and sequence alignments. The advantage of the formalism they develop is that most other threading scoring functions can be expressed within their framework. The probability they wish to calculate is p(C, t |a, n), where C is the protein fold or ‘‘core’’ selected from the fold library, t is the alignment of the target sequence to the fold sequence and structure, a is the sequence of the target (assumed known), and n is its length.

Before setting up priors and likelihoods, we can factor the joint probability of the core structure choice and the alignment t by using Bayes’ rule:

p(C, t|a, n) p(t |a, n, C)p(C|a, n)

(44)

[This can be thought of as follows: If p(C, t) p(t |C)p(C), then we simply add the conditionality on a, n to each term to achieve Eq. (44).] Lathrop and coworkers provide a likelihood and prior for each term in Eq. (44). For the first term, considering y a as the data and θ t as the parameter we are looking for, and leaving n, C where they are on the right-hand side of the vertical bar [i.e., read Eq. (45) without the n, C, and it reads like the standard Bayesian equation p(θ|y) p(y)p(θ)/p(y)], we have

p(t |a, n, C)

p(a|t, n, C)p(t |n, C)

 

(45)

p(a |n, C)

 

 

Bayesian Statistics

337

For the second term, we use the same trick: Leave n on the right-hand side of the vertical and use the Bayesian statistics equation to invert C and a:

p(C|a, n) p(a|C, n)p(C|n) p(a|n)

Multiplying out Eqs. (45) and (46) and canceling the p(a|n, C) have

p(C, t |a, n) p(a |t, n, C) p(t |n, C)p(C|n) p(a|n)

(46)

p(a |C, n) terms, we

(47)

where the last term forms the prior distribution normalized by the probability of the data (i.e., the sequence and its length), and the first term on the right is the likelihood function of the sequence given the alignment, its length, and the choice of core structure. What remains is to decide functional forms for each of these terms.

For p(C|n), Lathrop et al. provide a noninformative prior, p(C|n) 1/L, where L is the size of the library. A slightly more complicated prior could take into account that some folds are more common than others and that some folds are more common in some phyla than others [78]. Given certain rules that constitute a proper threading, t, we can establish how many threadings there can be for a given core structure C and sequence length n of the target. Their model for protein threading consists of core structures that contain only regular secondary structures (just helices and sheet strands, no loops or irregular secondary structure) and a constraint that no gap can occur within a secondary structure unit of the core structure. Under these constraints a finite number of threadings are possible, T(n, C). The noninformative prior for p(t |n, C) is then just 1/T(n, C). A more elaborate prior for p(t |n, C) is described in [18,19] that takes into account likely loop lengths in real proteins and coverage of the core elements by residues from the sequence. Even in the absence of knowledge of the target sequence, some alignments are more likely than others. Finally, for the prior, p(a|n) is a constant and can be ignored because we are looking for relative probabilities of the choice of fold and alignment.

The likelihood function is an expression for p(a |t, n, C), which is the probability of the sequence a (of length n) given a particular alignment t to a fold C. The expression for the likelihood is where most threading algorithms differ from one another. Since this probability can be expressed in terms of a pseudo free energy, p(a |t, n, C) exp[ f(a, t, C)], any energy function that satisfies this equation can be used in the Bayesian analysis described above. The normalization constant required is akin to a partition function, such that

p(a |t, n, C)

exp[ f(a, t, C)]

(48)

i exp[ f(a, ti, C)]

 

 

where the sum is over all possible alignments. The final result for the posterior distribution is

 

exp[ f(a, t, C)]

1

 

 

p(C, t |a, n)

 

 

 

 

(49)

i exp[ f(a, ti, C)]

LT(n, C)p(a |n)

The success of any method using this formalism will be entirely dependent on the function f used and the search strategy. Lathrop et al. [19] use a branch-and-bound algo-