Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.), Computational Biochemistry and Biophysics

338

Dunbrack

rithm to search the space of legal threadings and a function f that does not depend on residue interactions. The details of these aspects are not inherently Bayesian in character and are outside the scope of this review.

IV. APPLICATIONS IN STRUCTURAL BIOLOGY

A. Secondary Structure and Surface Accessibility

A common use of statistics in structural biology is as a tool for deriving predictive distributions of structural parameters based on sequence. The simplest of these are predictions of secondary structure and side-chain surface accessibility. Various algorithms that can learn from data and then make predictions have been used to predict secondary structure and surface accessibility, including ordinary statistics [79], information theory [80], neural networks [81–86], and ‘‘Bayesian’’ methods [87–89]. A disadvantage of some neural network methods is that the parameters of the network sometimes have no physical meaning and are difficult to interpret.

Unfortunately, some authors describing their work as ‘‘Bayesian inference’’ or ‘‘Bayesian statistics’’ have not, in fact, used Bayesian statistics; rather, they used Bayes’ rule to calculate various probabilities of one observed variable conditional upon another. Their work turns out to comprise derivations of informative prior distributions, usually of the form p(θ1, θ2, . . . , θN) = ∏_{i=1}^{N} p(θi), which is interpreted as the posterior distribution for the θi without consideration of the joint distribution of the data variables in the likelihood.

For example, Stolorz et al. [88] derived a Bayesian formalism for secondary structure prediction, although their method does not use Bayesian statistics. They attempt to find an expression for p(µ|seq) = p(seq|µ)p(µ)/p(seq), where µ is the secondary structure at the middle position of seq, a sequence window of prescribed length. As described earlier in Section II, this is a use of Bayes’ rule but is not Bayesian statistics, which depends on the equation p(θ|y) = p(y|θ)p(θ)/p(y), where y is data that connect the parameters in some way to observables. The data are not sequences alone but the combination of sequence and secondary structure that can be culled from the PDB. The parameters we are after are the probabilities of each secondary structure type as a function of the sequence in the sequence window, based on PDB data. The sequence can be thought of as an explanatory variable. That is, we are looking for

p(θα(x), θβ(x), θγ(x) | yα(x), yβ(x), yγ(x)) ∝ p(yα(x), yβ(x), yγ(x) | θα(x), θβ(x), θγ(x)) p(θα(x), θβ(x), θγ(x))     (50)

where, for example, θα(x) is the probability of an α-helix at the middle position of a string of amino acid positions in a protein with sequence x. Because we are dealing with count data, the appropriate functional forms for Eq. (50) are a multinomial likelihood and Dirichlet functions for the prior and posterior distributions. For any window of length N there are 20^N possible sequences, so there are insufficient data in the PDB for these calculations for any N ≥ 5. There are three possible solutions:

1. Reduce the size of the alphabet from 20 amino acids to a smaller alphabet of, say, six (aliphatic, aromatic, charged, polar, glycine, proline).


2. Use very informative priors, where perhaps the prior could be based on the product of individual probabilities for each secondary structure type and amino acid type.

3. Do both of the above.
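Option 1 amounts to a simple encoding step. A minimal sketch follows; the particular assignment of the 20 amino acids to the six classes is an illustrative choice, not necessarily the grouping the author intended.

```python
# Sketch of option 1: reduce the 20-letter amino acid alphabet to six
# classes (aliphatic, aromatic, charged, polar, glycine, proline).
# The class memberships below are an illustrative assumption.
CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic":  "FWYH",
    "charged":   "DEKR",
    "polar":     "STNQ",
    "glycine":   "G",
    "proline":   "P",
}
AA_TO_CLASS = {aa: c for c, aas in CLASSES.items() for aa in aas}

def encode(window: str) -> list:
    """Map a sequence window to its reduced-alphabet classes."""
    return [AA_TO_CLASS[aa] for aa in window.upper()]
```

With a six-letter alphabet, a five-residue window has 6^5 = 7776 possible class strings rather than 20^5 = 3.2 million sequences, so the counts needed for a Dirichlet-multinomial analysis become attainable from PDB-sized data.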

In this case, we are looking for ‘‘counts’’ for each secondary structure type µ for each sequence x, which might be derived from PDB data by

nµ(x) ∝ n ∏_{i=1}^{N} p(µ | ri)     (51)

The constant of proportionality is based on normalizing the probability and establishing the size of the prior, that is, the number of data points that the prior represents. The advantage of the Dirichlet formalism is that it gives values for not only the modes of the probabilities but also the variances, covariances, etc. See Eq. (13).
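The Dirichlet bookkeeping described here, posterior pseudo-counts giving not just modes but variances and covariances, can be sketched numerically. The prior and observed counts below are invented placeholders; the moment formulas are the standard Dirichlet ones referenced as Eq. (13).

```python
import numpy as np

# Hypothetical prior pseudo-counts (helix, sheet, coil) for one sequence
# window, plus invented observed counts n from the PDB.
alpha = np.array([2.0, 1.0, 3.0])   # prior "counts" (invented)
n     = np.array([14.0, 3.0, 8.0])  # observed counts (invented)

post = alpha + n        # Dirichlet posterior parameters
a0 = post.sum()

# Standard Dirichlet moments:
mean = post / a0                                   # posterior means
var  = post * (a0 - post) / (a0**2 * (a0 + 1.0))   # posterior variances
# covariance of two components, e.g. helix and sheet:
cov_hs = -post[0] * post[1] / (a0**2 * (a0 + 1.0))
```

Because the posterior is a full distribution rather than a point estimate, the same arrays immediately yield uncertainty in each secondary structure probability, which is the advantage the text attributes to the Dirichlet formalism.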

Thompson and Goldstein [89] improve on the calculations of Stolorz et al. by including the secondary structure of the entire window rather than just a central position and then summing over all secondary structure segment types with a particular secondary structure at the central position to achieve a prediction for this position. They also use information from multiple sequence alignments of proteins to improve secondary structure prediction. They use Bayes’ rule to formulate expressions for the probability of secondary structures, given a multiple alignment. Their work describes what is essentially a sophisticated prior distribution for θ(X), where X is a matrix of residue counts in a multiple alignment in a window about a central position. The PDB data are used to form this prior, which is used as the predictive distribution. No posterior is calculated via posterior ∝ prior × likelihood.

A similar formalism is used by Thompson and Goldstein [90] to predict residue accessibilities. What they derive would be a very useful prior distribution based on multiplying out independent probabilities to which data could be added to form a Bayesian posterior distribution. The work of Arnold et al. [87] is also not Bayesian statistics but rather the calculation of conditional distributions based on the simple counting argument that p(σ|r) = p(σ, r)/p(r), where σ is some property of interest (secondary structure, accessibility) and r is the amino acid type or some property of the amino acid type (hydrophobicity) or of an amino acid segment (helical moment, etc.).
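The counting argument p(σ|r) = p(σ, r)/p(r) is just ratio-of-counts arithmetic. A minimal sketch with invented (structure, residue) pairs:

```python
from collections import Counter

# Invented toy data: (secondary structure, residue type) pairs,
# standing in for observations culled from the PDB.
pairs = [("H", "A"), ("H", "A"), ("E", "A"), ("H", "G"), ("C", "G"), ("C", "A")]

joint = Counter(pairs)               # counts of (sigma, r)
marg  = Counter(r for _, r in pairs) # counts of r alone

def p_sigma_given_r(sigma: str, r: str) -> float:
    """p(sigma | r) = p(sigma, r) / p(r), estimated by counting."""
    return joint[(sigma, r)] / marg[r]
```

This is exactly the kind of conditional-distribution calculation the text distinguishes from Bayesian statistics: it uses Bayes' rule on observed frequencies but involves no prior or posterior over parameters.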

B. Side-Chain Conformational Analysis

Analysis and prediction of side-chain conformation have long been predicated on statistical analysis of data from protein structures. Early rotamer libraries [91–93] ignored backbone conformation and instead gave the proportions of side-chain rotamers for each of the 18 amino acids with side-chain dihedral degrees of freedom. In recent years, it has become possible to take account of the effect of the backbone conformation on the distribution of side-chain rotamers [28,94–96]. McGregor et al. [94] and Schrauber et al. [97] produced rotamer libraries based on secondary structure. Dunbrack and Karplus [95] instead examined the variation in χ1 rotamer distributions as a function of the backbone dihedrals φ and ψ, later providing conformational analysis to justify this choice [96]. Dunbrack and Cohen [28] extended the analysis of protein side-chain conformation by using Bayesian statistics to derive full backbone-dependent rotamer libraries at all values of φ and ψ. This library has proven useful in side-chain conformation prediction [43], NMR structure determination [98], and protein design [99]. The PDB has grown tremendously in the last 10 years, so whereas Ponder and Richards’ library [93] was based on 19 chains in 1987, we based our most recent backbone-dependent rotamer library [1] on 814 chains from the PDB with resolution of 2.0 Å or better, R-factor less than 20%, and mutual sequence identity less than 50%.

Even though there are substantially more data for side chains than in the 1980s, the data are very uneven. Some residue types (Asp, Lys, Ser) are common, and some (Cys, Trp, Met) are not. The long side chains (Met, Glu, Gln, Lys, Arg) have either 27 or 81 possible rotamers, some of which have been seen very rarely or not at all because of the large steric energies involved. The uneven distribution of residues across the Ramachandran map is also problematic, because for prediction purposes we would like to have probabilities for the different rotamers even in unusual backbone conformations where there may be only a few cases in the PDB or again none at all. Bayesian statistics was the most logical choice for overcoming this unevenness in data, providing a unified framework for adjusting between data-rich situations and data-poor situations where an informative prior distribution could make up for the lack of data.

The parameters we are searching for are the probabilities as well as the average values of the χ angles for the χ1, χ2, χ3, and χ4 rotamers, given values for the explanatory variables φ and ψ. To distinguish between side-chain dihedrals and rotamers, we denote the χ1 rotamer as r1, the χ2 rotamer as r2, etc. For sp3–sp3 dihedrals, there are three well-defined rotameric positions: χ ≈ +60° (g+), χ ≈ 180° (t), and χ ≈ −60° (g−). For sp3–sp2 dihedrals (e.g., in aromatic amino acids, Asn, Asp, Glu, Gln), the situation is more complicated. This work proposed what is essentially a mixture model for side-chain conformations, although it was not explicitly described as such in Ref. [28]: normal distribution functions whose µ’s and σ’s depend on the rotamer type and backbone conformation, as do the mixture coefficients, i.e., the amount of each component in the mixture. We denote the mixture coefficients for each region of the φ, ψ map as follows:

θijkl|ab = p(r1 = i, r2 = j, r3 = k, r4 = l | φa − 10° ≤ φ < φa + 10°, ψb − 10° ≤ ψ < ψb + 10°)     (51a)

We denote the mean χ angles similarly as µijkl(m)|ab, where (m) represents the χ angle (1, 2, 3, or 4) for the ijklth rotamer, and the variances as σ²ijkl(m)|ab. To simplify the model a bit, we set the covariances to 0. The full model is therefore

p(χ1, χ2, χ3, χ4 | φa, ψb) = ∑_{i=1}^{3} ∑_{j=1}^{3} ∑_{k=1}^{3} ∑_{l=1}^{3} θijkl|ab ∏_{m=1}^{4} N(µijkl(m)|ab, σ²ijkl(m)|ab)     (52)

where N signifies a normal distribution (Eq. 14). This is a likelihood function for the χ angles given their explanatory variables φ, ψ, i.e., p(y|θ), where the y are the χ angles in each φ, ψ region and the θ are the mixture coefficients and the mean and variance parameters in the normal distributions. Ordinarily, which component a particular data point (a single side-chain conformation and its backbone conformation) belongs to is left as missing data. However, in our case we know that our χ angles are trimodal with well-spaced modes (near +60°, 180°, and −60°), and we can assign each side chain to a particular rotamer unambiguously. We use Bayesian statistics to determine the parameters in Eq. (52).
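A toy numerical reading of the mixture likelihood in Eq. (52), reduced to two side-chain dihedrals for brevity: the density of a pair of χ angles is a rotamer-weighted sum of products of independent normals. All θ, µ, and σ values below are invented placeholders, not library values.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x (angles in degrees)."""
    return math.exp(-(x - mu)**2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Toy two-dihedral version of Eq. (52): 3 x 3 rotamers (r1, r2).
MODES = {1: 60.0, 2: 180.0, 3: -60.0}               # canonical chi wells
theta = {(i, j): 1.0 / 9.0 for i in (1, 2, 3) for j in (1, 2, 3)}  # weights
mu    = {(i, j): (MODES[i], MODES[j]) for i in (1, 2, 3) for j in (1, 2, 3)}
var   = 15.0**2                                      # common variance (invented)

def density(chi1, chi2):
    """p(chi1, chi2 | phi_a, psi_b): mixture of products of normals."""
    total = 0.0
    for (i, j), w in theta.items():
        m1, m2 = mu[(i, j)]
        total += w * normal_pdf(chi1, m1, var) * normal_pdf(chi2, m2, var)
    return total
```

Because the nine modes are far apart relative to σ, the density is sharply peaked near rotameric positions, which is why each observed side chain can be assigned to one mixture component unambiguously.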

Because each side chain can be identifiably assigned to a particular component, the mixture coefficients and the normal distribution parameters can be determined separately.


For the θijkl|ab we use Dirichlet priors combined with a multinomial likelihood to determine a Dirichlet posterior distribution. The data in this case are the set of counts {nijkl|ab}. We determined these counts from PDB data (lists of values for {φ, ψ, χ1, χ2, χ3, χ4}) by counting side chains in overlapping 20° × 20° square blocks centered on (φa, ψb) spaced 10° apart. The likelihood is therefore of the form

p({nijkl|ab} | {θijkl|ab}) = (nab! / ∏_{i,j,k,l} nijkl|ab!) ∏_{i,j,k,l} (θijkl|ab)^{nijkl|ab}     (53)
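The count collection feeding Eq. (53), side chains binned in overlapping 20° × 20° blocks centered on a 10° grid, can be sketched as follows. The χ1 rotamer assignment here uses a simplified three-well rule, and the 1/2/3 numbering of the wells is an assumption for illustration.

```python
from collections import defaultdict

def chi1_rotamer(chi1):
    """Assign chi1 (degrees, in [-180, 180)) to its nearest sp3 well:
    1 ~ +60 (g+), 2 ~ 180 (t), 3 ~ -60 (g-). Simplified illustrative rule."""
    if 0.0 <= chi1 < 120.0:
        return 1
    if chi1 >= 120.0 or chi1 < -120.0:
        return 2
    return 3

def block_counts(side_chains, phi_a, psi_b, half=10.0):
    """Count chi1 rotamers for side chains whose (phi, psi) falls inside
    the 20 x 20 degree block centered on (phi_a, psi_b).  Wraparound at
    +/-180 degrees is ignored in this sketch."""
    counts = defaultdict(int)
    for phi, psi, chi1 in side_chains:
        if abs(phi - phi_a) <= half and abs(psi - psi_b) <= half:
            counts[chi1_rotamer(chi1)] += 1
    return dict(counts)
```

Note that because blocks are 20° wide but centered only 10° apart, a given side chain contributes to up to four neighboring (φa, ψb) blocks, which smooths the counts across the Ramachandran map.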

The major benefit of using Bayesian statistics for side-chain conformational data is that it accounts for the uneven distribution of data across the Ramachandran map and allows large numbers of parameters to be determined from limited data using carefully chosen informative prior distributions. Prior distributions can come from any source, either previous data or some pooling of data that in the true posterior would be considered distinct. For instance, if we considered the distributions of two different amino acid types likely to be very similar, we might pool their conformations to determine the prior distribution and then determine separate posterior distributions for each type by adding in the data for that type only in the likelihood. We can use a process of building prior distributions by combining posterior distributions for subsets of the parameters. The prior distribution for the {θijkl|ab} is a Dirichlet distribution where the counts, αijkl|ab, are determined as follows:

αijkl|ab = K θ̂ij|a θ̂ij|b θ̂kl|ij     (54)

where K represents the total weight of the prior and the θ̂’s come from the expectation values of the posterior distributions for these parameters. These variables have posterior distributions

p(θij|a) = Dirichlet({nij|a + αij|a})     (55)

where the prior parameters {αij|a} are determined from the data as αij|a = K(ni|a/na)(nj|i/ni), with K some constant. In principle, the fractions in this expression could also be obtained from posterior distributions based on informative prior distributions, but at some point there are sufficient data to work directly from the data to determine the priors. The other factors in Eq. (54) are determined similarly.
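The idea behind Eqs. (54) and (55), building high-dimensional prior pseudo-counts as a product of lower-dimensional posterior expectations scaled by a total weight K, can be sketched with a two-factor toy version (Eq. (54) itself uses three factors). All counts below are invented.

```python
import numpy as np

K = 50.0   # total weight of the prior, i.e., number of pseudo-counts

# Invented marginal counts for a two-rotamer toy problem:
n_i_a = np.array([30.0, 10.0])        # counts of r1 = i in phi,psi bin a
n_j_i = np.array([[20.0, 5.0],        # counts of r2 = j given r1 = i
                  [4.0, 8.0]])

theta_i_a = n_i_a / n_i_a.sum()                       # ~ p(r1 = i | a)
theta_j_i = n_j_i / n_j_i.sum(axis=1, keepdims=True)  # ~ p(r2 = j | r1 = i)

# Prior pseudo-counts alpha_ij|a = K * theta_hat_i|a * theta_hat_j|i
alpha = K * theta_i_a[:, None] * theta_j_i
```

Because each θ̂ factor sums to 1 over its index, the αij|a always total exactly K, so K alone controls how strongly the factored prior competes with the observed counts in the Dirichlet posterior.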

The χ angle distributions are also determined by deriving posterior distributions from data and prior distributions that are themselves determined from posterior distributions and their prior distributions and data. The appropriate expressions are given in Eq. (17). In our previous work we did not put a prior distribution on the variances. This resulted in wildly varying values for the variances, especially for rotamers that are rare in the database. If there are only a few side chains in the count, their variance will be outweighed by the prior estimate if there is enough weight on the prior values in the Inv-χ² distribution (the degrees of freedom, ν0). Also, when there is only one count or no counts, a reasonable value for the variance is determined from the prior.
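The shrinkage just described follows the standard conjugate normal model with a normal-scaled-inverse-χ² prior, which I take to be the Eq. (17) referenced in the text. The update formulas below are the standard ones; the prior and data values are invented numbers representing a rare rotamer with only two scattered observations.

```python
# Conjugate update for a normal mean and variance with a
# normal-scaled-inverse-chi^2 prior (standard formulas; numbers invented).
mu0, kappa0 = 180.0, 5.0       # prior mean for chi and its weight (pseudo-counts)
s0_sq, nu0  = 12.0**2, 5.0     # prior variance and its degrees of freedom

ybar, s_sq, n = 150.0, 40.0**2, 2.0   # few, widely scattered observations

kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n   # posterior mean, shrunk to prior
nu_n = nu0 + n
s_n_sq = (nu0 * s0_sq + (n - 1.0) * s_sq
          + kappa0 * n / kappa_n * (ybar - mu0)**2) / nu_n  # posterior scale
```

With these numbers the wildly large data variance (40°)² is pulled down to roughly (23°)², and the posterior mean lands between the data mean and the prior mean, exactly the behavior the text wants for sparsely observed rotamers.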

As an example of analysis of side-chain dihedral angles, the Bayesian analysis of methionine side-chain dihedrals is given in Table 3 for the r1 = g rotamers. In cases where there is a large amount of data, for example the (3, 3, 3) rotamer, the data and posterior distributions are essentially identical. These are normal distributions with the averages and standard deviations given in the table. But in cases where there are few data, e.g., the (3, 1, n) rotamers, the prior exerts a strong pull on the data to yield a posterior distribution that lies between their mean values.

Table 3 Bayesian Analysis of Methionine Side-Chain Dihedral Angles^a

a Average data dihedrals along with the prior and posterior modes and data, prior, and posterior standard deviations are given for the g rotamer. Rotamers with syn-pentane steric interactions are italicized.

To get a feeling for the distributions involved, in Figure 2 we show the data distribution along with the posterior distribution of the mean and the posterior predictive distributions for χ3 of the (3, 1, 3), (3, 2, 3), and (3, 3, 3) rotamers. The posterior distribution for the values of µijk(3) was generated from a simulation of 1000 draws. This was achieved first by drawing values for the variance from its posterior distribution, a scaled inverse χ² distribution [Eqs. (16) and (17)] derived from the values of the prior variance and the data variance. With each value of σ², a draw was taken from the posterior distribution for the χ3 angle. This distribution is normal with a variance roughly equal to σ²n/κn. For large amounts of data, for example the (3, 3, 3) rotamer, this produces a very precise estimate of the value of µijk(3). For small amounts of data, the spread in µijk(3) is larger. We can also predict future data, denoted ỹ, from the likelihood and the posterior distribution of the parameters. This is called the posterior predictive distribution.
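The simulation just described, 1000 draws of σ² from its scaled inverse-χ² posterior, then µ given σ², then a predictive ỹ, can be sketched as follows. The posterior parameter values are invented placeholders, not the methionine values from Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

# Invented posterior parameters for one rotamer's chi3 angle:
mu_n, kappa_n = 180.0, 20.0    # posterior mean and its weight
nu_n, s_n_sq  = 20.0, 15.0**2  # posterior degrees of freedom and scale

# 1) sigma^2 ~ scaled-Inv-chi^2(nu_n, s_n_sq): nu_n * s_n_sq / chi^2_{nu_n}
sigma_sq = nu_n * s_n_sq / rng.chisquare(nu_n, size=n_draws)

# 2) mu | sigma^2 ~ N(mu_n, sigma^2 / kappa_n)   (the "mu sim" draws)
mu = rng.normal(mu_n, np.sqrt(sigma_sq / kappa_n))

# 3) predictive chi3 ~ N(mu, sigma^2)            (the "data sim" draws)
y_tilde = rng.normal(mu, np.sqrt(sigma_sq))
```

The spread of the µ draws reflects uncertainty in the mean and narrows as κn grows, while the ỹ draws are always wider because they add the full sampling variance σ² on top, matching the contrast between the "mu sim" and "data sim" panels of Figure 2.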


Figure 2 Data distribution and draws from the posterior distribution (mu sim) and posterior predictive distributions (data sim) for methionine side-chain dihedral angles. The results for three rotamers are shown: (r1 = 3, r2 = 1, r3 = 3), (3, 2, 3), and (3, 3, 3). Each simulation consisted of 1000 draws, calculated and graphed with the program S-PLUS [109].