
Bayesian Statistics


p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int_\Theta p(y \mid \theta)\, p(\theta)\, d\theta}        (4)

 

 

where the sum over hypotheses in Eq. (2a) has been replaced with an integral over Θ, the allowed values of the parameter θ. For example, if θ is a proportion, then the integral is from 0 to 1. The prior probability p(θ) is a prior probability distribution, a continuous function of θ that expresses any knowledge we have about likely values of θ before we consider the new data at hand. This probability distribution may be flat over the entire range of θ. Such a prior is referred to as uninformative (as with A, B, and C's equal prior probability of tenure above). Or we may have some prior data, so that we know that θ is more likely to be in a particular range of Θ than outside that range, and the prior probability should then be higher in that range than outside it. This is an informative prior. Under ordinary circumstances the prior should be proper, that is, normalized to integrate to 1 with respect to θ: ∫_Θ p(θ) dθ = 1. The likelihood, by contrast, should be normalized with respect to integration over the data for given θ: ∫ p(y|θ) dy = 1. If y is discrete (count data), then this integral becomes a sum over all possible y. The posterior distribution p(θ|y) is a continuous function of the parameter θ for known data y. It can be used for any inference on the value of θ desired. For instance, the probability that θ is greater than some value θ0 is P(θ > θ0 | y) = ∫_{θ0}^{θmax} p(θ|y) dθ. Ninety-five percent confidence intervals can be calculated easily by choosing θ1 and θ2 such that ∫_{θ1}^{θ2} p(θ|y) dθ = 0.95. This alternative notation for parameters and data is common in the statistical literature, and we will use it throughout the rest of this chapter.
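When no closed form is available, these integrals can be evaluated numerically. The Python sketch below uses a simple grid over Θ to normalize a posterior and read off a tail probability and a central 95% interval; the binomial data values (y = 12 of n = 40), the flat prior, and the cutoff θ0 are hypothetical choices for illustration, not taken from the text.

```python
import numpy as np

# Hypothetical data: y successes out of n trials (illustrative values only).
y, n = 12, 40

# Discretize Theta = [0, 1] and apply Bayes' rule, Eq. (4), on the grid.
theta = np.linspace(0.0, 1.0, 2001)
d_theta = theta[1] - theta[0]
prior = np.ones_like(theta)                      # flat (uninformative) prior
likelihood = theta**y * (1.0 - theta)**(n - y)   # binomial kernel in theta
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * d_theta)    # normalize over Theta

# P(theta > theta0 | y): integrate the posterior above theta0.
theta0 = 0.25
p_greater = posterior[theta > theta0].sum() * d_theta

# Central 95% interval from the posterior cumulative distribution.
cdf = np.cumsum(posterior) * d_theta
lo = theta[np.searchsorted(cdf, 0.025)]
hi = theta[np.searchsorted(cdf, 0.975)]
print(f"P(theta > {theta0}) = {p_greater:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```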

As an example of a continuous variable, we can calculate a posterior distribution for the proportion of red and green balls in a barrel sitting before us. You are asked to bet whether it is the red balls or the green balls that are more plentiful in the barrel. You have $10 in your possession, and you decide to bet in proportion to your certainty in your opinion on the red or green majority. You get a very quick look at the open barrel and estimate that the numbers of red and green balls look approximately equal. As a prior probability you would use a probability distribution that is a bell-shaped curve centered around θ_red = 0.5. (The appropriate mathematical form is a beta distribution, described below.) But your look was very brief, and you do not have a lot of confidence in this prior view. You are then given a sample of balls from the barrel, which consists of 7 red balls and 3 green balls. It is clear you should bet on red, but how much? The likelihood function in this case is a binomial distribution. It gives the probability of n_red and n_green for a sample of N = n_red + n_green balls, given θ_red and θ_green = 1 − θ_red. The results are shown in Figure 1. The dashed line gives the posterior distribution of θ_red, the solid line gives the prior distribution, and the dotted line gives the likelihood of the data. The figure shows that the posterior is a compromise between the prior and the likelihood. We can integrate the posterior distribution to decide the amount of the bet from

\int_{0.5}^{1} p(\theta_{\mathrm{red}} \mid n_{\mathrm{red}} = 7,\ n_{\mathrm{green}} = 3)\, d\theta_{\mathrm{red}} = 0.72

So we bet $7.20 on red.
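Because the prior here is conjugate to the binomial likelihood (see the beta-binomial model described below), the posterior in Figure 1 is simply Beta(27, 23), and the integral above is one line of Python, assuming scipy.stats is available:

```python
from scipy.stats import beta

# Posterior from Fig. 1: prior Beta(20, 20) updated with 7 red and 3 green balls.
posterior = beta(20 + 7, 20 + 3)                      # Beta(27, 23)

# P(theta_red > 0.5 | data): the fraction of the $10 to put on red.
print(f"P(red majority) = {posterior.sf(0.5):.2f}")   # 0.72 -> bet $7.20
```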

C. Frequentist Probability Theory

It should be noted that the Bayesian conception of the probability of a hypothesis and the Bayesian procedure for assessing this probability constitute the original paradigm for probabilistic inference. Both Bayes [32] and Laplace [33] used Bayes' rule to make probability statements about hypotheses given observed data, p(H|D), by inverting the likelihood function, p(D|H) [or, more accurately, determining p(θ|y) by inverting p(y|θ)]. But in the nineteenth and early twentieth centuries the idea that a hypothesis could have a probability associated with it seemed too subjective to practitioners of the new science of statistics. They were also uncomfortable with the notion of prior probabilities, especially when applied to continuous parameters, H = θ. To deal with these problems, they simply removed these elements of statistical reasoning.

Figure 1 (——) Prior, (· · ·) data (likelihood), and (---) posterior distributions for estimating the proportion of red and green balls in a barrel. The prior is based on a sample of 40 balls with 20 of them red, Beta(20, 20). The likelihood is shown for a draw of 10 balls from the barrel, seven of them red. The posterior distribution is Beta(27, 23).

In frequentist statistics, probability is instead the long-run relative frequency with which some event occurs over an infinite number of repeated trials, where the event is a possible outcome of the trial.* A hypothesis or parameter that expresses a state of nature cannot have a probability in frequentist statistics, because after an infinite number of experiments there can be no uncertainty left in the parameter. A hypothesis or parameter value is either always true or always false. Because a hypothesis cannot have a probability, frequentist probability is restricted to inference about data given a single hypothesis or a single value for the parameter. Data can be assigned probabilities because they can have varying values due to experimental error, even under a constant hypothesis.

* The field of statistics as a separate discipline began in the early to mid nineteenth century among German, British, French, and Belgian social reformers referred to as statists, i.e., those who were concerned with numbers related to the state, including crime rates, income distributions, etc. [34] The appeal of a frequency-based interpretation of probability would have been natural in the study of large human populations.

The usual frequentist procedure comprises a number of steps [11]:

1. Define a procedure Π for selecting a hypothesis based on some characteristic(s) of the data, S(y); S is called a statistic. Often the selected hypothesis will be a null hypothesis that is deliberately at odds with S(y). So, for instance, whereas the characteristic of the data might be the sample average ȳ or the variance σ², the hypothesis H might be that the parameter θ (what y is measuring) has a value of 0.* Or if we are trying to show that two samples taken under different conditions are really different, we might define H to be the statement that the sample averages are in fact equal.

2. Because the data y are random, the statistics based on y, S(y), are also random. For all possible data y (usually simulated) that can be predicted from H, calculate p(S(y_sim)|H), the probability distribution of the statistic S on simulated data y_sim given the truth of the hypothesis H. If H is the statement that θ = 0, then y_sim might be generated by averaging samples of size N (a characteristic of the actual data) with variance σ² = σ²(y_actual) (yet another characteristic of the data).

3. Compare the statistic S calculated on the actual data, S(y_actual), to the distribution p(S(y_sim)|H). If ∫_{S(y_actual)}^{∞} p(S(y_sim)|H) dS(y_sim) is very small (< 0.05, for instance), then reject the hypothesis H. If S(y_actual) falls just to the right of 95% of the simulated S(y), then we can conclude that θ > 0 by rejecting the null hypothesis.† (A minimal simulation of these three steps is sketched below.)
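The following Python sketch runs steps 1-3 for a toy data set; the sample size, the "true" mean of 0.4, and the random seed are hypothetical choices for illustration, and the statistic is the sample mean tested against the null hypothesis θ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "actual" data for illustration: 25 observations whose true mean is 0.4.
y_actual = rng.normal(loc=0.4, scale=1.0, size=25)
N = y_actual.size
s = y_actual.std(ddof=1)            # characteristic of the data reused in the simulation

# Step 1: statistic S(y) = sample mean; null hypothesis H: theta = 0.
S_actual = y_actual.mean()

# Step 2: distribution of S on data simulated under H, matching N and the data variance.
S_sim = rng.normal(loc=0.0, scale=s, size=(100_000, N)).mean(axis=1)

# Step 3: fraction of simulated statistics at least as large as the observed one.
p_value = (S_sim >= S_actual).mean()
print(f"S(y_actual) = {S_actual:.3f}, one-sided p = {p_value:.4f}")
```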

If we do this over and over again, we will have done the right thing 95% of the time. Of course, we do not yet know the probability that, say, θ = 5. For this purpose, confidence intervals for θ can be calculated that will contain the true value of θ 95% of the time, given many repetitions of the experiment. But frequentist confidence intervals are actually defined as the range of values for the data average that would arise 95% of the time from a single value of the parameter. That is, for normally distributed data,

\Pr\!\left( \theta - \frac{1.96\,\sigma}{\sqrt{N}} \le \bar{y} \le \theta + \frac{1.96\,\sigma}{\sqrt{N}} \;\middle|\; \theta \right) = 0.95

Bayesian confidence intervals, by contrast, are defined as the interval in which

\Pr\!\left( \bar{y} - \frac{1.96\,\sigma}{\sqrt{N}} \le \theta \le \bar{y} + \frac{1.96\,\sigma}{\sqrt{N}} \;\middle|\; y \right) = 0.95

* Note that the bar above y in ȳ in this section denotes the average of y. A bar over a statement or hypothesis A in the previous section was used to denote not-A. These are the standard notations in statistics and in probability theory, respectively.

†The futility of frequentist testing of a point null hypothesis has been examined at length by Berger and Delampady [35].


The frequentist interval is often interpreted as if it were the Bayesian interval, but it is fundamentally defined by the probability of the data values given the parameter and not the probability of the parameter given the data.

D. Bayesian Methods Are Superior to Frequentist Methods

The Bayesian and frequentist theories can be considered two competing paradigms (in the sense of Kuhn [36]) of what the word ‘‘probability’’ means and how we should make inferential statements about hypotheses given data [37]. It is an unusual situation in the history of science that there should be two competing views on such basic notions, both of which have sizable entrenched camps of adherents, and that the controversy has lasted so long. Bayesian views fell out of favor until the book of Jeffreys in 1939 [38] and the work of Jaynes [30] and Savage [39] in the 1950s. Since then the Bayesian camp has increased tremendously in size. Because of some of the computational difficulties in evaluating posterior distributions, the advent of Markov chain Monte Carlo methods and fast computers has vastly increased the power and flexibility of Bayesian methods [40]. It is impossible to review the controversy in great depth here (see Refs. 9–11 and 37), but I will make some arguments in favor of the Bayesian view for molecular and structural biology.

1. The Bayesian View of Probability Corresponds to Most Scientists’ Thinking

Bayesian inference is a process of taking current views of the probability that competing hypotheses about nature might be true and updating these beliefs in light of new data. It corresponds to the daily experience of scientists and nonscientists alike of judgments made on past experience and present facts. As Good [41] has argued, all animals must be at least informal Bayesians, even non-Bayesian statisticians, because they evaluate possibilities and outcomes of their actions in light of their probabilities for success. One might argue that sane dogs are better Bayesians than humans, given our propensity for foolish and destructive behavior, regardless of prior experience.

Bayesian methods are quite similar to most scientists’ basic intuition about the nature of data and fundamental physical processes. The use of the prior distribution formalizes what we do naturally when we restrict our view of any parameter to a certain range of values or a certain number of outcomes and when we treat outliers with suspicion. If an experiment yields a parameter value that seems absurd compared with previous experiments, we are likely to repeat the experiment rather than publish the absurd result. Implicitly, we set the prior probability of parameter values that are outside our range of interest, unphysical, or unlikely on some other ground to 0. The remaining probability density of the parameters must lie within the range we believe to be at least remotely possible and integrates to 1 within this range. Although basic statistics textbooks spend much time discussing null hypotheses, t-tests, F tests, etc., scientists rarely use this language in assessing data and the values of parameters inferred from the data. We do not usually ask whether the value of some parameter is exactly θ0 or merely whether it is greater than some value θ0. Rather, we want to know the most likely range of values for the parameter, which is inherently a probability distribution over the parameter.

By contrast, the frequentist view is often contrary to common sense and common scientific practice. The classic example of this is the stopping rule problem [42]. If I am trying to decide whether a coin is fair, I can set up an experiment in two different ways. I can throw the coin some fixed number of times N and then record the number of heads. Or I can keep throwing the coin until I observe some preset number of heads, n_heads. Suppose one person throws the coin N = 100 times and observes n_heads = 55, and another person throws the coin until 55 heads are observed and in fact it takes 100 throws. To a frequentist, these are quite different experiments and result in different inferences on the fairness of the coin. This is because the experiment that would be repeated an infinite number of times to obtain p_heads is different in each case (in the first case, a binomial distribution is used to make the inference; in the second case, a negative binomial is used). This seems absurd to most people, and to a Bayesian the probability of a fair coin is the same, because the data are in fact the same in each case.
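This can be checked directly: the binomial and negative binomial likelihoods for these two experiments differ only by a constant factor in θ, so multiplying either by the same prior and normalizing gives the same posterior. A minimal Python sketch using scipy.stats, with a flat Beta(1, 1) prior chosen purely for illustration:

```python
import numpy as np
from scipy.stats import binom, nbinom, beta

theta = np.linspace(0.001, 0.999, 999)      # grid over the coin's heads probability
d_theta = theta[1] - theta[0]

# Experiment 1: fix N = 100 throws, observe 55 heads -> binomial likelihood.
lik_fixed_n = binom.pmf(55, 100, theta)

# Experiment 2: throw until 55 heads are seen; it takes 100 throws, i.e.,
# 45 tails before the 55th head -> negative binomial likelihood.
lik_fixed_heads = nbinom.pmf(45, 55, theta)

# Same flat prior Beta(1, 1) in both cases; normalize each posterior on the grid.
prior = beta.pdf(theta, 1, 1)
post1 = lik_fixed_n * prior
post2 = lik_fixed_heads * prior
post1 /= post1.sum() * d_theta
post2 /= post2.sum() * d_theta

print(np.allclose(post1, post2))            # True: same data, same posterior
```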

The previous example also highlights the controversy of the subjectivity or objectivity of the two competing views of probability. Frequentist probability theory arose because of the apparent subjectivity of Bayesian prior distributions. But it replaced this subjectivity with a set of procedures based on test statistics. Inference is based on the probability of the test statistic, calculated on hypothetical data consistent with the null hypothesis, being as large as or larger than the test statistic calculated on the real data. But the stopping rule example exemplifies that the nature of the repeated experiment, the sampling space, is in itself arbitrary, as is the functional form of the test statistic. It is the emphasis on data not seen that makes frequentist statistics unintuitive and gives statistics much of its reputation of being difficult for beginning students.

2. Bayesian Methods Perform Better

In biology we are often faced with situations that demand inference based on varying amounts of data. For instance, we may try to assign a protein fold to a new protein sequence based on very remote homology to some sequence in the Protein Data Bank (PDB). In some cases we may have many sequences related to our target sequence from the genome projects, and the multiple alignment of sequences can help us establish a homology to the PDB sequence that may in fact be quite remote. In other cases we may have only a small number of homologs in the sequence database, and establishing a remote homology may be quite difficult. Another example arises in side-chain conformational analysis. For some ranges of φ and ψ, many examples of residues exist in the PDB, and reasonable estimates for the three χ1 rotamer populations can be calculated from the data. In other parts of the Ramachandran map, however, the number of residues is very small, yet we would still like good estimates of the three rotamer probabilities for protein folding simulations and comparative modeling [43].

Far from being a disadvantage, the need for a prior distribution can be a distinct advantage. First, it generally forces us to consider the full range of possibilities for the hypothesis. Second, although in the absence of any prior information we can use uninformative priors, in some cases we may choose to use informative priors. Usually the prior corresponds in functional form to some number of data points, and we can choose to weight the prior in accordance with the strength of our belief in the prior information. Often a prior can be obtained by decoupling two or more parameters such that p(θ1, θ2) = p(θ1)p(θ2). The two factors on the right-hand side might be obtained by pooling the data y1 regardless of y2 and vice versa. In any case, the posterior is always a compromise between the prior and the likelihood. If the prior represents a larger sample than the data (our data set is quite small), then the prior will dominate the data. If the data sample is large, then the prior will have little effect on the posterior, which will resemble the likelihood. In cases where there are only a few data points, frequentist methods perform badly by sacrificing good short-term behavior in favor of good long-term behavior.

Another aspect in which Bayesian methods perform better than frequentist methods is in the treatment of nuisance parameters. Quite often there will be more than one parameter in the model but only one of the parameters is of interest. The other parameter is a nuisance parameter. If the parameter of interest is θ and the nuisance parameter is φ, then Bayesian inference on θ alone can be achieved by integrating the posterior distribution over φ. The marginal probability of θ is therefore

p(\theta \mid y) = \int_\Phi p(\theta, \phi \mid y)\, d\phi = \frac{\int_\Phi p(y \mid \theta, \phi)\, p(\theta, \phi)\, d\phi}{\int_{\Theta,\Phi} p(y \mid \theta, \phi)\, p(\theta, \phi)\, d\phi\, d\theta}        (5)

 

 

In frequentist statistics, by contrast, nuisance parameters are usually treated with point estimates, and inference on the parameter of interest is based on calculations with the nuisance parameter as a constant. This can result in large errors, because there may be considerable uncertainty in the value of the nuisance parameter.
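As a concrete sketch of Eq. (5), the Python fragment below treats the mean of normally distributed data as the parameter of interest and the standard deviation as the nuisance parameter, integrating it out on a grid; the data values, grid ranges, and flat priors are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data; the mean theta is of interest, sigma is a nuisance parameter.
y = np.array([1.2, 0.4, 2.1, 1.7, 0.9, 1.5])

# Grids over the parameter of interest and the nuisance parameter
# (flat priors over these finite ranges, for illustration only).
theta = np.linspace(-1.0, 4.0, 401)
sigma = np.linspace(0.05, 5.0, 400)
T, S = np.meshgrid(theta, sigma, indexing="ij")

# Unnormalized joint posterior p(theta, sigma | y) on the grid.
log_lik = norm.logpdf(y[None, None, :], loc=T[..., None], scale=S[..., None]).sum(axis=-1)
joint = np.exp(log_lik - log_lik.max())

d_theta = theta[1] - theta[0]

# Marginal posterior of theta, Eq. (5): integrate the nuisance parameter out.
marginal = joint.sum(axis=1)
marginal /= marginal.sum() * d_theta

# Point-estimate alternative: fix sigma at the sample standard deviation.
plug_in = norm.pdf(y[None, :], loc=theta[:, None], scale=y.std(ddof=1)).prod(axis=1)
plug_in /= plug_in.sum() * d_theta

def posterior_sd(p):
    mean = (p * theta).sum() * d_theta
    return np.sqrt((p * (theta - mean) ** 2).sum() * d_theta)

# The marginalized posterior is typically wider: it carries the uncertainty in sigma.
print(posterior_sd(marginal), posterior_sd(plug_in))
```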

In the next subsection, I describe how the basic elements of Bayesian analysis are formulated mathematically. I also describe the methods for deriving posterior distributions from the model, either in terms of conjugate prior likelihood forms or in terms of simulation using Markov chain Monte Carlo (MCMC) methods. The utility of Bayesian methods has expanded greatly in recent years because of the development of MCMC methods and fast computers. I also describe the basics of hierarchical and mixture models.

E. Setting Up Bayesian Models

Bayesian inference has three major components [44]:

1. Setting up the probability model for the data and parameters of the system under study. This entails defining prior distributions for all relevant parameters and a likelihood function for the data given the parameters.

2. Calculating the posterior distribution for the parameters given existing data. This calculation can sometimes be performed analytically, but in the general case it is performed by simulation.

3. Evaluating the model in terms of how well the model fits the data, including the use of posterior predictive simulations to determine whether data predicted from the posterior distribution resemble the data that generated them and look physically reasonable. Overfitting the data will produce unrealistic posterior predictive distributions.

The complexity of information that can be incorporated into the model gives Bayesian statistics much of its power.
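As an illustration of the third component, the sketch below draws posterior predictive replicates for the red/green ball example (whose Beta(27, 23) posterior is derived in the binomial model section below) and asks whether the observed count looks typical of them; the number of replicates and the random seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(1)

# Posterior for theta_red from the ball example: Beta(27, 23) after 7 red of 10.
theta_draws = beta.rvs(27, 23, size=5000, random_state=rng)

# Posterior predictive: for each plausible theta, simulate a fresh draw of 10 balls.
y_rep = binom.rvs(10, theta_draws, random_state=rng)

# Simple check: how often do the replicates contain at least the observed 7 red balls?
print((y_rep >= 7).mean())   # not an extreme tail probability -> no obvious misfit
```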

We need a mathematical representation of our prior knowledge and a likelihood function to establish a model for any system to be analyzed. The calculation of the posterior distribution can be performed analytically in some cases or by simulation, which I describe later. For an analytical solution it is usually the case that we need prior distribution forms that are conjugate to the likelihood function. If the prior distribution and the likelihood function are conjugate, then by definition the posterior distribution will have the same mathematical form as the prior distribution. A description of the important types follows.

1. Binomial and Multinomial Models

Any data set that consists of discrete classification into outcomes or descriptors is treated with a binomial (two outcomes) or multinomial (three or more outcomes) likelihood function. For example, if we have y successes from n experiments, e.g., y heads from n tosses of a coin or y green balls from a barrel filled with red and green balls in unknown proportions, the likelihood function is a binomial distribution:

p(y \mid \theta) = \mathrm{Bin}(y \mid n, \theta) = \frac{n!}{y!\,(n-y)!}\, \theta^{y} (1 - \theta)^{n-y}        (6)

An informative conjugate prior distribution can be formulated in terms of a beta distribution:

p(\theta) = \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}        (7)

We can think of the beta distribution as the likelihood of α prior successes and β prior failures out of α + β experiments. The Γ functions in front serve as a normalization constant, so that ∫₀¹ p(θ) dθ = 1. Note that for integer x, Γ(x + 1) = x!. The posterior distribution that results from multiplying together the right-hand sides of Eqs. (6) and (7) is also a beta distribution:

p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \mathrm{Beta}(\theta \mid \alpha + y, \beta + n - y) = \frac{\Gamma(n + \alpha + \beta)}{\Gamma(\alpha + y)\,\Gamma(\beta + n - y)}\, \theta^{\alpha + y - 1} (1 - \theta)^{\beta + n - y - 1}        (8)

We can see that the prior and posterior distributions have the same mathematical forms, as is required of conjugate functions. Also, we have an analytical form for the posterior, which is exact under the assumptions made in the model.

It is tempting to use Eq. (7) to derive Eq. (6), because they have similar forms given the relationship of the Γ function to the factorial. But the binomial and the beta distribution are not normalized in the same way. The beta is normalized over the values of θ, whereas the binomial is normalized over the counts, y given n. That is,

\int_{0}^{1} \mathrm{Beta}(\theta \mid \alpha, \beta)\, d\theta = 1        (9a)

\sum_{y=0}^{n} \mathrm{Bin}(y \mid n, \theta) = 1        (9b)

It should be noted that the expected value of θ in a beta distribution is α/(α + β), and the θ with maximum probability (the mode) is (α − 1)/(α + β − 2). In terms of the expectation values, α and β behave as a total of α + β counts, even though the exponents in Eq. (7) are α − 1 and β − 1. The beta distribution is defined such that it is the expectation values, and not the modes, that correspond to counts of α and β.
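The conjugate update of Eq. (8) and these moment formulas are easy to spot-check in Python for the barrel example (prior Beta(20, 20), 7 red out of 10); scipy.stats is used only to confirm the closed-form mean.

```python
from scipy.stats import beta

# Conjugate beta-binomial update, Eq. (8): prior Beta(alpha, beta), y successes of n.
alpha, beta_prior = 20, 20      # prior pseudo-counts (the barrel example of Fig. 1)
y, n = 7, 10                    # 7 red balls in a sample of 10

a_post, b_post = alpha + y, beta_prior + (n - y)        # Beta(27, 23)

mean = a_post / (a_post + b_post)                       # alpha/(alpha + beta)
mode = (a_post - 1) / (a_post + b_post - 2)             # (alpha - 1)/(alpha + beta - 2)

print(a_post, b_post, round(mean, 3), round(mode, 3))   # 27 23 0.54 0.542
print(round(beta(a_post, b_post).mean(), 3))            # matches the closed-form mean
```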


If there are more than two outcomes, we can use the multinomial distribution for the likelihood:

p(y \mid \theta) = \frac{n!}{\prod_i y_i!} \prod_i \theta_i^{y_i}        (10)

where ∑_{i=1}^{m} y_i = n for m possible outcomes of n experiments. For example, if a barrel contains red, blue, and green balls, then m = 3. We might make n = 100 draws from the barrel (returning the balls to the barrel each time) and get 30 green, 50 red, and 20 blue balls. The conjugate prior distribution is the Dirichlet distribution,

 

 

p(\theta) = \mathrm{Dirichlet}(\theta \mid \{x_i\}) = \frac{\Gamma(x_0)}{\prod_i \Gamma(x_i)} \prod_{i=1}^{m} \theta_i^{x_i - 1}        (11)

 

where x0 = ∑_{i=1}^{m} x_i. The Dirichlet distribution can be considered a generalization of the multinomial distribution in which the counts are no longer restricted to be integers. The values of the hyperparameters x_i that define the prior distribution can be thought of as estimated counts for each of the m outcomes in some sample of size x0. The total number of prior counts, x0, can be scaled to any value to alter the dependence of the posterior distribution on the prior distribution. The larger x0 is, the more precise the prior distribution is and the closer the posterior density is to values near θ_i = x_i/x0. The posterior distribution that results from Eqs. (10) and (11) is also Dirichlet, with parameters x_i + y_i, i.e.,

 

p(\theta \mid y) = \frac{\Gamma(x_0 + y_0)}{\prod_i \Gamma(x_i + y_i)} \prod_i \theta_i^{x_i + y_i - 1}        (12)

where y0 = ∑_{i=1}^{m} y_i = n.

 

Again, the use of the conjugate prior distribution results in the analytical form of the posterior distribution and therefore also simple expressions for the expectation values for the θi, their variances, covariances, and modes:

E(\theta_i) = \frac{x_i + y_i}{x_0 + y_0}        (13a)

\mathrm{mode}(\theta_i) = \frac{x_i + y_i - 1}{x_0 + y_0 - m}        (13b)

\mathrm{var}(\theta_i) = \frac{E(\theta_i)\,[1 - E(\theta_i)]}{x_0 + y_0 + 1}        (13c)

\mathrm{cov}(\theta_i, \theta_j) = -\frac{E(\theta_i)\, E(\theta_j)}{x_0 + y_0 + 1}        (13d)

A noninformative prior distribution could be formed by setting each x_i to 1.
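The following Python fragment applies Eqs. (11)-(13) to the barrel example above (50 red, 30 green, and 20 blue balls in 100 draws); the prior pseudo-counts of 10 per color are an arbitrary illustrative choice.

```python
import numpy as np

# Observed multinomial counts: 50 red, 30 green, 20 blue in n = 100 draws.
y = np.array([50, 30, 20])

# Prior pseudo-counts x_i (illustrative choice: 10 per color, so x0 = 30).
x = np.array([10.0, 10.0, 10.0])

x0, y0, m = x.sum(), y.sum(), len(y)
post = x + y                                   # Dirichlet parameters of Eq. (12)

# Posterior moments, Eqs. (13a)-(13d).
E = post / (x0 + y0)                           # expectations
mode = (post - 1) / (x0 + y0 - m)              # modes
var = E * (1 - E) / (x0 + y0 + 1)              # variances
cov_rg = -E[0] * E[1] / (x0 + y0 + 1)          # covariance of red and green fractions

print(E, mode, var, cov_rg)
```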


2. Normal Models

The normal model can take a variety of forms depending on the choice of noninformative or informative prior distributions and on whether the variance is assumed to be a constant or is given its own prior distribution. And of course, the data could represent a single variable or could be multidimensional. Rather than describing each of the possible combinations, I give only the univariate normal case with informative priors on both the mean and the variance. In this case, the likelihood for data y given the values of the parameters that make up θ, namely µ (the mean) and σ² (the variance), is given by the familiar exponential

p(y \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)        (14)

This expression is valid for a single observation y. For multiple observations we use the fact that p(y | µ, σ²) = ∏_i p(y_i | µ, σ²). The result is that the likelihood is again normal in form, with the average value ȳ substituted for y and σ²/n substituted for σ² in Eq. (14). The conjugate prior distribution for Eq. (14) is

p(\mu, \sigma^2) = p(\mu \mid \sigma^2)\, p(\sigma^2) = \frac{1}{\sigma \sqrt{2\pi/\kappa_0}} \exp\!\left( -\frac{\kappa_0 (\mu - \mu_0)^2}{2\sigma^2} \right) \times \frac{(\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}\, \sigma_0^{\nu_0}\, \sigma^{-(\nu_0 + 2)} \exp\!\left( -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right)        (15)

 

 

where the prior for µ is a normal distribution that depends on the value of σ² as well as on two hyperparameters, the mean µ0 and the scale κ0, while the prior for σ² is a scaled inverse χ² distribution [the factor to the right of the multiplication sign in Eq. (15)] with two hyperparameters, the scale σ0 and the degrees of freedom ν0. The posterior distribution that results from Eqs. (14) and (15) has the same form as the prior (because the prior is conjugate to the likelihood), so that [44]

p(\mu, \sigma^2 \mid y) = \frac{1}{\sigma \sqrt{2\pi/\kappa_n}} \exp\!\left( -\frac{\kappa_n (\mu - \mu_n)^2}{2\sigma^2} \right) \times \frac{(\nu_n/2)^{\nu_n/2}}{\Gamma(\nu_n/2)}\, \sigma_n^{\nu_n}\, \sigma^{-(\nu_n + 2)} \exp\!\left( -\frac{\nu_n \sigma_n^2}{2\sigma^2} \right)        (16)

where

\mu_n = \frac{1}{\kappa_n} \left( \kappa_0 \mu_0 + n \bar{y} \right)        (17a)

\kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n        (17b)

\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1) s^2 + \frac{n \kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2        (17c)

s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2        (17d)

Although this is a complicated expression, the results can be given a simple interpretation. The data sample size is n, whereas the prior sample size is κ0, and therefore µ_n is the weighted average of the prior "data" and the actual data. σ_n² is the weighted average of the prior variance (σ0²), the data variance (s²), and a term arising from the difference between the prior mean µ0 and the data average ȳ.
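A small Python sketch of the update in Eqs. (17a)-(17d); the data values and the prior hyperparameters (µ0, κ0, σ0², ν0) are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical data and normal-scaled-inverse-chi^2 prior hyperparameters.
y = np.array([9.8, 10.4, 10.1, 9.5, 10.9, 10.2])
mu0, kappa0 = 10.0, 5.0      # prior mean and prior "sample size" for mu
sigma0_sq, nu0 = 1.0, 4.0    # prior scale and degrees of freedom for sigma^2

n = y.size
ybar = y.mean()
s_sq = y.var(ddof=1)                      # sample variance, Eq. (17d)

# Posterior hyperparameters, Eqs. (17a)-(17c).
kappa_n = kappa0 + n
nu_n = nu0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
              + (n * kappa0 / kappa_n) * (ybar - mu0) ** 2) / nu_n

print(f"mu_n={mu_n:.3f}  kappa_n={kappa_n}  nu_n={nu_n}  sigma_n^2={sigma_n_sq:.3f}")
```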