- Foreword
- Preface
- Contents
- Introduction
- Oren M. Becker
- Alexander D. MacKerell, Jr.
- Masakatsu Watanabe*
- III. SCOPE OF THE BOOK
- IV. TOWARD A NEW ERA
- REFERENCES
- Atomistic Models and Force Fields
- Alexander D. MacKerell, Jr.
- II. POTENTIAL ENERGY FUNCTIONS
- D. Alternatives to the Potential Energy Function
- III. EMPIRICAL FORCE FIELDS
- A. From Potential Energy Functions to Force Fields
- B. Overview of Available Force Fields
- C. Free Energy Force Fields
- D. Applicability of Force Fields
- IV. DEVELOPMENT OF EMPIRICAL FORCE FIELDS
- B. Optimization Procedures Used in Empirical Force Fields
- D. Use of Quantum Mechanical Results as Target Data
- VI. CONCLUSION
- REFERENCES
- Dynamics Methods
- Oren M. Becker
- Masakatsu Watanabe*
- II. TYPES OF MOTIONS
- IV. NEWTONIAN MOLECULAR DYNAMICS
- A. Newton’s Equation of Motion
- C. Molecular Dynamics: Computational Algorithms
- A. Assigning Initial Values
- B. Selecting the Integration Time Step
- C. Stability of Integration
- VI. ANALYSIS OF DYNAMIC TRAJECTORIES
- B. Averages and Fluctuations
- C. Correlation Functions
- D. Potential of Mean Force
- VII. OTHER MD SIMULATION APPROACHES
- A. Stochastic Dynamics
- B. Brownian Dynamics
- VIII. ADVANCED SIMULATION TECHNIQUES
- A. Constrained Dynamics
- C. Other Approaches and Future Direction
- REFERENCES
- Conformational Analysis
- Oren M. Becker
- II. CONFORMATION SAMPLING
- A. High Temperature Molecular Dynamics
- B. Monte Carlo Simulations
- C. Genetic Algorithms
- D. Other Search Methods
- III. CONFORMATION OPTIMIZATION
- A. Minimization
- B. Simulated Annealing
- IV. CONFORMATIONAL ANALYSIS
- A. Similarity Measures
- B. Cluster Analysis
- C. Principal Component Analysis
- REFERENCES
- Thomas A. Darden
- II. CONTINUUM BOUNDARY CONDITIONS
- III. FINITE BOUNDARY CONDITIONS
- IV. PERIODIC BOUNDARY CONDITIONS
- REFERENCES
- Internal Coordinate Simulation Method
- Alexey K. Mazur
- II. INTERNAL AND CARTESIAN COORDINATES
- III. PRINCIPLES OF MODELING WITH INTERNAL COORDINATES
- B. Energy Gradients
- IV. INTERNAL COORDINATE MOLECULAR DYNAMICS
- A. Main Problems and Historical Perspective
- B. Dynamics of Molecular Trees
- C. Simulation of Flexible Rings
- A. Time Step Limitations
- B. Standard Geometry Versus Unconstrained Simulations
- VI. CONCLUDING REMARKS
- REFERENCES
- Implicit Solvent Models
- II. BASIC FORMULATION OF IMPLICIT SOLVENT
- A. The Potential of Mean Force
- III. DECOMPOSITION OF THE FREE ENERGY
- A. Nonpolar Free Energy Contribution
- B. Electrostatic Free Energy Contribution
- IV. CLASSICAL CONTINUUM ELECTROSTATICS
- A. The Poisson Equation for Macroscopic Media
- B. Electrostatic Forces and Analytic Gradients
- C. Treatment of Ionic Strength
- A. Statistical Mechanical Integral Equations
- VI. SUMMARY
- REFERENCES
- Steven Hayward
- II. NORMAL MODE ANALYSIS IN CARTESIAN COORDINATE SPACE
- B. Normal Mode Analysis in Dihedral Angle Space
- C. Approximate Methods
- IV. NORMAL MODE REFINEMENT
- C. Validity of the Concept of a Normal Mode Important Subspace
- A. The Solvent Effect
- B. Anharmonicity and Normal Mode Analysis
- VI. CONCLUSIONS
- ACKNOWLEDGMENT
- REFERENCES
- Free Energy Calculations
- Thomas Simonson
- II. GENERAL BACKGROUND
- A. Thermodynamic Cycles for Solvation and Binding
- B. Thermodynamic Perturbation Theory
- D. Other Thermodynamic Functions
- E. Free Energy Component Analysis
- III. STANDARD BINDING FREE ENERGIES
- IV. CONFORMATIONAL FREE ENERGIES
- A. Conformational Restraints or Umbrella Sampling
- B. Weighted Histogram Analysis Method
- C. Conformational Constraints
- A. Dielectric Reaction Field Approaches
- B. Lattice Summation Methods
- VI. IMPROVING SAMPLING
- A. Multisubstate Approaches
- B. Umbrella Sampling
- C. Moving Along
- VII. PERSPECTIVES
- REFERENCES
- John E. Straub
- B. Phenomenological Rate Equations
- II. TRANSITION STATE THEORY
- A. Building the TST Rate Constant
- B. Some Details
- C. Computing the TST Rate Constant
- III. CORRECTIONS TO TRANSITION STATE THEORY
- A. Computing Using the Reactive Flux Method
- B. How Dynamic Recrossings Lower the Rate Constant
- IV. FINDING GOOD REACTION COORDINATES
- A. Variational Methods for Computing Reaction Paths
- B. Choice of a Differential Cost Function
- C. Diffusional Paths
- VI. HOW TO CONSTRUCT A REACTION PATH
- A. The Use of Constraints and Restraints
- B. Variationally Optimizing the Cost Function
- VII. FOCAL METHODS FOR REFINING TRANSITION STATES
- VIII. HEURISTIC METHODS
- IX. SUMMARY
- ACKNOWLEDGMENT
- REFERENCES
- Paul D. Lyne
- Owen A. Walsh
- II. BACKGROUND
- III. APPLICATIONS
- A. Triosephosphate Isomerase
- B. Bovine Protein Tyrosine Phosphate
- C. Citrate Synthase
- IV. CONCLUSIONS
- ACKNOWLEDGMENT
- REFERENCES
- Jeremy C. Smith
- III. SCATTERING BY CRYSTALS
- IV. NEUTRON SCATTERING
- A. Coherent Inelastic Neutron Scattering
- B. Incoherent Neutron Scattering
- REFERENCES
- Michael Nilges
- II. EXPERIMENTAL DATA
- A. Deriving Conformational Restraints from NMR Data
- B. Distance Restraints
- C. The Hybrid Energy Approach
- III. MINIMIZATION PROCEDURES
- A. Metric Matrix Distance Geometry
- B. Molecular Dynamics Simulated Annealing
- C. Folding Random Structures by Simulated Annealing
- IV. AUTOMATED INTERPRETATION OF NOE SPECTRA
- B. Automated Assignment of Ambiguities in the NOE Data
- C. Iterative Explicit NOE Assignment
- D. Symmetrical Oligomers
- VI. INFLUENCE OF INTERNAL DYNAMICS ON THE EXPERIMENTAL DATA
- VII. STRUCTURE QUALITY AND ENERGY PARAMETERS
- VIII. RECENT APPLICATIONS
- REFERENCES
- II. STEPS IN COMPARATIVE MODELING
- C. Model Building
- D. Loop Modeling
- E. Side Chain Modeling
- III. AB INITIO PROTEIN STRUCTURE MODELING METHODS
- IV. ERRORS IN COMPARATIVE MODELS
- VI. APPLICATIONS OF COMPARATIVE MODELING
- VII. COMPARATIVE MODELING IN STRUCTURAL GENOMICS
- VIII. CONCLUSION
- ACKNOWLEDGMENTS
- REFERENCES
- Roland L. Dunbrack, Jr.
- II. BAYESIAN STATISTICS
- A. Bayesian Probability Theory
- B. Bayesian Parameter Estimation
- C. Frequentist Probability Theory
- D. Bayesian Methods Are Superior to Frequentist Methods
- F. Simulation via Markov Chain Monte Carlo Methods
- III. APPLICATIONS IN MOLECULAR BIOLOGY
- B. Bayesian Sequence Alignment
- IV. APPLICATIONS IN STRUCTURAL BIOLOGY
- A. Secondary Structure and Surface Accessibility
- ACKNOWLEDGMENTS
- REFERENCES
- Computer Aided Drug Design
- Alexander Tropsha and Weifan Zheng
- IV. SUMMARY AND CONCLUSIONS
- REFERENCES
- Oren M. Becker
- II. SIMPLE MODELS
- III. LATTICE MODELS
- B. Mapping Atomistic Energy Landscapes
- C. Mapping Atomistic Free Energy Landscapes
- VI. SUMMARY
- REFERENCES
- Toshiko Ichiye
- II. ELECTRON TRANSFER PROPERTIES
- B. Potential Energy Parameters
- IV. REDOX POTENTIALS
- A. Calculation of the Energy Change of the Redox Site
- B. Calculation of the Energy Changes of the Protein
- B. Calculation of Differences in the Energy Change of the Protein
- VI. ELECTRON TRANSFER RATES
- A. Theory
- B. Application
- REFERENCES
- Fumio Hirata and Hirofumi Sato
- Shigeki Kato
- A. Continuum Model
- B. Simulations
- C. Reference Interaction Site Model
- A. Molecular Polarization in Neat Water*
- B. Autoionization of Water*
- C. Solvatochromism*
- F. Tautomerization in Formamide*
- IV. SUMMARY AND PROSPECTS
- ACKNOWLEDGMENTS
- REFERENCES
- Nucleic Acid Simulations
- Alexander D. MacKerell, Jr.
- Lennart Nilsson
- D. DNA Phase Transitions
- III. METHODOLOGICAL CONSIDERATIONS
- A. Atomistic Models
- B. Alternative Models
- IV. PRACTICAL CONSIDERATIONS
- A. Starting Structures
- C. Production MD Simulation
- D. Convergence of MD Simulations
- WEB SITES OF INTEREST
- REFERENCES
- Membrane Simulations
- Douglas J. Tobias
- II. MOLECULAR DYNAMICS SIMULATIONS OF MEMBRANES
- B. Force Fields
- C. Ensembles
- D. Time Scales
- III. LIPID BILAYER STRUCTURE
- A. Overall Bilayer Structure
- C. Solvation of the Lipid Polar Groups
- IV. MOLECULAR DYNAMICS IN MEMBRANES
- A. Overview of Dynamic Processes in Membranes
- B. Qualitative Picture on the 100 ps Time Scale
- C. Incoherent Neutron Scattering Measurements of Lipid Dynamics
- F. Hydrocarbon Chain Dynamics
- ACKNOWLEDGMENTS
- REFERENCES
- Appendix: Useful Internet Resources
- B. Molecular Modeling and Simulation Packages
- Index
Bayesian Statistics | 317
p(θ|y) = p(y|θ)p(θ) / ∫_Θ p(y|θ)p(θ) dθ    (4)
where the sum over hypotheses in Eq. (2a) has been replaced with an integral over Θ, the allowed values of the parameter θ. For example, if θ is a proportion, then the integral is from 0 to 1. The prior probability p(θ) is a prior probability distribution, a continuous function of θ that expresses any knowledge we have about likely values for θ before we consider the new data at hand. This probability distribution may be flat over the entire range of θ. Such a prior is referred to as uninformative (as with A, B, and C’s equal prior probability of tenure above). Or we may have some prior data so that we know that θ is more likely to be in a particular range of Θ than outside that range, so the probability should be higher in that range than outside it. This is an informative prior. Under ordinary circumstances, the prior should be proper, that is, normalized to integrate to 1 with respect to θ: ∫_Θ p(θ) dθ = 1. The likelihood, by contrast, should be normalized with respect to integration over the data for given θ: ∫_y p(y|θ) dy = 1. If y is discrete (count data), then this is a sum over all possible y. The posterior distribution p(θ|y) is a continuous function of the parameter θ for known data y. It can be used for any inference on the value of θ desired. For instance, the probability that θ is greater than some value θ0 is P(θ > θ0|y) = ∫_{θ0}^{θmax} p(θ|y) dθ. Ninety-five percent confidence intervals can be calculated easily by choosing θ1 and θ2 such that ∫_{θ1}^{θ2} p(θ|y) dθ = 0.95. This alternative notation for parameters and data is common in the statistical literature, and we will use it throughout the rest of this chapter.
As an example of a continuous variable, we can calculate a posterior distribution for the proportion of red and green balls in a barrel sitting before us. You are asked to bet whether it is the red balls or the green balls that are more plentiful in the barrel. You have $10 in your possession, and you decide to bet in proportion to your certainty in your opinion on the red or green majority. You get a very quick look at the open barrel and estimate that the numbers of red and green balls look approximately equal. As a prior probability you would use a probability distribution that is a bell-shaped curve centered at θred = 0.5. (The appropriate mathematical form is a beta distribution, described below.) But your look was very brief, and you do not have a lot of confidence in this prior view. You are given a sample of balls from the barrel, which consists of 7 red balls and 3 green balls. It is clear you should bet on red, but how much? The likelihood function in this case is a binomial distribution. It gives the probability of nred and ngreen for a sample of N = nred + ngreen balls, given θred and θgreen = 1 − θred. The results are shown in Figure 1. The dashed line gives the posterior distribution of θred, the solid line gives the prior distribution, and the dotted line gives the likelihood of the data. The figure shows that the posterior is a compromise between the prior and the likelihood. We can integrate the posterior distribution to decide the amount of the bet from
∫_{0.5}^{1} p(θred|nred = 7, ngreen = 3) dθred = 0.72
So we bet $7.20 on red.
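This integral is easy to reproduce numerically. A minimal sketch in Python (standard library only): the Beta(20,20) prior and the 7-red, 3-green sample are those of Figure 1, so the posterior is Beta(27,23); the grid size is an arbitrary choice.

```python
# Posterior for the barrel example: a Beta(20,20) prior updated with
# 7 red and 3 green balls gives Beta(27,23).  We integrate the
# unnormalized density from 0.5 to 1 by the trapezoid rule.
a, b = 20 + 7, 20 + 3
n = 100_000
h = 1.0 / n
dens = [(i * h) ** (a - 1) * (1 - i * h) ** (b - 1) for i in range(n + 1)]
total = h * (sum(dens) - 0.5 * (dens[0] + dens[-1]))   # approximates B(27, 23)
tail = dens[n // 2:]                                   # theta from 0.5 to 1
p_red = h * (sum(tail) - 0.5 * (tail[0] + tail[-1])) / total
print(round(p_red, 2))
```

The ratio of the two trapezoid sums is the posterior probability that red balls are in the majority, about 0.72, matching the bet above.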
C. Frequentist Probability Theory
It should be noted that the Bayesian conception of the probability of a hypothesis and the Bayesian procedure for assessing this probability constitute the original paradigm for probabilistic inference. Both Bayes [32] and Laplace [33] used Bayes’ rule to make probability statements about hypotheses given observed data, p(H|D), by inverting the likelihood function, p(D|H) [or, more accurately, determining p(θ|y) by inverting p(y|θ)]. But in the nineteenth and early twentieth centuries the idea that a hypothesis could have a probability associated with it seemed too subjective to practitioners of the new science of statistics. They were also uncomfortable with the notion of prior probabilities, especially when applied to continuous parameters, H = θ. To deal with these problems, they simply removed these elements of statistical reasoning.

Figure 1 (——) Prior, (· · ·) likelihood of the data, and (---) posterior distributions for estimating the proportion of red and green balls in a barrel. The prior is based on a sample of 40 balls with 20 of them red, Beta(20,20). The likelihood is shown for a draw of 10 balls from the barrel, seven of them red. The posterior distribution is Beta(27,23).
In frequentist statistics, probability is instead a long-run relative occurrence of some event out of an infinite number of repeated trials, where the event is a possible outcome of the trial.* A hypothesis or parameter that expresses a state of nature cannot have a probability in frequentist statistics, because after an infinite number of experiments there can be no uncertainty left in the parameter. A hypothesis or parameter value is either always true or always false. Because a hypothesis cannot have a probability, frequentist probability is restricted to inference about data given a single hypothesis or a single value for the parameter. Data can be assigned probabilities because they can have varying values due to experimental error, even under a constant hypothesis.

* The field of statistics as a separate discipline began in the early to mid nineteenth century among German, British, French, and Belgian social reformers, referred to as statists: i.e., those who were concerned with numbers related to the state, including crime rates, income distributions, etc. [34] The appeal of a frequency-based interpretation of probability would have been natural in the study of large human populations.
The usual frequentist procedure comprises a number of steps [11]:
1. Define a procedure Π for selecting a hypothesis based on some characteristic(s) of the data, S(y). S is called a statistic. Often this will be a null hypothesis that is deliberately at odds with S(y). So, for instance, whereas the characteristic of the data might be the sample average ȳ or the variance σ², the hypothesis H might be that the parameter θ (what y is measuring) has a value of 0.* Or if we are trying to show that two samples taken under different conditions are really different, we might define H to be the statement that the sample averages are in fact equal.

2. Because the data y are random, statistics based on y, S(y), are also random. For all possible data y (usually simulated) that can be predicted from H, calculate p(S(ysim)|H), the probability distribution of the statistic S on simulated data ysim given the truth of the hypothesis H. If H is the statement that θ = 0, then ysim might be generated by averaging samples of size N (a characteristic of the actual data) with variance σ² = σ²(yactual) (yet another characteristic of the data).

3. Compare the statistic S calculated on the actual data, S(yactual), to the distribution p(S(ysim)|H). If ∫_{S(yactual)}^{∞} p(S(ysim)|H) dS(ysim) is very small (< 0.05, for instance), then reject the hypothesis H. If S(yactual) falls just to the right of 95% of the simulated S(ysim), then we can conclude that θ ≠ 0 by rejecting the null hypothesis.†
If we do this over and over again, we will have done the right thing 95% of the time. Of course, we do not yet know the probability that, say, θ = 5. For this purpose, confidence intervals for θ can be calculated that will contain the true value of θ 95% of the time, given many repetitions of the experiment. But frequentist confidence intervals are actually defined as the range of values for the data average that would arise 95% of the time from a single value of the parameter. That is, for normally distributed data,
Pr(θ − 1.96σ/√N ≤ ȳ ≤ θ + 1.96σ/√N) = 0.95
Bayesian confidence intervals, by contrast, are defined as the interval in which

Pr(ȳ − 1.96σ/√N ≤ θ ≤ ȳ + 1.96σ/√N) = 0.95
* Note that the bar above y in ȳ in this section denotes the average of y. A bar over a statement or hypothesis A in the previous section was used to denote not-A. Both of these are standard notations in statistics and probability theory, respectively.
†The futility of frequentist testing of a point null hypothesis has been examined at length by Berger and Delampady [35].
The frequentist interval is often interpreted as if it were the Bayesian interval, but it is fundamentally defined by the probability of the data values given the parameter and not the probability of the parameter given the data.
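The operational content of the frequentist interval can be illustrated by simulation; this is a sketch, not from the chapter, assuming normal data with known σ, and the values of θ, σ, N, and the seed are arbitrary choices:

```python
import random

# Frequentist coverage: for a single fixed parameter theta, the interval
# ybar +/- 1.96*sigma/sqrt(N), recomputed from each simulated data set,
# contains theta in about 95% of repetitions.  No probability is ever
# assigned to theta itself; only ybar varies from trial to trial.
random.seed(1)
theta, sigma, N, trials = 5.0, 2.0, 25, 20000
half = 1.96 * sigma / N ** 0.5
hits = 0
for _ in range(trials):
    ybar = sum(random.gauss(theta, sigma) for _ in range(N)) / N
    if ybar - half <= theta <= ybar + half:
        hits += 1
coverage = hits / trials
print(round(coverage, 3))   # close to 0.95
```

The 95% is a property of the procedure over repeated data sets, which is exactly the distinction drawn in the text.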
D. Bayesian Methods Are Superior to Frequentist Methods
The Bayesian and frequentist theories can be considered two competing paradigms (in the sense of Kuhn [36]) of what the word ‘‘probability’’ means and how we should make inferential statements about hypotheses given data [37]. It is an unusual situation in the history of science that there should be two competing views on such basic notions, both of which have sizable entrenched camps of adherents, and that the controversy has lasted so long. Bayesian views fell out of favor until the book of Jeffreys in 1939 [38] and the work of Jaynes [30] and Savage [39] in the 1950s. Since then the Bayesian camp has grown tremendously. The advent of Markov chain Monte Carlo methods and fast computers has removed many of the computational difficulties in evaluating posterior distributions and has vastly increased the power and flexibility of Bayesian methods [40]. It is impossible to review the controversy in great depth here (see Refs. 9–11 and 37), but I will make some arguments in favor of the Bayesian view for molecular and structural biology.
1. The Bayesian View of Probability Corresponds to Most Scientists’ Thinking
Bayesian inference is a process of taking current views of the probability that competing hypotheses about nature might be true and updating these beliefs in light of new data. It corresponds to the daily experience of scientists and nonscientists alike of judgments made on past experience and present facts. As Good [41] has argued, all animals must be at least informal Bayesians, even non-Bayesian statisticians, because they evaluate possibilities and outcomes of their actions in light of their probabilities for success. One might argue that sane dogs are better Bayesians than humans, given our propensity for foolish and destructive behavior, regardless of prior experience.
Bayesian methods are quite similar to most scientists’ basic intuition about the nature of data and fundamental physical processes. The use of the prior distribution formalizes what we do naturally when we restrict our view of any parameter to a certain range of values or a certain number of outcomes and when we treat outliers with suspicion. If an experiment yields a parameter value that seems absurd compared to previous experiments, we are likely to repeat the experiment rather than publish the absurd result. Implicitly, we set the prior probability of parameter values that are outside our range of interest, or that are unphysical or unlikely on some other ground, to 0. The remaining probability density of the parameters must lie within the range we believe to be at least remotely possible and integrates to 1 within this range. Although basic statistics textbooks spend much time discussing null hypotheses, t-tests, F tests, etc., scientists rarely use this language in assessing data and inferred values of parameters based on the data. We do not usually ask whether the value of some parameter is exactly θ0, or simply whether it is greater than some value θ0. Rather, we want to know the most likely range of values for the parameter, which is inherently a probability distribution over the parameter.
By contrast, the frequentist view is often contrary to common sense and common scientific practice. The classic example of this is the stopping rule problem [42]. If I am trying to decide whether a coin is fair, I can set up an experiment in two different ways. I can throw the coin some number of times N and then record the number of heads. Or I can keep throwing the coin until I observe some preset number of heads, nheads. Suppose one person throws the coin N = 100 times and observes nheads = 55, and another person throws the coin until 55 heads are observed, which in fact takes 100 throws. To a frequentist, these are quite different experiments and result in different inferences on the fairness of the coin. This is because the experiment that would be repeated an infinite number of times to obtain pheads is different in each case (in the first case, a binomial distribution is used to make the inference; in the second case, a negative binomial). This seems absurd to most people, and to a Bayesian the probability of a fair coin is the same in both setups, because the data are in fact the same in each case.
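The equivalence of the two stopping rules for Bayesian inference can be verified directly: the binomial and negative binomial likelihoods differ only by a θ-independent constant, so with the same prior they yield identical posteriors. A sketch with a flat prior on a grid (the grid resolution is an arbitrary choice):

```python
from math import comb

# Binomial (N = 100 throws fixed, 55 heads observed) versus negative
# binomial (throw until the 55th head, which happens to take 100 throws):
# the two likelihoods differ only by a theta-independent constant, so a
# common (flat) prior yields identical normalized posteriors.
n, k = 100, 55
grid = [i / 1000 for i in range(1, 1000)]
binom = [comb(n, k) * t ** k * (1 - t) ** (n - k) for t in grid]
negbin = [comb(n - 1, k - 1) * t ** k * (1 - t) ** (n - k) for t in grid]
Zb, Zn = sum(binom), sum(negbin)
post_b = [v / Zb for v in binom]     # flat prior: posterior proportional to likelihood
post_n = [v / Zn for v in negbin]
diff = max(abs(pb - pn) for pb, pn in zip(post_b, post_n))
print(diff)   # effectively zero
```

The constant factors [the binomial coefficient C(100, 55) versus C(99, 54)] cancel on normalization, which is the Bayesian point: only the θ-dependence of the likelihood matters.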
The previous example also highlights the controversy over the subjectivity or objectivity of the two competing views of probability. Frequentist probability theory arose because of the apparent subjectivity of Bayesian prior distributions, but it replaced this subjectivity with a set of procedures based on test statistics. Inference is based on the probability that the test statistic, calculated on hypothetical data consistent with the null hypothesis, is as large as or larger than the test statistic calculated on the real data. But the stopping rule example shows that the nature of the repeated experiment, the sampling space, is itself arbitrary, as is the functional form of the test statistic. It is this emphasis on data never actually observed that makes frequentist statistics unintuitive and gives statistics much of its reputation of being difficult for beginning students.
2. Bayesian Methods Perform Better
In biology we are often faced with a set of situations that demand inference based on varying amounts of data. For instance, we may try to assign a protein fold to a new protein sequence based on very remote homology to some sequence in the Protein Data Bank (PDB). In some cases, we may have many sequences related to our target sequence from the genome projects, and the multiple alignment of sequences can help us establish a homology to the PDB sequence that may in fact be quite remote. In other cases, we may have only a small number of homologs in the sequence database, and establishing a remote homology may be quite difficult. Another example arises in side-chain conformational analysis. For some ranges of φ and ψ, many examples of residues exist in the PDB, and reasonable estimates for the three χ1 rotamer populations can be calculated from the data. But the number of residues in some parts of the Ramachandran map is very small, and we would still like good estimates of the three rotamer probabilities for protein folding simulations and comparative modeling [43].
Far from being a disadvantage, the need for a prior distribution can be a distinct advantage. First, it generally forces us to consider the full range of the possibilities for the hypothesis. Second, although in the absence of any prior information we can use uninformative priors, in some cases we may choose to use informative priors. Usually the prior corresponds in functional form to some number of data points, and we can choose to weight the prior in accordance with the strength of our belief in the prior information. Often a prior can be obtained by decoupling two or more parameters such that p(θ1, θ2) = p(θ1)p(θ2). The two factors on the right-hand side might be obtained by pooling the data of y1 regardless of y2 and vice versa. In any case, the posterior is always a compromise between the prior and the likelihood. If the prior represents a larger sample than the data (our data set is quite small), then the prior will dominate the data. If the data sample is large, then the prior will have little effect on the posterior, which will resemble the likelihood. In cases where there are only a few data points, frequentist methods perform badly by sacrificing good short-term behavior in favor of good long-term behavior.
Another aspect in which Bayesian methods perform better than frequentist methods is in the treatment of nuisance parameters. Quite often there will be more than one parameter in the model but only one of the parameters is of interest. The other parameter is a nuisance parameter. If the parameter of interest is θ and the nuisance parameter is φ, then Bayesian inference on θ alone can be achieved by integrating the posterior distribution over φ. The marginal probability of θ is therefore
p(θ|y) = ∫_Φ p(θ, φ|y) dφ = ∫_Φ p(y|θ, φ)p(θ, φ) dφ / ∫_{Θ,Φ} p(y|θ, φ)p(θ, φ) dφ dθ    (5)
In frequentist statistics, by contrast, nuisance parameters are usually treated with point estimates, and inference on the parameter of interest is based on calculations with the nuisance parameter as a constant. This can result in large errors, because there may be considerable uncertainty in the value of the nuisance parameter.
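Equation (5) can be illustrated on a grid for a toy model. Everything here (the sample, the grids, the flat priors) is invented for illustration, not taken from the chapter: the parameter of interest θ is the mean of normal data, and the nuisance parameter φ is the unknown standard deviation.

```python
from math import exp, pi, sqrt

# Marginalizing a nuisance parameter, Eq. (5), by brute-force summation,
# versus fixing it at a point estimate as in common frequentist practice.
y = [4.1, 5.3, 4.7, 6.0, 3.9, 5.5]              # made-up observations
thetas = [3.0 + 0.01 * i for i in range(401)]   # grid for the mean theta
phis = [0.2 + 0.05 * i for i in range(60)]      # grid for the s.d. phi

def like(theta, phi):
    p = 1.0
    for yi in y:
        p *= exp(-0.5 * ((yi - theta) / phi) ** 2) / (phi * sqrt(2 * pi))
    return p

# Marginal posterior of theta (flat priors): integrate out phi.
marg = [sum(like(t, p) for p in phis) for t in thetas]
Z = sum(marg)
marg = [m / Z for m in marg]

# Plug-in alternative: fix phi at the sample standard deviation.
ybar = sum(y) / len(y)
s = sqrt(sum((yi - ybar) ** 2 for yi in y) / (len(y) - 1))
plug = [like(t, s) for t in thetas]
Zp = sum(plug)
plug = [v / Zp for v in plug]

def post_sd(w):
    m = sum(wi * t for wi, t in zip(w, thetas))
    return sqrt(sum(wi * (t - m) ** 2 for wi, t in zip(w, thetas)))

print(post_sd(marg), post_sd(plug))   # marginalized posterior is wider
```

The marginalized posterior has a larger spread than the plug-in version, which is precisely the uncertainty in the nuisance parameter that point estimation throws away.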
In the next subsection, I describe how the basic elements of Bayesian analysis are formulated mathematically. I also describe the methods for deriving posterior distributions from the model, either in terms of conjugate prior–likelihood forms or in terms of simulation using Markov chain Monte Carlo (MCMC) methods. The utility of Bayesian methods has expanded greatly in recent years because of the development of MCMC methods and fast computers. Finally, I describe the basics of hierarchical and mixture models.
E. Setting Up Bayesian Models
Bayesian inference has three major components [44]:
1. Setting up the probability model for the data and parameters of the system under study. This entails defining prior distributions for all relevant parameters and a likelihood function for the data given the parameters.

2. Calculating the posterior distribution for the parameters given existing data. This calculation can sometimes be performed analytically, but in the general case it is performed by simulation.

3. Evaluating the model in terms of how well the model fits the data, including the use of posterior predictive simulations to determine whether data predicted from the posterior distribution resemble the data that generated them and look physically reasonable. Overfitting the data will produce unrealistic posterior predictive distributions.
The complexity of information that can be incorporated into the model gives Bayesian statistics much of its power.
We need a mathematical representation of our prior knowledge and a likelihood function to establish a model for any system to be analyzed. The calculation of the posterior distribution can be performed analytically in some cases or by simulation, which I describe later. For an analytical solution, it is usually the case that we need a prior distribution form that is conjugate to the likelihood function. If the prior distribution and the likelihood function are conjugate, then by definition the posterior distribution will have the same mathematical form as the prior distribution. A description of the important types follows.
1. Binomial and Multinomial Models
Any data set that consists of discrete classification into outcomes or descriptors is treated with a binomial (two outcomes) or multinomial (three or more outcomes) likelihood function. For example, if we have y successes from n experiments, e.g., y heads from n tosses of a coin or y green balls from a barrel filled with red and green balls in unknown proportions, the likelihood function is a binomial distribution:
p(y|θ) = Bin(y|n, θ) = [n!/(y!(n − y)!)] θ^y (1 − θ)^{n−y}    (6)
An informative conjugate prior distribution can be formulated in terms of a beta distribution:
p(θ) = Beta(θ|α, β) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}    (7)
We can think of the beta distribution as the likelihood of α prior successes and β prior failures out of α + β experiments. The Γ functions in front serve as a normalization constant, so that ∫_0^1 p(θ) dθ = 1. Note that for integer x, Γ(x + 1) = x!. The posterior distribution that results from multiplying together the right-hand sides of Eqs. (6) and (7) is also a beta distribution:
p(θ|y) = p(y|θ)p(θ)/p(y) = Beta(θ|α + y, β + n − y)
= [Γ(n + α + β)/(Γ(α + y)Γ(β + n − y))] θ^{α+y−1} (1 − θ)^{β+n−y−1}    (8)
We can see that the prior and posterior distributions have the same mathematical forms, as is required of conjugate functions. Also, we have an analytical form for the posterior, which is exact under the assumptions made in the model.
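This conjugate update is easy to check numerically: multiplying the binomial likelihood, Eq. (6), by a beta prior, Eq. (7), and renormalizing reproduces the Beta(α + y, β + n − y) density of Eq. (8) pointwise. A sketch (the hyperparameters and data below are arbitrary choices):

```python
from math import comb, gamma

a, b = 2.0, 3.0        # prior Beta(alpha, beta), arbitrary choice
n, y = 10, 7           # data: y successes in n trials, arbitrary choice
step = 0.001
grid = [step * i for i in range(1, 1000)]

def beta_pdf(t, p, q):
    # Beta density as in Eq. (7)
    return gamma(p + q) / (gamma(p) * gamma(q)) * t ** (p - 1) * (1 - t) ** (q - 1)

# Unnormalized posterior: likelihood, Eq. (6), times prior, Eq. (7).
prod = [comb(n, y) * t ** y * (1 - t) ** (n - y) * beta_pdf(t, a, b) for t in grid]
Z = sum(prod) * step                    # numerical normalizing constant p(y)
post = [v / Z for v in prod]
exact = [beta_pdf(t, a + y, b + n - y) for t in grid]
diff = max(abs(p - e) for p, e in zip(post, exact))
print(diff)   # small; limited only by the grid resolution
```

The binomial coefficient and the prior's Γ factors are constants in θ, so they disappear into the normalization, leaving exactly the Beta(α + y, β + n − y) form.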
It is tempting to use Eq. (7) to derive Eq. (6), because they have similar forms given the relationship of the Γ function to the factorial. But the binomial and the beta distribution are not normalized in the same way. The beta is normalized over the values of θ, whereas the binomial is normalized over the counts, y given n. That is,
∫_0^1 Beta(θ; α, β) dθ = 1    (9a)

∑_{y=0}^{n} Bin(y; n, θ) = 1    (9b)
It should be noted that the expected value of θ in a beta distribution is α/(α + β), and the θ with maximum probability (the mode) is (α − 1)/(α + β − 2). In terms of the expectation values, α and β behave as a total of α + β counts even though the exponents are α − 1 and β − 1. The beta distribution is defined such that it is the expectation values, and not the modes, that correspond to counts of α and β.
If there are more than two outcomes, we can use the multinomial distribution for the likelihood:
p(y|θ) = [n!/∏_i yi!] ∏_i θi^{yi}    (10)
where ∑_{i=1}^m yi = n for m possible outcomes of n experiments. For example, if a barrel contains red, blue, and green balls, then m = 3. We might make n = 100 draws from the barrel (returning the balls to the barrel) and get 30 green, 50 red, and 20 blue balls. The conjugate prior distribution is the Dirichlet distribution,
p(θ) = Dirichlet(θ|{xi}) = [Γ(x0)/∏_i Γ(xi)] ∏_{i=1}^m θi^{xi−1}    (11)
where x0 = ∑_{i=1}^m xi. The Dirichlet distribution can be considered a generalization of the multinomial distribution, where the counts are no longer restricted to be integers. The values of the hyperparameters xi that define the prior distribution can be thought of as estimated counts for each of the m outcomes in some sample of size x0 = ∑_{i=1}^m xi. The total number of prior counts, x0, can be scaled to any value to alter the dependence of the posterior distribution on the prior distribution. The larger x0 is, the more precise the prior distribution is and the closer the posterior density is to values near θi = xi/x0. The posterior distribution that results from Eqs. (10) and (11) is also Dirichlet, with parameters xi + yi, i.e.,
p(θ|y) = [Γ(x0 + y0)/∏_i Γ(xi + yi)] ∏_i θi^{xi+yi−1}    (12)
Again, the use of the conjugate prior distribution results in the analytical form of the posterior distribution and therefore also simple expressions for the expectation values for the θi, their variances, covariances, and modes:
E(θi) = (xi + yi)/(x0 + y0)    (13a)

mode(θi) = (xi + yi − 1)/(x0 + y0 − m)    (13b)

var(θi) = E(θi)[1 − E(θi)]/(x0 + y0 + 1)    (13c)

cov(θi, θj) = −E(θi)E(θj)/(x0 + y0 + 1)    (13d)

A noninformative prior distribution could be formed by setting each xi to 1.
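The moment formulas in Eqs. (13a) and (13c) can be spot-checked by Monte Carlo for the barrel example above (noninformative prior counts xi = 1; data y = 30 green, 50 red, 20 blue). Sampling a Dirichlet by normalizing independent gamma variates is a standard trick; the seed and sample size below are arbitrary choices.

```python
import random

# Monte Carlo check of Eqs. (13a) and (13c): posterior Dirichlet counts
# are alpha_i = x_i + y_i, and a Dirichlet draw is obtained by normalizing
# independent Gamma(alpha_i, 1) variates.
random.seed(2)
x = [1.0, 1.0, 1.0]          # noninformative prior counts
y = [30, 50, 20]             # data: green, red, blue
alpha = [xi + yi for xi, yi in zip(x, y)]
a0 = sum(alpha)

def draw():
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

N = 20000
samples = [draw() for _ in range(N)]
mean_red = sum(s[1] for s in samples) / N               # index 1 = red
var_red = sum((s[1] - mean_red) ** 2 for s in samples) / N
E_red = alpha[1] / a0
print(mean_red, E_red)                                  # Eq. (13a)
print(var_red, E_red * (1 - E_red) / (a0 + 1))          # Eq. (13c)
```

The sampled mean and variance agree with the closed-form expressions to within Monte Carlo error.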
2. Normal Models
The normal model can take a variety of forms depending on the choice of noninformative or informative prior distributions and on whether the variance is assumed to be a constant or is given its own prior distribution. And of course, the data could represent a single variable or could be multidimensional. Rather than describing each of the possible combinations, I give only the univariate normal case with informative priors on both the mean and variance. In this case, the likelihood for data y given the values of the parameters that comprise θ, µ (the mean), and σ2 (the variance) is given by the familiar exponential
p(y|µ, σ²) = [1/(σ√(2π))] exp[−(y − µ)²/(2σ²)]    (14)
This expression is valid for a single observation y. For multiple observations, we derive p(y|θ) from the fact that p(y|µ, σ²) = ∏_i p(yi|µ, σ²). The result is that the likelihood is also normal, with the average value of y, ȳ, substituted for y and σ²/n substituted for σ² in Eq. (14). The conjugate prior distribution for Eq. (14) is
p(µ, σ²) = p(µ|σ²)p(σ²)
= [√κ0/(σ√(2π))] exp[−κ0(µ − µ0)²/(2σ²)] × [(ν0/2)^{ν0/2}/Γ(ν0/2)] σ0^{ν0} σ^{−(ν0+2)} exp[−ν0σ0²/(2σ²)]    (15)
where the prior for µ is a normal distribution dependent on the value of σ² as well as on two hyperparameters, the mean µ0 and the scale κ0, while the prior for σ² is a scaled inverse χ² distribution [the second factor in Eq. (15)] with two hyperparameters, the scale σ0 and degrees of freedom ν0. The posterior distribution that results from Eqs. (14) and (15) has the same form as the prior (because the prior is conjugate to the likelihood), so that [44]
p(µ, σ²|y) = [√κn/(σ√(2π))] exp[−κn(µ − µn)²/(2σ²)] × [(νn/2)^{νn/2}/Γ(νn/2)] σn^{νn} σ^{−(νn+2)} exp[−νnσn²/(2σ²)]    (16)

where

µn = (1/κn)(κ0µ0 + nȳ)    (17a)

κn = κ0 + n,  νn = ν0 + n    (17b)

νnσn² = ν0σ0² + (n − 1)s² + (nκ0/κn)(ȳ − µ0)²    (17c)

s² = [1/(n − 1)] ∑_{i=1}^n (yi − ȳ)²    (17d)
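As a concrete check of the update rules in Eqs. (17a)–(17d), here is a small worked example; the prior hyperparameters and the five observations are invented for illustration:

```python
# Update of the normal-model hyperparameters per Eqs. (17a)-(17d).
y = [9.8, 10.4, 10.1, 9.5, 10.7]        # made-up sample
n = len(y)
mu0, kappa0 = 8.0, 2.0                  # prior mean and prior "sample size"
nu0, sigma0_sq = 3.0, 1.0               # prior degrees of freedom and scale

ybar = sum(y) / n                                    # sample mean, 10.1
s_sq = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # Eq. (17d)

kappa_n = kappa0 + n                                 # Eq. (17b)
nu_n = nu0 + n                                       # Eq. (17b)
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n           # Eq. (17a)
sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
              + (n * kappa0 / kappa_n) * (ybar - mu0) ** 2) / nu_n   # Eq. (17c)

print(round(mu_n, 3), round(sigma_n_sq, 3))   # 9.5 1.275
```

The posterior mean 9.5 is the κ0 : n weighted average of the prior mean 8.0 and the sample mean 10.1, i.e., (2 × 8.0 + 5 × 10.1)/7, exactly as Eq. (17a) prescribes, and σn² pools the prior scale, the data variance, and the prior–data disagreement term.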
Although this is a complicated expression, the results can be given a simple interpretation. The data sample size is n, whereas the prior sample size is κ0, and therefore µn is the weighted average of the prior ‘‘data’’ and the actual data. σn² is the weighted average of the prior variance (σ0²), the data variance (s²), and a term from the difference in the prior