Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang
Preface
structure–activity relationship. A number of software packages for structure-based design are compared.
Chapter 17 (Protein Structure Prediction as a Systems Problem) provides a novel systematic view on solving the complex problem of protein structure prediction. It introduces the consensus-based approach, the pipeline approach, and expert systems for predicting protein structure and for inferring protein functions. This chapter also discusses issues such as benchmark data and evaluation metrics. An example of protein structure prediction at genome-wide scale is also given.
Chapter 18 (Resources and Infrastructure for Structural Bioinformatics) describes tools, databases, and other resources of protein structure analysis and prediction available on the Internet. These include the PDB and related databases and servers, structural visualization tools, protein sequence and function databases, as well as resources for RNA structure modeling and prediction. It also gives information on major journals, professional societies, and conferences of the field.
Appendix 1 (Biological and Chemical Basics Related to Protein Structures) introduces the central dogma of molecular biology, macromolecules in the cell (DNA, RNA, protein), amino acid residues, peptide chains, the primary, secondary, tertiary, and quaternary structure of proteins, and protein evolution.
Appendix 2 (Computer Science for Structural Informatics) discusses computer science concepts that are essential for effective computation in protein structure prediction. These include efficient data structures, computational complexity and NP-hardness, various algorithmic techniques, parallel computing, and programming.
Appendix 3 (Physical and Chemical Basis for Structural Bioinformatics) covers basic concepts of our physical world, including unit systems, coordinate systems, and energy surfaces. It also describes biochemical and biophysical concepts such as chemical reactions, peptide bonds, covalent bonds, hydrogen bonds, electrostatic interactions, van der Waals interactions, and hydrophobic interactions. In addition, this appendix discusses basic concepts from thermodynamics and statistical mechanics, as well as computational sampling techniques such as molecular dynamics and the Monte Carlo method.
Appendix 4 (Mathematics and Statistics for Studying Protein Structures) covers basic concepts in mathematics and statistics that are often used in structural bioinformatics studies: probability distributions (uniform, Gaussian, binomial and multinomial, Dirichlet and gamma, extreme value distribution), the basics of information theory (entropy, relative entropy, and mutual information), Markov processes and hidden Markov models, hypothesis testing, statistical inference (maximum likelihood, expectation maximization, and the Bayesian approach), and statistical sampling (rejection sampling, Gibbs sampling, and the Metropolis–Hastings algorithm).
Ying Xu
Dong Xu
Jie Liang
John Wooley
April 2006
Acknowledgments
During the editing of this book, we, the editors, have received tremendous help from many friends, colleagues, and family members, to whom we would like to take this opportunity to express our deep gratitude and appreciation. First, we would like to thank Dr. Eli Greenbaum of Oak Ridge National Laboratory, who encouraged us to start this book project and contacted the publisher at Springer on our behalf. We are very grateful to the following colleagues, who critically reviewed drafts of the chapters at various stages: Nick Alexandrov, Nir Ben-Tal, Natasja Brooijmans, Chris Bystroff, Pablo Chacon, Luonan Chen, Zhong Chen, Yong Duan, Roland Dunbrack, Daniel Fischer, Juntao Guo, Jaap Heringa, Xiche Hu, Ana Kitazono, Ioan Kosztin, Sandeep Kumar, Xiang Li, Guohui Lin, Zhijie Liu, Hui Lu, Alex MacKerell, Kunbin Qu, Robert C. Rizzo, Ilya Shindyalov, Ambuj Singh, Alex Tropsha, Iosif Vaisman, Ilya Vakser, Stella Veretnik, Björn Wallner, Jin Wang, Zhexin Xiang, Yang Dai, Xin Yuan, and Yaoqi Zhou. Their invaluable input on the scientific content, the pedagogical style, and the writing style helped to improve these chapters significantly. We also want to thank Ms. Joan Yantko of the University of Georgia for her tireless help on numerous fronts in this book project, including handling a large number of email communications between the editors and the authors and chasing busy authors for their revisions and other materials. Last but not least, we want to thank our families for their constant support and encouragement while we worked on this book project.
Contents
Contributors
1 A Historical Perspective and Overview of Protein Structure Prediction
John C. Wooley and Yuzhen Ye
2 Empirical Force Fields
Alexander D. MacKerell, Jr.
3 Knowledge-Based Energy Functions for Computational Studies of Proteins
Xiang Li and Jie Liang
4 Computational Methods for Domain Partitioning of Protein Structures
Stella Veretnik and Ilya Shindyalov
5 Protein Structure Comparison and Classification
Orhan Çamoğlu and Ambuj K. Singh
6 Computation of Protein Geometry and Its Applications: Packing and Function Prediction
Jie Liang
7 Local Structure Prediction of Proteins
Victor A. Simossis and Jaap Heringa
8 Protein Contact Map Prediction
Xin Yuan and Christopher Bystroff
9 Modeling Protein Aggregate Assembly and Structure
Jun-tao Guo, Carol K. Hall, Ying Xu, and Ronald B. Wetzel
10 Homology-Based Modeling of Protein Structure
Zhexin Xiang
11 Modeling Protein Structures Based on Density Maps at Intermediate Resolutions
Jianpeng Ma
Index
Contributors
Natasja Brooijmans
Chemical and Screening Sciences
Wyeth Research
Pearl River, New York 10965
Christopher Bystroff
Department of Biology
Rensselaer Polytechnic Institute
Troy, New York 12180
Liming Cai
Department of Computer Science
University of Georgia
Athens, Georgia 30602-7404
Orhan Camoglu
Department of Computer Science
University of California Santa Barbara
Santa Barbara, California 93106
Yang Dai
Department of Bioengineering
University of Illinois at Chicago
Chicago, Illinois 60607-7052
Haobo Guo
Department of Biochemistry and
Cellular and Molecular Biology
University of Tennessee
Knoxville, Tennessee 37996
Hong Guo
Department of Biochemistry and
Cellular and Molecular
Biology
University of Tennessee
Knoxville, Tennessee 37996
Jun-tao Guo
Department of Biochemistry and
Molecular Biology
University of Georgia
Athens, Georgia 30602-7229
Carol K. Hall
Department of Chemical and
Biomolecular Engineering
North Carolina State University
Raleigh, North Carolina 27695
Jaap Heringa
Centre for Integrative Bioinformatics
Vrije Universiteit
1081 HV Amsterdam, The Netherlands
Xiche Hu
Department of Chemistry
University of Toledo
Toledo, Ohio 43606
Ling-Hong Hung
Department of Microbiology
University of Washington
Seattle, Washington 98195-7242
Xiang Li
Department of Bioengineering
University of Illinois at Chicago
Chicago, Illinois 60607-7052
Jie Liang
Department of Bioengineering
University of Illinois at Chicago
Chicago, Illinois 60607-7052
Guohui Lin
Department of Computing Science
University of Alberta
Edmonton, Alberta T6G 2E8, Canada
Zhijie Liu
Department of Biochemistry and
Molecular Biology
University of Georgia
Athens, Georgia 30602-7229
Hui Lu
Department of Bioengineering
University of Illinois at Chicago
Chicago, Illinois 60607-7052
Jianpeng Ma
Department of Biochemistry and
Molecular Biology
Baylor College of Medicine
Houston, Texas 77030
and
Department of Bioengineering
Rice University
Houston, Texas 77005
Alexander D. MacKerell, Jr.
Department of Pharmaceutical
Chemistry
School of Pharmacy
University of Maryland
Baltimore, Maryland 21201
Shing-Chung Ngan
Department of Microbiology
University of Washington
Seattle, Washington 98195-7242
Ognjen Perišić
Department of Bioengineering
University of Illinois at Chicago
Chicago, Illinois 60607-7052
Brian Pierce
Department of Biomedical Engineering
Boston University
Boston, Massachusetts 02215
Kunbin Qu
Department of Chemistry
Rigel Pharmaceuticals, Inc.
San Francisco, California 94080
Ram Samudrala
Department of Microbiology
University of Washington
Seattle, Washington 98195-7242
Ilya Shindyalov
San Diego Supercomputer Center
University of California San Diego
San Diego, California 92093-0505
Victor A. Simossis
Centre for Integrative Bioinformatics
Vrije Universiteit
1081 HV Amsterdam, The Netherlands
Ambuj K. Singh
Department of Computer Science
University of California Santa Barbara
Santa Barbara, California 93106
Stella Veretnik
San Diego Supercomputer Center
University of California San Diego
San Diego, California 92093-0505
Zhiping Weng
Department of Biomedical Engineering
Boston University
Boston, Massachusetts 02215
Ronald B. Wetzel
Department of Structural Biology
Pittsburgh Institute for Neurodegenerative Diseases
University of Pittsburgh School of Medicine
Pittsburgh, Pennsylvania 15260
John C. Wooley
Associate Vice Chancellor for Research
University of California San Diego
San Diego, California 92093-0043
Zhexin Xiang
Center for Molecular Modeling
Center for Information Technology
National Institutes of Health
Bethesda, Maryland 20892-5624
Dong Xu
Computer Science Department
University of Missouri-Columbia
Columbia, Missouri 65211-2060
Ying Xu
Institute of Bioinformatics and
Department of Biochemistry and Molecular Biology
University of Georgia
Athens, Georgia 30602-7229
Yuzhen Ye
Bioinformatics and Systems Biology Department
The Burnham Institute for Medical Research
La Jolla, California 92037
Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, Florida 32306
1 A Historical Perspective and Overview of Protein Structure Prediction
John C. Wooley and Yuzhen Ye
1.1 Introduction
Proteins carry out many different biological functions, yet all are composed of one or more polypeptide chains, each containing from several to hundreds or even thousands of residues drawn from the 20 amino acids. During the 1950s, at the dawn of modern biochemistry, an essential question for biochemists was to understand the structure and function of these polypeptide chains. The sequences of proteins, also referred to as their primary structures, determine the different chemical properties of different proteins, and thus continue to captivate much of the attention of biochemists. As an early step in characterizing protein chemistry, British biochemist Frederick Sanger designed an experimental method to identify the sequence of insulin (Sanger et al., 1955). He became the first person to obtain the primary structure of a protein and in 1958 won his first Nobel Prize in Chemistry. This important progress in sequencing did not answer the question of whether a single (individual) protein has a distinctive shape in three dimensions (3D), and if so, what factors determine its 3D architecture. However, during the period when Sanger was studying the primary structure of proteins, American biochemist Christian Anfinsen observed that the active polypeptide chain of a model protein, bovine pancreatic ribonuclease (RNase), could fold spontaneously into a unique 3D structure, which was later called the native conformation of the protein (Anfinsen et al., 1954). Anfinsen also studied the refolding of the RNase enzyme and observed that an enzyme unfolded in an extreme chemical environment could refold spontaneously into its native conformation once the environment was changed back to natural conditions (Anfinsen et al., 1961). By 1962, Anfinsen had developed his theory of protein folding (which was summarized in his 1972 Nobel acceptance speech): “The native conformation is determined by the totality of interatomic interactions and hence, by the amino acid sequence, in a given environment.”
Anfinsen’s theory of protein folding established the foundation for solving the protein structure prediction problem, i.e., for predicting the native conformation of a protein from its primary sequence, because all information needed to predict the native conformation is encoded in the sequence. The early approaches to solving this problem were based solely on the thermodynamics of protein folding. Scheraga and his colleagues applied several computer searching techniques to investigate the free energy of numerous local minimum energy conformations in an attempt to find the global minimum conformation, i.e., the thermodynamically most stable conformation of the protein (Gibson and Scheraga, 1967a,b; Scott et al., 1967). The major challenge for an energy minimization approach to protein structure prediction is that proteins are very flexible; thus, their potential conformation space is too large to be enumerated. [The observation that proteins nonetheless fold reliably and quickly to their native conformation, despite the huge space of possible conformations, is known as “Levinthal’s paradox” (Levinthal, 1968).] To address this issue, one needs an accurate energy function to compute the energy of a given protein conformation and a rapid computer searching algorithm. The progress of peptide molecular mechanics enabled the development of molecular force fields that described the physical interactions between atoms using Newton’s equations of motion. In general, the interactions considered in a force field include covalent bonds and noncovalent interactions, such as electrostatic interactions, van der Waals interactions, and, sometimes, hydrogen bonds and hydrophobic interactions. The parameters used in these force fields were obtained through experimental studies of small organic molecules. On the other hand, many computational methods developed in the fields of optimization theory and mechanics have been applied to the rapid conformation search. These fall into two categories: the molecular dynamics method and the Brownian dynamics (or stochastic dynamics) method. Both methods sample a portion of the potential protein conformations and evaluate their free energy. Molecular dynamics samples conformations by simulating the protein motion based on Newton’s equations of motion, starting from an arbitrarily chosen protein conformation. Brownian dynamics, instead, uses the Monte Carlo random sampling technique or its derivatives to evaluate protein conformations.
Combining various force fields and conformation searching methods, many software packages were developed, such as AMBER (Pearlman et al., 1995), CHARMM (Brooks et al., 1983), and GROMOS (van Gunsteren and Berendsen, 1990), all aimed at using computer simulations to predict the native conformation of proteins.
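The two ingredients described above, an energy function and a conformation search, can be illustrated with a minimal sketch. The Python code below evaluates a toy force field (a harmonic term for covalent bonds plus a 12-6 Lennard-Jones term for van der Waals interactions) on a short chain of beads and samples conformations with the Metropolis Monte Carlo rule. The function names, the reduced units, and all parameter values are assumptions chosen for illustration; they are not taken from AMBER, CHARMM, or GROMOS, and a real force field contains many more terms.

```python
import math
import random

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """van der Waals term: 12-6 Lennard-Jones energy at distance r (reduced units)."""
    r = max(r, 1e-6)  # guard against division by zero for overlapping beads
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def harmonic_bond(r, k=100.0, r0=1.0):
    """Covalent bond term: harmonic spring around an assumed ideal length r0."""
    return 0.5 * k * (r - r0) ** 2

def total_energy(coords):
    """Bond terms for chain neighbors, Lennard-Jones terms for all other pairs."""
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            energy += harmonic_bond(r) if j == i + 1 else lennard_jones(r)
    return energy

def metropolis(coords, steps=2000, step_size=0.1, kT=1.0, seed=0):
    """Sample conformations by random single-bead moves with Metropolis acceptance."""
    rng = random.Random(seed)
    coords = [list(p) for p in coords]
    energy = total_energy(coords)
    best = energy
    for _ in range(steps):
        i = rng.randrange(len(coords))
        old = coords[i][:]
        coords[i] = [x + rng.uniform(-step_size, step_size) for x in old]
        new_energy = total_energy(coords)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            energy = new_energy
            best = min(best, energy)
        else:
            coords[i] = old  # reject: restore the previous position
    return coords, energy, best

# Start from an extended (open-chain) conformation of five beads.
chain = [(float(i), 0.0, 0.0) for i in range(5)]
final_coords, final_energy, lowest_energy = metropolis(chain)
print(f"start: {total_energy(chain):.3f}  lowest sampled: {lowest_energy:.3f}")
```

Even in this toy setting, the cost of each step grows with the number of atom pairs, and the number of steps needed to reach low-energy conformations grows quickly with chain length.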
Despite the great theoretical interest in energy minimization methods, they have not been very successful in practice because of the huge search space of potential protein conformations. In 1975, Levitt and Warshel used a simplified protein structure representation and successfully folded a small protein [bovine pancreatic trypsin inhibitor (BPTI), 58 amino acid residues] into its native conformation from an open-chain conformation using energy minimization (Levitt and Warshel, 1975). Little progress, however, has been made since then; simulations usually require unrealistic amounts of computing time, and the final predictions are often unsatisfactory. For instance, in 1998, Duan and Kollman reported a simulation of one small protein (the villin headpiece subdomain, 36 amino acid residues), run on a Cray T3D and then a Cray T3E supercomputer, that took months of computation with the entire machine dedicated to the problem (Duan and Kollman, 1998). Even though the resulting structure was reasonably folded and showed some resemblance to the native structure, the simulated and native structures did not completely match. Currently, energy minimization methods are largely used to refine a low-resolution initial structure obtained by experimental methods or by comparative modeling (Levitt and Lifson, 1969).
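The size of the search space referred to above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that each residue can adopt three backbone states and that a computer could evaluate 10^13 conformations per second; neither number comes from the text, and both are deliberately generous. The chain lengths of 36 and 58 residues correspond to the villin headpiece subdomain and BPTI mentioned above.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def conformations(n_residues, states_per_residue=3):
    """Number of chain conformations if every residue picks a state independently."""
    return states_per_residue ** n_residues

def years_to_enumerate(n_residues, states_per_residue=3, rate_per_second=1e13):
    """Time to visit every conformation once at an assumed sampling rate."""
    seconds = conformations(n_residues, states_per_residue) / rate_per_second
    return seconds / SECONDS_PER_YEAR

for n in (36, 58, 100):
    count = conformations(n)
    print(f"{n:>3} residues: about 10^{math.log10(count):.0f} conformations, "
          f"{years_to_enumerate(n):.1e} years to enumerate")
```

Under these assumptions a 100-residue chain would take on the order of 10^27 years to enumerate exhaustively, which is why practical methods sample only a small portion of conformation space.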