Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Young - Computational chemistry

.pdf
Скачиваний:
77
Добавлен:
08.01.2014
Размер:
4.23 Mб
Скачать

21.3 SIMULATED ANNEALING

183

Several general rules of thumb can be given for doing very thorough Monte Carlo searches. The ®rst is to run 1000 to 2000 Monte Carlo steps for each degree of freedom in the molecule. It is best to perform a couple of minimization steps at each point in the search. The lowest-energy structures should be saved and minimized fully. This should be done with anywhere from a few to a few hundred of the lowest-energy conformers, depending on the complexity of the molecule. Combine the results of several searches, keeping the unique structures. This process is continued until no more unique structures are found.

21.3SIMULATED ANNEALING

A simulated annealing calculation has a greater level of sophistication than a Monte Carlo simulation. This algorithm is based on the observation that crystals are global energy minima for very complex systems. Crystals are typically formed by slowly decreasing the temperature. This is because the system will start out with enough energy to cross energy barriers from less favorable to more favorable con®gurations. As the temperature is lowered, the least favorable con®gurations are rendered energetically inaccessible and an increasingly larger number of molecules must populate the lowest-energy orientations.

A simulated annealing algorithm is a molecular dynamics simulation, in which the amount of kinetic energy in the molecule (the simulation temperature) slowly decreases over the course of the simulation. At the beginning of the simulation, many high-energy structures are being examined and high-energy barriers can be crossed. At the end of the calculation, only structures that are close to the best-known low-energy structures are examined, as depicted in Figure 21.4. This is generally an improvement over Monte Carlo methods in terms of the number of low-energy conformers found for a given amount of computation time.

There are similar algorithms, also called simulated annealing, that are Monte Carlo algorithms in which the choice conformations obey a Gaussian distribution centered on the lowest-energy value found thus far. The standard deviation of this distribution decreases over the course of the simulation.

In practice, simulated annealing is most e¨ective for ®nding low-energy conformers that are similar in shape to the starting geometry. A simulated annealing algorithm starts from a given geometry and has the ability to cross barriers to other conformers. If the global minimum is reasonably similar to this geometry, a simulated annealing algorithm will have a good chance of ®nding it. If there are a number of high-energy barriers between the starting geometry and global minimum, a simulated annealing algorithm will not be the most e½cient way to ®nd the global minimum. This is because it will require a long computation time, starting from a high temperature and cooling slowly in order to adequately cross those barriers and search all the conformation space. Monte Carlo methods will be a better choice when there are a large number of nearly equivalent minima with signi®cantly di¨erent structures. If the simula-

184 21 CONFORMATION SEARCHING

Energy

Conformation space

FIGURE 21.4 Sampling of conformation space using a simulated annealing algorithm: Looks like the Monte Carlo sampling in Figure 21.3 at the beginning of the simulation and like the sampling shown here at the end of the simulation.

tion is being run long enough with a su½ciently slow decrease in temperature, multiple runs starting from di¨erent geometries should give the same ending structure most of the time. Thus, the results of several identical simulations are examined to determine if the simulation was large enough to yield results that can be trusted.

21.4GENETIC ALGORITHMS

Genetic algorithms stem from the observation that the evolution process tends to produce increasingly well-adapted populations. In keeping with this model, there must be some set of numbers representing the conformation of the molecule that are referred to collectively as a ``chromosome.'' There must also be a criteria for measuring ®tness that is generally the conformational energy. At the start of the simulation, a ``population'' consisting of many di¨erent conformers is chosen randomly. Over the course of the simulation, these are combined to produce new conformers using algorithms that represent reproduction, replacement, and mutation.

The reproduction process includes functions for selection, replacement, crossover, mutation, and elitism. Selection ensures that individuals in one generation have a higher probability of being included in the reproduction process if they have a higher ®tness. Replacement is the copying of some individuals from one generation to another, thus resulting in some overlap between generations. Crossover is the process of individuals acquiring parts of their chro-

21.5 DISTANCE-GEOMETRY ALGORITHMS 185

mosomes from each parent. Mutation is accomplished by randomly changing a small percentage of the chromosomes in a generation. Elitism is the unnatural choice of copying the very best individual in the population into the next generation without mutation. All these operations can be de®ned as occurring on an array of numbers or a string of binary bits. As this process continues, the population becomes better adapted, as indicated by the fact that most of the population has the same value for one of its ``genes,'' which has converged. Once most of the genes have converged, the simulation is stopped and the most ®t individual is accepted as the global optimum.

There are a number of known problems with genetic algorithms, most of which can be somewhat corrected by more sophisticated variations on the basic algorithm. A few very ®t individuals can force the population to a premature convergence on a local minimum. Premature convergence can be prevented by requiring a certain amount of diversity in the population. All the conformational space will be sampled even if some of it is physically unreasonable. A well-constructed ®tness function can minimize this. The genetic algorithm only works well with multiple tests of many individuals and many generations; thus, these are inherently large simulations. The exact crossover algorithm can also have a signi®cant bearing on the ®nal results.

A genetic algorithm is generally an e¨ective way of generating a large number of low-energy conformers. However, there is no guarantee that a global minimum will be found. A number of tests have shown genetic algorithms to be superior to simulated annealing and grid searches.

21.5DISTANCE-GEOMETRY ALGORITHMS

The amount of computation necessary to try many conformers can be greatly reduced if a portion of the structure is known. One way to determine a portion of the structure experimentally is to obtain some of the internuclear distances from two-dimensional NMR experiments, as predicted by the nuclear Overhauser e¨ect (NOE). Once a set of distances are determined, they can be used as constraints within a conformation search. This has been particularly e¨ective for predicting protein structure since it is very di½cult to obtain crystallographic structures of proteins. It is also possible to de®ne distance constraints based on the average bond lengths and angles, if we assume these are fairly rigid while all conformations are accessible.

The experimentally determined correct distances are incorporated into the energy expression by de®ning a penalty function, which is zero for a reasonable range of correct distances and then increases outside of that range. Once this energy penalty has been de®ned, other search techniques such as Monte Carlo simulation and simulated annealing can be used. This technique has the added advantage of searching a space of conformers that are relevant to the experimental results, even when that might not be the global minimum.

If the molecular motion is faster than the NMR timescale, the distance pre-

186 21 CONFORMATION SEARCHING

dicted by an NOE measurement will be a time average only. In this case, it might be necessary to use a wider range of acceptable distances in order to incorporate all conformers consistent with the NOE data. Time-averaged constraint algorithms have been proposed, but these are di½cult to use e¨ectively.

21.6THE FRAGMENT APPROACH

One way to reduce computation time is to optimize one part of a molecule at a time. For example, the conformation of a t-butyl group is well known; thus, there is no need to expend computation time searching conformations of that group. This concept of fragmentation can be automated by constructing a database of molecular fragments in their lowest-energy conformation. An algorithm can then be designed to use those fragments as they appear in the database, thus only optimizing the conformations between fragments.

Another variation on this technique is to ®rst optimize the side chains and then keep the side chains ®xed while optimizing the backbone. In an extreme case, representing these ®xed side chains as large polygons with some net interaction potential can increase the calculation speed even more.

21.7CHAIN GROWTH

The fragment approach has sometimes been combined with a chain growth or buildup algorithm. The chain growth algorithm is one in which the full molecule is built up one unit at a time. As each unit (monomer or functional group) is added, its conformation is searched without changing the rest of the chain. This results in a CPU time requirement that is directly proportional to the number of units and is thus much faster than some of the other algorithms.

This is a fairly reasonable way to describe man-made amorphous polymers, which had not been given time to anneal. For polymers that form very quickly, a quick Monte Carlo search on addition can insert an amount of nonoptimal randomness, as is expected in the physical system.

This is not a good way to describe polymers that have been annealed to give a crystalline form or some crystalline domains. The chain growth algorithm as described above is a fairly poor way to describe biomolecules. This is because the core regions of biomolecules will not be solvent-accessible while the outer regions are solvent-accessible, something which is not taken into account by this simple algorithm. More sophisticated chain growth algorithms have been applied to biomolecular systems.

21.8RULE-BASED SYSTEMS

Rule-based systems try to identify certain subsequences of amino acids that tend to have a particular secondary structure, such as sheets, a-helices, b-strands,

21.9 USING HOMOLOGY MODELING 187

and so on. These sections can then be held rigid while the conformations of the connecting fragments are searched. The most successful way of doing this is by analyzing many known structures to determine which amino acids or sequences of amino acids have a statistical tendency to adopt each type of secondary structure. The known structures are also analyzed to determine which tend to initiate or terminate regions of a particular secondary structure. The drawback of this method is that there are exceptions to every rule, which can lead to false results.

21.9USING HOMOLOGY MODELING

Homology algorithms are based on ®nding similar molecules. These are most often applied to proteins since it is easy to automate the step of comparing the sequence of a protein with the sequences of known proteins. For example, the same protein from several di¨erent species will often have a very similar sequence and conformation. Once the similarity algorithm has been used to identify the most similar sequence, the geometry of the previously determined structure (called the template) can be used with the necessary substitutions to generate a very reasonable starting geometry (called the model). From this starting geometry, minimization algorithms or very limited conformation searches can be used. Homology techniques can also be applied to nucleic acids.

There are several signi®cant advantages to homology modeling compared to other conformation search techniques. Homology techniques are less computerintensive than the alternatives (but still not trivial). They can give high-quality results. Homology is also good at obtaining the tertiary structure.

There are also some disadvantages to homology modeling. It is still a relatively new technique. Manual intervention is necessary. The result is never perfect. Side chains and loops are di½cult to position. Worst of all is that homology cannot always be used since it depends on ®nding a known structure with a similar sequence. Overall, the advantages outweigh the disadvantages, making homology a very important technique in the pharmaceutical industry.

The degree of sequence similarity determines how much work is involved in homology modeling and how accurate the results can be expected to be. If similarity is greater than 90%, homology can match crystallography within the experimental error, with the biggest di¨erence in side chain rotations. A similarity of 75 to 90% can still result in very good results. At this point, the accuracy of the ®nal results is limited by the computer power available, mostly for model optimization. A similarity of 50 to 75% can be expected to result in an RMS error of about 1.5 AÊ , with large errors in some sections of the molecule. A case with a similarity of 25 to 50% can give adequate results with manual intervention. These cases are limited by the alignment algorithm. With a similarity of less than 25%, the homology is unreliable, and distance-geometry usually gives superior results.

188 21 CONFORMATION SEARCHING

The homology process consists of eight steps:

1.Template recognition

2.Alignment

3.Alignment correction

4.Backbone generation

5.Generation of loops

6.Side-chain generation

7.Model optimization

8.Model veri®cation

Template recognition is the process of ®nding the most similar sequence. The researcher must choose how to compute similarity. It is possible to run a fast, approximate search of many sequences or a slow, accurate search of a few sequences. Sequences that should be analyzed more carefully are the same protein from a di¨erent species, proteins with a similar function or from the same metabolic pathway, or a library of commonly observed substructures if available.

The alignment algorithm scores how well pairs match: high scores for identical pairs, medium scores for similar pairs (i.e., both hydrophobic), low scores for dissimilar pairs. Alignment is di½cult for low-similarity sequences, thus requiring manual alignment. Multiple-sequence alignment is the process of using sections from several template compounds.

Alignment correction is used to compensate for limitations of the alignment procedure. Alignment algorithms are based on primary structure only. Decisions between two options with medium scores are not completely reliable. Alignment correction reexamines these marginal cases by looking at the secondary and tertiary structure in that localized region. This may be completely manual, or at the very least it may require visual examination.

Backbone generation is the ®rst step in building a three-dimensional model of the protein. First, it is necessary to ®nd structurally conserved regions (SCR) in the backbone. Next, place them in space with an orientation and conformation best matching those of the template. Single amino acid exchanges are assumed not to a¨ect the tertiary structure. This often results in having sections of the model compound that are unconnected.

The generation of loops is necessary because disconnected regions are often separated by a section where a few amino acids have been inserted or omitted. These are often extra loops that can be determined by several methods. One method is to perform a database search to ®nd a similar loop and then use its geometric structure. Often, other conformation search methods are used. Manual structure building may be necessary in order to ®nd a conformation that connects the segments. Visual inspection of the result is recommended in any case.

21.10 HANDLING RING SYSTEMS

189

Side chain generation is often a source of error. It will be most reliable if certain rules of thumb are obeyed. Start with structurally conserved side chains and hold them ®xed. Then look at the energy and entropy of rotamers for the remaining side chains. Conventional conformation search techniques are often used to place each side chain.

Model optimization is a further re®nement of the secondary and tertiary structure. At a minimum, a molecular mechanics energy minimization is done. Often, molecular dynamics or simulated annealing are used. These are frequently chosen to search the region of conformational space relatively close to the starting structure. For marginal cases, this step is very important and larger simulations should be run.

Model veri®cation provides a common-sense check of results. One quick check is to compare the minimized energy to that of similar proteins. It is also important to examine the structure to ensure that hydrophobic groups point inward and hydrophilic groups point outward.

An ideal case would be one with high similarity, in which most di¨erences are single amino acid exchanges and the entire tertiary structure can be determined from the template. A good case is one in which sections of 10 to 30 pairs are conserved, and conserved regions are separated by short strings of inserted, omitted, or exchanged amino acids (to be examined in the loop generation phase). An acceptable case is one in which many small regions are identical and separated by single or double exchanges, or one in which larger sections are taken from di¨erent templates. Thus, alignment may be di½cult for automated algorithms and should be examined by hand. A poor case would be one in which there is low similarity, causing alignment algorithms to fail and requiring extensive model optimization. For this poor case, results are expected to be marginal even with considerable manual intervention.

21.10HANDLING RING SYSTEMS

Ring systems present a particular di½culty because many of the structures generated by a conformation search algorithm will correspond to a broken ring. Grid searches and Monte Carlo searches of ring systems are often done by ®rst breaking the ring and then only accepting conformations that put the two ends within a reasonable distance. Monte Carlo searches are also sometimes done in Cartesian coordinates.

Simulated annealing searches will ®nd conformations corresponding to ring changes, without unreasonably breaking the ring. There are also good algorithms for randomly displacing the atoms and then performing a minimization of the geometry. If the ring is one well understood, such as cyclohexane, it is sometimes most e½cient to simply try all the known chair and boat forms of the base compound.

190 21 CONFORMATION SEARCHING

21.11LEVEL OF THEORY

It is quite common to do the conformation search with a very fast method and to then optimize a collection of the lowest-energy conformers with a more accurate method. In some cases, single geometry calculations with more accurate methods are also performed. Solvent e¨ects may also be important as discussed in Chapter 24.

The simplest and most quickly computed models are those based solely on steric hindrance. Unfortunately, these are often too inaccurate to be trusted. Molecular mechanics methods are often the method of choice due to the large amount of computation time necessary. Semiempirical methods are sometimes used when molecular mechanics does not properly represent the molecule. Ab initio methods are only viable for the very smallest molecules. These are discussed in more detail in the applicable chapters and the sources mentioned in the bibliography.

21.12RECOMMENDED SEARCH ALGORITHMS

Since computation time is the most important bottleneck to conformation searching, the following list starts with the methods most amenable to the largest molecular systems:

1.Homology-based starting structures

2.Distance-geometry algorithms

3.Fragment-based algorithms

4.Chain growth algorithms where applicable

5.Rule-based systems

6.Genetic algorithms

7.Simulated annealing

8.Monte Carlo algorithms

9.Grid searches

BIBLIOGRAPHY

Sources, which compare conformation search algorithms are

F.Jensen, Introduction to Computational Chemistry John Wiley & Sons, New York (1999).

A. R. Leach Molecular Modelling Principles and Applications Longman, Essex (1996). M. VaÂsquez, G. NeÂmethy, H. A. Scheraga, Chem. Rev. 94, 2183 (1994).

H. A. Scheraga, Computer Simulation of Biomolecular Systems Voume 2 W. F. v. Gunsteren, P. K. Weiner, A. J. Wilkinson, Eds., ESCOM, Leiden (1993).

BIBLIOGRAPHY 191

H.A. Scheraga, Rev. Comput. Chem. 3, 73 (1992).

L. Ingber, B. Rosen, Mathl. Comput. Modelling 16, 87 (1992). A. R. Leach, Rev. Comput. Chem. 2, 1 (1991).

J. Skolnick, A. Kolinski, Ann. Rev. Phys. Chem. 40, 207 (1989).

Sources, which compare the accuracy of various levels of theory for describing conformational energy di¨erences, are

W. J. Hehre, Practical Strategies for Electronic Structure Calculations Wavfunction, Irvine (1995).

I. Pettersson, T. Liljefors, Rev. Comput. Chem. 9, 167 (1996).

I. N. Levine, Quantum Chemistry Fourth Edition Prentice Hall, Englewood Cli¨s (1991).

A.Golebiewski, A. Parczewski, Chem. Rev. 74, 519 (1974).

Review articles are

I.KolossvaÂry, W. C. Guida, Encycl. Comput. Chem. 1, 513 (1998). S. Vajda, Encycl. Comput. Chem. 1, 521 (1998).

E. L. Eliel, Encycl. Comput. Chem. 1, 531 (1998).

G. M. Crippen, Encycl. Comput. Chem. 1, 542 (1998).

F. Neumann, J. Michl, Encycl. Comput. Chem. 1, 556 (1998). R. L. Jernigan, I. Bahar, Encycl. Comput. Chem. 1, 562 (1998). J. E. Strab, Encycl. Comput. Chem. 3, 2184 (1998).

M. Saunders, Encycl. Comput. Chem. 4, 2948 (1998).

Monte Carlo algorithms are discussed in

M.H. Kalos, P. A. Witlock, Monte Carlo Methods Volume I: Basics Wiley Interscience, New York (1986).

N.Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, J. Chem. Phys. 21, 1087 (1953).

Simulated annealing is discussed in

S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi, Science 220, 671 (1983).

Genetic algorithms are discussed in

R. Judson, Rev. Comput. Chem. 10, 1 (1997).

Genetic Algorithms in Molecular Modelling J. Devillers, Ed., Academic Press, San Diego (1996).

C.B. Lucasius, G. Kateman, Chemometrics Intell. Lab. Syst. 19, 1 (1993).

D.B. Hibbert, Chemometrics Intell. Lab. Syst. 19, 277 (1993).

H. M. Cartwright, Applications of Arti®cial Intelligence in Chemistry Oxford, Oxford (1993).

T. Brodmeier, E. Pretsch, J. Comp. Chem. 15, 588 (1994).

192 21 CONFORMATION SEARCHING

Distance-geometry algorithms are reviewed in

A. E. Torda, W. F. v. Gunsteren, Rev. Comput. Chem 3, 143 (1992). M.-P. Williamson, J. P. Walto, Chem. Soc. Rev. 21, 227 (1992).

G. M. Crippen, T. F. Havel, Distance Geometry and Molecular Conformation John Wiley & Sons, New York (1988).

Homology modeling is discussed in recent computational drug design texts and

http://swift.embl-heidelberg.de/course/

S. Mosimann, R. Meleshko, M. N. G. James Proteins 23, 301 (1995). G. D. Schuler, S. F. Altschul, D. J. Lipman, Proteins 9, 190 (1991). J. Greer, Proteins, 7, 317 (1990).

P.S. Shenkin, D. L. Yarmush, R. M. Fine, H. Wang, C. Levinthal, Biopolymers 26, 2053 (1987).

J.W. Ponder, F. M. Richards, J. Mol. Biol. 193, 775 (1987). T. A. Jones, S. Thirup, EMBO J. 5, 819 (1986).

Соседние файлы в предмете Химия