Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.) Computational biochemistry and biophysic.pdf

Comparative Protein Structure Modeling


that the model is correct [204,205]. Thus, distributions of many spatial features have been compiled from high resolution protein structures, and any large deviations from the most likely values have been interpreted as strong indicators of errors in the model. Such features include packing [206], formation of a hydrophobic core [207], residue and atomic solvent accessibilities [208–212], spatial distribution of charged groups [213], distribution of atom–atom distances [214], atomic volumes [215], and main chain hydrogen bonding [200].
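The idea of flagging large deviations from a compiled distribution can be sketched in a few lines. This is only a minimal illustration, not the procedure of any of the cited programs: the function name `flag_outliers`, the cutoff of 3 standard deviations, and the numbers are all hypothetical, and a real feature distribution need not be well described by a mean and standard deviation.

```python
from statistics import mean, stdev

def flag_outliers(reference_values, model_values, z_cutoff=3.0):
    """Return (index, value, z) for model values far from the reference distribution.

    reference_values: feature values compiled from high resolution structures.
    model_values: the same feature measured in the model being tested.
    z_cutoff: hypothetical threshold on |z|; not taken from the text.
    """
    mu = mean(reference_values)
    sigma = stdev(reference_values)
    flagged = []
    for i, value in enumerate(model_values):
        z = (value - mu) / sigma
        if abs(z) > z_cutoff:
            flagged.append((i, value, z))
    return flagged

# Hypothetical numbers, for illustration only.
reference = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.02, 0.98]
model = [1.01, 0.97, 1.60, 1.03]
outliers = flag_outliers(reference, model)
for index, value, z in outliers:
    print(f"position {index}: value {value:.2f}, z = {z:.1f}")
```

Only the third model value lies far outside the reference spread, so it is the one position reported as a likely error.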

Another group of methods for testing 3D models that implicitly take into account many of the criteria listed above involve 3D profiles and statistical potentials [87,216]. These methods evaluate the environment of each residue in a model with respect to the expected environment as found in the high resolution X-ray structures. Programs implementing this approach include VERIFY3D [216], PROSA [217], HARMONY [218], and ANOLEA [120].

An additional role of the model evaluation methods is to help in the actual modeling procedure. In principle, an improvement in the accuracy of a model is possible by incorporating the quality criteria into a scoring function being optimized to derive the model in the first place.

VI. APPLICATIONS OF COMPARATIVE MODELING

Comparative modeling is often an efficient way to obtain useful information about the proteins of interest. For example, comparative models can be helpful in designing mutants to test hypotheses about the protein’s function [89,219]; identifying active and binding sites [220]; searching for, designing, and improving ligands for a given binding site [221]; modeling substrate specificity [222]; predicting antigenic epitopes [223]; simulating protein–protein docking [224]; inferring function from calculated electrostatic potential around the protein [225]; facilitating molecular replacement in X-ray structure determination [226]; refining models based on NMR constraints [227]; testing and improving a sequence–structure alignment [228]; confirming a remote structural relationship [59]; and rationalizing known experimental observations. For an exhaustive review of comparative modeling applications, see Ref. 3.

Fortunately, a 3D model does not have to be absolutely perfect to be helpful in biology, as demonstrated by the applications listed above. However, the type of question that can be addressed with a particular model does depend on the model’s accuracy. At the low end of the accuracy spectrum, there are models that are based on less than 25% sequence identity and sometimes have less than 50% of their Cα atoms within 3.5 Å of their correct positions. However, such models still have the correct fold, and even knowing only the fold of a protein is frequently sufficient to predict its approximate biochemical function. More specifically, only nine out of 80 fold families known in 1994 contained proteins (domains) that were not in the same functional class, although 32% of all protein structures belonged to one of the nine superfolds [229]. Models in this low range of accuracy combined with model evaluation can be used for confirming or rejecting a match between remotely related proteins [9,58].
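The accuracy measure used here, the fraction of Cα atoms within 3.5 Å of their correct positions, can be computed as in the sketch below. It assumes that the model and the reference structure have already been superposed and that the two coordinate lists are in the same residue order; the function name and the toy coordinates are hypothetical.

```python
import math

def fraction_within_cutoff(model_ca, reference_ca, cutoff=3.5):
    """Fraction of C-alpha atoms lying within `cutoff` angstroms of the reference.

    Both arguments are equal-length lists of (x, y, z) tuples, one per residue,
    assumed to be superposed and listed in the same residue order.
    """
    if len(model_ca) != len(reference_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        raise ValueError("coordinate lists must not be empty")
    close = sum(
        1 for m, r in zip(model_ca, reference_ca) if math.dist(m, r) <= cutoff
    )
    return close / len(model_ca)

# Toy coordinates (hypothetical): the per-atom deviations are 1, 0, 6, and 5 angstroms.
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
reference_ca = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
fraction = fraction_within_cutoff(model_ca, reference_ca)
print(f"{100 * fraction:.0f}% of C-alpha atoms within 3.5 A")
```

In this toy case two of the four atoms fall within the cutoff, which would place the model at the low end of the accuracy spectrum described above.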

In the middle of the accuracy spectrum are the models based on approximately 35% sequence identity, corresponding to 85% of the Cα atoms modeled within 3.5 Å of their correct positions. Fortunately, the active and binding sites are frequently more conserved


Fiser et al.

than the rest of the fold and are thus modeled more accurately [9]. In general, medium resolution models frequently allow a refinement of the functional prediction based on sequence alone, because ligand binding is most directly determined by the structure of the binding site rather than its sequence. It is frequently possible to predict correctly important features of the target protein that do not occur in the template structure. For example, the location of a binding site can be predicted from clusters of charged residues [225], and the size of a ligand may be predicted from the volume of the binding site cleft [222]. Medium resolution models can also be used to construct site-directed mutants with altered or destroyed binding capacity, which in turn can be used to test hypotheses about the sequence–structure–function relationships. Other problems that can be addressed with medium resolution comparative models include designing proteins that have compact structures without long tails, loops, and exposed hydrophobic residues for better crystallization, and designing proteins with added disulfide bonds for extra stability.

The high end of the accuracy spectrum corresponds to models based on 50% sequence identity or more. The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution NMR structures (10 distance restraints per residue) [58]. The alignments on which these models are based generally contain almost no errors. In addition to the already listed applications, high quality models can be used for docking of small ligands [221] or whole proteins onto a given protein [224,230].

We now describe two applications of comparative modeling in more detail: (1) modeling of substrate specificity aided by a high accuracy model and (2) confirming a remote structural relationship based on a low accuracy model.

Figure 10 Models of complexes between BLBP and two different fatty acids. The fatty acid ligand is shown in the CPK representation. The small spheres in the ligand-binding cavity are water molecules. (a) Model of the BLBP–oleic acid complex, in which the cavity is not filled. (b) Model of the BLBP–docosahexaenoic acid complex, in which the cavity is filled. The figure was prepared using the program MOLSCRIPT [236].


A. Ligand Specificity of Brain Lipid-Binding Protein

Brain lipid-binding protein (BLBP) is a member of the family of fatty acid binding proteins that was isolated from brain [222]. The problem was to find out which one of the many fatty acids known to bind to fatty acid binding proteins in general is the likely physiological ligand of BLBP. To address this problem, comparative models of BLBP complexed with many fatty acids were calculated by relying on the structures of the adipocyte lipid-binding protein and muscle fatty acid binding protein, in complex with their ligands. The models were evaluated by binding and site-directed mutagenesis experiments [222]. The model of BLBP indicated that its binding cavity was just large enough to accommodate docosahexaenoic acid (DHA) (Fig. 10). Because DHA filled the BLBP binding cavity completely, it was unlikely that BLBP would bind a larger ligand. Thus, DHA was the ligand predicted to have the highest affinity for BLBP. The prediction was confirmed by the measurement of binding affinities for many fatty acids. It turned out that the BLBP–DHA interaction was the strongest fatty acid–protein interaction known to date. The binding affinities of the ligands correlated with the surface areas buried by the protein–ligand interactions, as calculated from the corresponding models, and explained why DHA had the highest affinity.

Figure 11 Confirming structural similarity between the E. coli δ′ subunit of DNA polymerase III and RuvB. (a) A sequence alignment between the δ′ subunit and RuvB. (b) ProsaII profiles for the X-ray structure of the δ′ subunit (thin continuous line), Z 11.0; a model of RuvB based on its alignment to the δ′ subunit (thick line), Z 7.3; and a test model based on an incorrect alignment (dashed line), Z 0.9. The RuvB model based on the correct alignment has a significant Z-score and only a few positive peaks in the profile. This indicates that the model is plausible and that RuvB is indeed related structurally to the E. coli δ′ subunit. (From Ref. 217.)

This case illustrates how a comparative model provides new information that cannot be deduced directly from the template structures despite their high (60%) sequence identity to BLBP. The two templates have smaller binding sites and consequently different patterns of binding affinities for the same set of ligands. The study also illustrated how new information is obtained relative to the target–template alignment even when the similarity between the target and the template sequences is high. The volumes and contact surfaces can be calculated only from a 3D model.

B. Finding Proteins Remotely Related to the E. coli δ′ Subunit

The structure of the δ′ subunit of the clamp–loader complex of E. coli DNA polymerase III was determined by X-ray crystallography [59]. Several biological considerations and extremely weak sequence patterns indicated that δ′ may be structurally related to the RuvB family of DNA helicases. However, the relationship could not be proved on the basis of the alignment of the corresponding sequences alone; the sequence identities ranged from only 9% to 21%. To substantiate the putative match, comparative models for several RuvB helicases were constructed using the crystal structure of the δ′ subunit as the template. The models were evaluated by calculating their PROSAII Z-scores and energy profiles [217] (Fig. 11). This evaluation indicated strongly that the model is plausible and that RuvB is indeed related structurally to the E. coli δ′ subunit.
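The sequence identities quoted above are derived from pairwise alignments. A minimal sketch of such a calculation follows; it assumes that identity is counted over the positions where both sequences have a residue, which is only one of several conventions in use, and the two aligned strings are toy sequences rather than the δ′ or RuvB sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over positions aligned without a gap.

    The denominator (ungapped aligned positions) is an assumption; other
    conventions divide by the length of the shorter sequence or of the alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip positions where either sequence has a gap
        aligned += 1
        if a.upper() == b.upper():
            matches += 1
    if aligned == 0:
        return 0.0
    return 100.0 * matches / aligned

# Toy aligned sequences (hypothetical): 6 identical residues out of 9 aligned positions.
identity = percent_identity("MKT-AYIAKQR", "MRTLAY-SKQK")
print(f"{identity:.1f}% identity")
```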

VII. COMPARATIVE MODELING IN STRUCTURAL GENOMICS

In a few years, the genome projects will have provided us with the amino acid sequences of more than a million proteins—the catalysts, inhibitors, messengers, receptors, transporters, and building blocks of the living organisms. The full potential of the genome projects will be realized only when we assign and understand the function of these new proteins. This will be facilitated by structural information for all or almost all proteins. This aim will be achieved by structural genomics, a focused, large-scale determination of protein structures by X-ray crystallography and nuclear magnetic resonance spectroscopy, combined efficiently with accurate, automated, and large-scale comparative protein structure modeling techniques [231]. Given current modeling techniques, it seems reasonable to require models based on at least 30% sequence identity, corresponding to one experimentally determined structure per sequence family rather than fold family. Since there are 1000–5000 fold families and perhaps about five times as many sequence families [16], the experimental effort in structural genomics has to deliver at least 10,000 protein domain structures.
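The estimate of the experimental effort follows from simple arithmetic, spelled out below. The variable names are ours; the numbers are the ones given in the text.

```python
# Range of fold families quoted in the text.
fold_families = (1000, 5000)
# "Perhaps about five times as many sequence families" [16].
sequence_families_per_fold = 5

# One experimentally determined structure is required per sequence family.
sequence_families = tuple(n * sequence_families_per_fold for n in fold_families)
print(f"structures needed: {sequence_families[0]} to {sequence_families[1]}")
```

The resulting range is 5000 to 25,000 structures, which brackets the figure of at least 10,000 protein domain structures quoted above.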

Figure 12 ModBase, a database of comparative protein structure models. Screenshots of the following ModBase panels are shown: A form for searching for the models of a given protein, summary of the search results, summary of the models of a given protein, details about a single model, alignment on which a given model was based, 3D model displayed by RASMOL [237], and a model evaluation by the ProsaII profile [217].

To enable the large-scale comparative modeling needed for structural genomics, the steps of comparative modeling are being assembled into a completely automated pipeline. Because many computer programs for performing each of the operations in comparative modeling already exist, it may seem trivial to construct a pipeline that completely automates the whole process. In fact, it is not easy to do so in a robust manner. For a good reason, most of the tasks in modeling of individual proteins, including template selection, alignment, and model evaluation, are typically performed with significant human intervention. This allows the use of the best tool for a particular problem at hand and consideration of many different sources of information that are difficult to take into account entirely automatically. Because large-scale modeling can be performed only in a completely automated manner, the main challenge is to build an automated and robust pipeline that approaches the performance of a human expert as much as possible.
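The structure of such a pipeline can be sketched as a chain of interchangeable stages. The sketch below is not the pipeline used by any particular server; the stage functions are hypothetical placeholders passed in by the caller, and the only point illustrated is the robustness requirement: a failure on one target is recorded and must not stop the run.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelingResult:
    target_id: str
    model: Optional[object] = None
    reliable: bool = False
    error: Optional[str] = None

def run_pipeline(targets, find_template, align, build_model, evaluate):
    """Run template selection, alignment, model building, and evaluation per target.

    `targets` maps a target identifier to its sequence. The four stage
    arguments are caller-supplied functions (hypothetical placeholders here).
    """
    results = []
    for target_id, sequence in targets.items():
        try:
            template = find_template(sequence)
            if template is None:
                results.append(ModelingResult(target_id, error="no template"))
                continue
            alignment = align(sequence, template)
            model = build_model(alignment, template)
            results.append(ModelingResult(target_id, model, evaluate(model)))
        except Exception as exc:  # record the failure and move on to the next target
            results.append(ModelingResult(target_id, error=str(exc)))
    return results

# Stub stages, for illustration only.
targets = {"protA": "MKTAYIAKQR", "protB": "MK"}
results = run_pipeline(
    targets,
    find_template=lambda seq: "tmpl1" if len(seq) >= 5 else None,
    align=lambda seq, tmpl: (seq, tmpl),
    build_model=lambda aln, tmpl: {"alignment": aln, "template": tmpl},
    evaluate=lambda model: True,
)
for r in results:
    print(r.target_id, "reliable" if r.reliable else f"failed: {r.error}")
```

Passing the stages as arguments keeps each one replaceable, which mirrors the point made above that the best tool differs from problem to problem.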

Two applications of comparative modeling to complete genomes have been described. For the sequences encoded in the E. coli genome, models were built for 10–15% of the proteins using the SWISS-MODEL web server [232,233]. Peitsch et al. have recently also modeled many proteins in SWISS-PROT and made the models available on their SWISS-MODEL web site (see Table 1). Another large-scale modeling study was our own modeling of five prokaryotic and eukaryotic genomes [9]. The calculation resulted in models for substantial segments of 17.2%, 18.1%, 19.2%, 20.4%, and 15.7% of all proteins in the genomes of Saccharomyces cerevisiae (6218 proteins in the genome), Escherichia coli (4290 proteins), Mycoplasma genitalium (468 proteins), Caenorhabditis elegans (7299 proteins, incomplete), and Methanococcus jannaschii (1735 proteins), respectively. An important feature of this study was an evaluation of all the models. This evaluation is important because most of the related protein pairs share less than 30% sequence identity, resulting in significant errors in the models. The models were assigned into the reliable or unreliable class by a procedure [9] that relies on the statistical potential function from PROSAII [217]. This allowed identification of those models that were likely to be based on correct templates and at least approximately correct alignments. As a result, 236 yeast proteins without any prior structural information were assigned to a particular fold family; 40 of these proteins did not have any prior functional annotation. The models were also evaluated more precisely by using a calibrated relationship between the model accuracy and the percentage sequence identity on which the model is based [9]. Almost half of the 1071 reliably modeled proteins in the yeast genome share more than approximately 35% sequence identity with their templates. All the alignments, models, and model evaluations are available in the ModBase database of comparative protein structure models (Fig. 12) [234]. Most recently, the combined use of PSI-BLAST [36] with the model building and a new model evaluation [9] allowed us to calculate reliable models for 50% of the proteins in the TrEMBL database (R. Sánchez, F. Mels, A. Šali, in preparation) [234].
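The percentages quoted for the five genomes can be converted into approximate numbers of modeled proteins, as in the short calculation below. The genome sizes and percentages are those given in the text; the rounding to whole proteins is ours, so the counts are approximate.

```python
# (proteins in genome, percentage with models for substantial segments), from the text.
genomes = {
    "Saccharomyces cerevisiae": (6218, 17.2),
    "Escherichia coli": (4290, 18.1),
    "Mycoplasma genitalium": (468, 19.2),
    "Caenorhabditis elegans": (7299, 20.4),
    "Methanococcus jannaschii": (1735, 15.7),
}

# Approximate number of modeled proteins per genome.
modeled = {
    name: round(n_proteins * percent / 100)
    for name, (n_proteins, percent) in genomes.items()
}
for name, count in modeled.items():
    print(f"{name}: about {count} proteins")
```

For yeast the calculation gives about 1069 proteins, consistent with the 1071 reliably modeled proteins quoted above once the rounding of the percentage to one decimal place is taken into account.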

 

Large-scale comparative modeling opens new opportunities for tackling existing problems by virtue of providing many protein models from many genomes. One example is the selection of a target protein for which a drug needs to be developed. A good choice is a protein that is likely to have high ligand specificity; specificity is important because specific drugs are less likely to be toxic. Large-scale modeling facilitates imposing the specificity filter in target selection by enabling a structural comparison of the ligand binding sites of many proteins, either human or from other organisms. Such comparisons may make it possible to select rationally a target whose binding site is structurally most different from the binding sites of all the other proteins that may potentially react with the same drug. For example, when a human pathogenic organism needs to be inhibited, a good target may be a protein whose binding site shape is different from related binding sites in all of the human proteins. Alternatively, when a human metabolic pathway needs to be regulated, the target identification could focus on that particular protein in the pathway that has the binding site most dissimilar from its human homologs.
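The specificity filter described above amounts to choosing the candidate whose binding site is farthest from its nearest off-target site. The sketch below shows that selection rule only. How a binding site is described and compared is left open by the text, so the two-number site descriptors, the Euclidean dissimilarity, and all names and values here are hypothetical.

```python
import math

def pick_most_distinct_target(candidates, off_targets, dissimilarity):
    """Return (name, score) for the candidate whose binding site is least like
    any off-target site, scored by its smallest dissimilarity to an off-target.
    """
    if not candidates or not off_targets:
        raise ValueError("need at least one candidate and one off-target site")
    best_name = None
    best_score = -math.inf
    for name, site in candidates.items():
        # A candidate is only as specific as its most similar off-target site.
        score = min(dissimilarity(site, other) for other in off_targets.values())
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Hypothetical site descriptors (for example, two shape features per site).
candidates = {"P1": (1.0, 0.0), "P2": (5.0, 5.0)}
off_targets = {"H1": (1.0, 1.0), "H2": (4.0, 0.0)}
best, score = pick_most_distinct_target(candidates, off_targets, math.dist)
print(f"selected target: {best} (nearest off-target at distance {score:.2f})")
```

Scoring by the minimum rather than the average dissimilarity reflects the requirement that the target differ from all of the other proteins, not merely from most of them.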