Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang

.pdf
Скачиваний:
61
Добавлен:
10.08.2013
Размер:
10.5 Mб
Скачать

136

A.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

B.

 

 

 

 

 

 

 

 

 

 

 

 

http://home.netscape.com/bookmark/4_/b/pcyellow.html

 

 

 

 

 

 

 

 

 

25

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

23

 

 

 

 

 

 

 

 

 

 

 

 

 

 

40

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Clusters

 

 

 

21

 

 

 

 

 

16

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

24

 

 

 

30

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

19

 

 

 

 

 

 

 

22

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

20

15

18

 

 

 

20

 

 

17

N

 

 

 

14

 

10

 

 

 

 

 

 

 

 

 

 

 

 

Segment 1

2

3

4

5

6

7

8

9

10 11

1213

 

Residues

(19-30)

(31-57)

(58-69)

(70-99)

(100-119)

(120-144)

(145-151)

(152-162)

(163-185) (186-203)

(204-225)

(226-237)

0

(1-18)

 

C.

 

 

 

 

 

 

 

 

 

 

 

D.

(Å)k

20

(Å)k

20

 

15

15

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10

 

10

Stella Veretnik and Ilya Shindyalov

0.1

0.2

0.3

0.4

 

 

r

 

5

 

 

 

5

200

 

 

100

200

300

400

100

300

400

 

n

 

 

 

n

 

 

Fig. 4.1 Domain decomposition method by Crippen. (A) Binary tree representation of hierarchical clustering of basic and intermediate unit for concanavalin A. (B) Histogram of number of occurrences N of clusters formed with contact density (data for 25 proteins). (C) Radius of gyration as a function of number of residues for 25 whole proteins. (D) Radius of gyration as a function of number of residues for each cluster for 25 proteins. Solid curves is the same as in (C), dotted curve is 1.2 times the solid curve.

size of the two units. The newly constructed cluster becomes a new unit and the process of pairing continues until only one single unit remains. Such clustering can be formalized using a binary tree, with basic units as leaves, intermediate nodes as clusters formed by two units at the level immediately below (Fig. 4.1A).

The root of the tree is the cluster that includes all of the residues. Intermediate clusters can be formed by joining groups of residues that are contiguous in sequence or alternatively by bringing two noncontiguous stretches together into a single cluster. A break fraction for a tree is defined as follows: zero indicates that all clusters in the tree consist of adjacent segments, while a break fraction of one indicates that all clusters in the tree are built from sequentially nonadjacent segments. In the real proteins break fraction is somewhere between 0 and 1, in myoglobin it is 0.262. Theoretically domains can be assigned at any level of clustering; the decision whether a given cluster in a tree constitutes a domain is defined by two parameters: contact density and radius of gyration k. Both parameters are determined by inspecting properties of the 25 protein structures. From a histogram of contact density/residue2

4. Domain Partitioning of Protein Structures

137

constructed for all 25 proteins, contact density between first and last assembled clusters varies significantly; however, the range of contact density is similar for all proteins. The threshold for contact density is chosen to be < 0.1 contact/residue2, which includes top one-third of strongest interacting clusters of the entire 300 clusters of 25 proteins (Fig. 4.1B). Similarly, the radius of gyration was calculated for each of the 25 complete proteins as well as for 300 clusters produced and plotted against the number of residues in the unit. Two curves—one formed by radius of gyration of complete proteins (Fig. 4.1C) and the other (produced by multiplying values of the first curve by 1.2)—bracket the range of gyration radii that are acceptable for domains (Fig. 4.1D). The domains are delineated by starting at the level of small clusters (close to the bottom level of the tree) and moving upwards. The status of each cluster is checked using parameters gnd k: domain status is achieved if both parameters are within an acceptable range. The resulting domain clusters correspond well to the commonly accepted concept of spatially compact structures.

The DomainParser method (Guo et al., 2003; Xu et al., 2000) is among the most recent methods published; it uses a top-down graph-theoretical approach for domain decomposition and an extensive postprocessing step. During the training stage of the algorithm multiple parameters are tuned; the method is then validated on a SCOP data set. Domain decomposition is addressed by modeling the protein structure as a network consisting of nodes (residues) and edges (connections between residues). A connection between any two residues is drawn when they are adjacent in the sequence or alternatively are in physical proximity in the structure. The strength of the interaction between two residues is expressed as the capacity of the edge to connect the two nodes. This edge capacity is a function of (a) the number of atom–atom contacts between residues, (b) the number of backbone contacts between residues, (c) the existence of backbone interactions across a -sheet, and (d) whether both residues belong to the same -strand. The values for all parameters involved in edge capacity are optimized during the training stage of the algorithm.

The partitioning of the network into two parts is then equivalent to decomposing a given structure into two domains. Ideally, partitioning should be done using the edges with least capacity, which will result in partitioning structure along least dense interactions among residues. The problem of partitioning the network is solved using maximum flow/minimum cut theorem by Ford-Fulkerson and implemented by Edmond and Karp. The gist of the approach is as follows: artificial source and sink node are added to the network (Fig. 4.2A). A “bottleneck”—a set of critical edges in the network flow—is found by gradually increasing the flow of all edges in a network. Removing the set of critical edges from the network prevents flow from the source to the sink. At this point nodes that are connected to the source represent one interconnected part of the network, while nodes connected to the sink are the second interconnected part of the network. Since the node capacity is increased gradually it is expected that nodes with least capacity (least residue–residue contacts) will be the ones contributing to the bottleneck. The process of subdividing the network into two parts is repeated multiple times by connecting the source and sink to a different part of the network; a set of minimal cuts is collected and evaluated during a postprocessing

138

Stella Veretnik and Ilya Shindyalov

A.

1

 

5

5

9

9

 

 

B.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

3

 

 

 

 

12

 

 

 

 

 

 

zero-order

 

 

8

 

 

 

 

2

 

 

80

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

second-order

 

 

 

 

 

 

7

 

 

 

 

60

 

 

 

 

second-order (misfolded)

 

2

 

4

6

10

4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

40

 

 

 

 

 

 

 

 

 

9

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

s

1

 

4

 

1

11

t

 

20

 

 

 

 

 

 

 

 

 

2

2

H(d)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

6

3

 

 

7

 

 

 

 

 

 

 

 

 

 

 

 

 

 

11

 

 

0

5

10

15

20

25

30

35

 

3

 

 

 

-20

 

 

 

3

 

 

 

8

 

1

4

 

 

3

 

-40

 

 

 

 

 

d (Angstrome)

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5

 

 

 

 

 

-60

 

 

 

 

 

 

 

 

 

4

 

8

4

12

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

-80

 

 

 

 

 

 

 

 

 

 

6

3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C.

 

 

 

 

 

 

25

 

Percentage

 

20

 

 

5

 

 

 

15

 

 

 

10

 

 

 

0

 

 

 

0.2

 

F.

 

100

Percentage

80

60

 

 

40

20

0

0

Correct cut overcut

Percentage

0.4

0.6

0.8

1.0

1.2

 

Cd

 

 

 

Correct cut overcut

Frequency

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

2

 

3

4

5

Number of Segments

 

D.

25

 

 

 

 

Correct cut

20

 

overcut

15

 

 

10

 

 

5

 

 

0

0.0 0.2 0.4 0.6 0.8 1.0 1.2

 

 

 

Id

 

G.

 

 

 

25

 

 

 

20

 

 

Correct cut

 

 

overcut

15

 

 

 

10

 

 

 

5

 

 

 

0

0100 200 300 400 500 600 Domain Size

Percentage

 

 

E.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

18

 

 

 

 

 

 

 

 

 

 

 

16

 

 

 

 

 

 

 

Correct cut

 

14

 

 

 

 

 

 

 

overcut

 

 

12

 

 

 

 

 

 

 

 

 

 

 

10

 

 

 

 

 

 

 

 

 

 

 

8

 

 

 

 

 

 

 

 

 

 

 

6

 

 

 

 

 

 

 

 

 

 

 

4

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

0

 

 

0.2

0.4

0.6

0.8

1.0

1.2

0.0

Td

H.

Md

Rt

Od

Sd

Fd

Cd

Ud

Td

Id

 

 

input

hidden

output [0-1]

Fig. 4.2 Domain decomposition using DomainParser algorithm. (A) Schematic representation of protein structure a directed flow network. Value on each edge represents the edge’s capacity. Artificial nodes “start” and “sink” are denoted by s and t, respectively. (B–G) Examples of parameters used by the algorithms: (B) zeroand second-order spherical moment profile of structure 2ilb;

(C) compactness of structure; (D) size of domain interface relative to domain’s volume; (E) measure of relative motion between domains; (F) distribution of the number of segments per domain;

(G) distribution of domain sizes; (H) neural network architecture for evaluation of decomposed individual domains.

step. The entire procedure is then repeated in each of the resulting domains, until either domain’s size drops below 80 residues or the partitioning produces domains that do not meet necessary conditions of domain definitions.

The stopping criteria are multifaceted and defined by (1) domain size: no less than 35 residues, (2) -sheets kept intact, (3) compactness of domain above threshold gm , (4) size of the domain–domain interface below threshold fm , (5) ratio of number of residues and number of segments in domain is above threshold ls . The values for gm , fm , ls , and minimum domain size are determined during the training stage of the algorithm.

4. Domain Partitioning of Protein Structures

139

A suite of additional parameters exists; these are involved in a post-processing step of the algorithm, in which an assessment is made whether the substructure meets the additional criteria of a structural domain. These parameters are (1) hydrophobic moment (Fig. 4.2B), (2) the number of segments in the partitioned domain (Fig. 4.2F), (3) compactness (Fig. 4.2C), (4) the size of domain interface relative to domain’s volume (Fig. 4.2D), (5) relative motion between compact domains (Fig. 4.2E). Distribution for each of the parameters for true versus false domains is collected during the training stage using 633 correctly partitioned domains and 928 incorrectly partitioned domains. Multiple neural networks are then investigated; the best one has 9 input nodes, 6 nodes in the hidden layer, and 1 output node (Fig. 4.2F). Performance of the DomainParser method is then evaluated using set of 1317 protein chains in which domains are defined by SCOP.

4.5Evaluating Automatic Methods with Manual Consensus Benchmark

Why are there so many different automatic methods for domain decomposition? A chief reason is the complexity of the problem itself: it is nearly impossible to capture succinctly the principles of domain decomposition and apply them successfully to the entire universe of protein structures. Thus, every new method strives to reach a bit further beyond existing methods in its ability to decompose complex structures. With so many different methods available, it is essential to be able to compare the performance of the algorithms and to determine what fraction of known structures any given method predicts correctly. It is equally important to know what are the strengths and weaknesses of each algorithm, in particular what types of structures are difficult for a given method. Evaluation of automatic methods is an essential part of the algorithm development process; in fact, the performance of each method is typically reported along with the algorithm. However, each method uses its own data set for evaluation of the algorithm, thus it is nearly impossible to compare the performance of the algorithms to each other. An exception to this is a set of 55 chains (Jones et al., 1998) which is frequently used to test the performance. However, this is a very small data set, thus it is likely that its resolution will be insufficient to detect differences between methods. This situation was partially rectified recently, with construction of a large comprehensive benchmark data set developed specifically for evaluation of domain decomposition algorithms (Veretnik et al., 2004). The data set is assembled using a principle of consensus approach among expert methods: it includes proteins for which three expert methods (CATH, SCOP, and AUTHORS) produce similar domain decomposition. There are a total of 374 proteins in this benchmark, which was used to evaluate three recent automatic methods: PUU, DomainParser, and PDP (Fig. 4.3A).

The evaluation includes information about the success rate of each algorithm, analysis of errors in terms of predicting fewer domains (undercut) or too many domains (overcut). The analysis further looks into the tendencies of the methods

Fig. 4.3 Benchmarking of automatic domain assignment methods. (A) Performance of DomainParser, PDP, and PUU on consensus-based benchmark of 374 structures. (B) Evaluating tendency to partion domains into noncontiguous fragments.

4. Domain Partitioning of Protein Structures

141

Fig. 4.4 Examples of incorrectly assigned structures by DomainParser, PDP, and PUU.

to fragment domains into noncontiguous stretches of polypeptide chain (Fig. 4.3B). A very important part of this evaluation process is the systematic analysis of what types of structures are difficult for each automatic method and what types of errors are typical for each algorithm. This analysis reveals that the PUU method tends to produce very compact domains which consist of many discontinuous fragments. Two methods, PUU and PDP, tend to overcut protein structures by continuing to partition actual domains; PUU exhibits more frequent and severe cases of overcuts (Fig. 4.4). The DomainParser method, on the other hand, tends to undercut structures—it produces fewer domains than human experts. This appears to be related to splitting structures with -sheets close to the domain interface—a tricky

142

Stella Veretnik and Ilya Shindyalov

Fig. 4.5 Distribution of singleand multi-domain proteins in PDB. Domains are defined using CATH method. The distribution by domain number is given separately for archaea, bacteria, and eukaryotes as well as to all structures in PDB.

issue (there are several parameters dealing with -structure in the DomainParser algorithm). A similar problem also exists for the PUU method, which frequently fails to cut -structures, even though it tends to cut structures too much in other cases (PUU too has a parameter dealing with -structure partitioning).

Another serious issue that hurts development of new methods is a severe bias of proteins in PDB toward one-domain proteins: nearly 70% of structures are single domains (Fig. 4.5). This overrepresentation of simple structures in PDB is due to the difficulties associated with obtaining the crystal structures of complex multidomain chains. This poverty of complex structures cripples the ability of computational methods to infer principles of decomposition of multidomain proteins. A new more comprehensive benchmarking data set, which covers many more architectures and topologies and includes a larger fraction of multidomain structures, has been recently published (Holland et al., 2006).

4.6 Future Goals

While benchmarking data sets are invaluable in cross-comparing methods as well as aiding in understanding the weaknesses of current methods, the very principles underlying benchmark design requires consensus among human experts. Thus, the most difficult and contentious cases of architectures are not even addressed by the above benchmark. The very existence of the architectures for which multiple

4. Domain Partitioning of Protein Structures

143

plausible domain decompositions exist refutes our simpleminded tendency to fit one approach for partitioning to all structures. As more protein structures are solved, the fraction of such “controversial” proteins is likely to increase. The best way to address this inherent complexity of the protein structures might be to accept the possibility of alternative domain decompositions and implement this feature in new algorithms. One of the latest algorithms has such a capacity already (Berezovsky, 2003), which simply uses a series of thresholds instead of a single threshold during structure decomposition. In general it appears that the main difficulty the algorithms have is proceeding with partitioning too far or not far enough, rather than making partitions in incorrect regions of the structure. This situation can be possibly remedied by performing domain decomposition under multiple thresholds. Allowing multiple thresholds is likely to produce single solutions for simple structures and multiple solutions in the cases of complex architectures. The introduction of multiple thresholds into existing algorithms should be relatively simple. The future success of algorithms for domain decomposition may require a shift in our thinking about what constitutes a good solution for this complex problem; this is likely to involve considering alternative decomposition scenarios as an essential part of the solution.

4.7 Summary

Domain decomposition of 3D structures is an important and not completely solved problem. An astonishingly wide array of approaches had been implemented in an attempt to automate this process. Currently, the best algorithms resolve more than 80% of the structures for which human experts reach consensus for domain partitioning. As more complex structures are solved, we do not expect this success rate to increase dramatically. It would be timely to reevaluate our approach to the problem of domain decomposition and our expectation of reaching a single solution in every case. Rather it would be constructive to accept the fact that there are multiple legitimate decomposition schemes for complex architectures and adapt future algorithms to deal with such a possibility.

4.8 Suggested Further Reading

For early fundamental works on protein domain definitions there are seminal papers by Wetlaufer and Ristow (1973) and Richardson (1985). For a recent comprehensive review on structural domains the paper by Ponting and Russell (2002) is recommended. Evolution of domains is discussed well in Ponting et al. (2000) and Todd et al. (2001). The thorny topic of correspondence between protein domains and exons in genes is extensively discussed; a good start is the book Protein Evolution (Patthy, 1999). Finally, discussion of the protein universe through the lens of structural domains can be found in Holm and Sander (1996) and Orengo et al. (1994).

144

Stella Veretnik and Ilya Shindyalov

References

Alexandrov, N., and Shindyalov, I. 2003. PDP: Protein domain parser. Bioinformatics. 19:429–430.

Berezovsky, I. N. 2003. Discrete structure of van der Waals domains in globular proteins. Protein Eng. 16:161–167.

Crippen, G. M. 1978. The tree structural organization of proteins. J. Mol. Biol. 126:315–332.

Doolittle, R. F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287–314.

Guo, J. T., Xu, D., Kim, D., and Xu, Y. 2003. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res. 31:944–952.

Holland, T. A., Veretnik, S., Shindyalov, I. N., and Bourne, P. E. 2006. Partitioning protein structure into domains: Why is it so difficult? J. Mol. Biol. 361:562– 590.

Holm, L., and Sander, C. 1994. Parser for protein folding units. Proteins 19, 256–268. Holm, L., and Sander, C. 1996. Mapping the protein universe. Science 273:595–603. Islam, S. A., Luo, J., and Sternberg, M. J. 1995. Identification and analysis of domains

in proteins. Protein Eng. 8:513–525.

Jones, S., Stewart, M., Michie, A., Swindells, M. B., Orengo, C., and Thornton, J. M. 1998. Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 7:233–242.

Kundu, S., Sorensen, D. C., and Phillips, G. N., Jr. 2004. Automatic domain decomposition of proteins by a Gaussian network model. Proteins 57:725–733.

Levitt, M., and Greer, J. 1977. Automatic identification of secondary structure in globular proteins. J. Mol. Biol. 114:181–239.

Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536–540.

Orengo, C. A., Jones, D. T., and Thornton, J. M. 1994. Protein superfamilies and domain superfolds. Nature 372:631–634.

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M. 1997. CATH—A hierarchic classification of protein domain structures. Structure 5:1093–1108.

Patthy, L. 1999. Protein Evolution. Oxford, Blackwell Science.

Ponting, C. P., and Russell, R. R. 2002. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31:45–71.

Ponting, C. P., Schultz, J., Copley, R. R., Andrade, M. A., and Bork, P. 2000. Evolution of domain families. Adv. Protein. Chem. 54:185–244.

Richardson, J. S. 1985. Describing patterns of protein tertiary structure. Methods Enzymol. 115:341–358.

4. Domain Partitioning of Protein Structures

145

Rose, G. D. 1979. Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134:447–470.

Rossman, M. G., and Liljas, A. 1974. Letter: Recognition of structural domains in globular proteins. J. Mol. Biol. 85:177–181.

Siddiqui, A. S., and Barton, G. J. 1995. Continuous and discontinuous domains: An algorithm for the automatic generation of reliable protein domain definitions.

Protein Sci. 4:872–884.

Sowdhamini, R., and Blundell, T. L. 1995. An automatic method involving cluster analysis of secondary structures for the identification of domains in proteins.

Protein Sci. 4:506–520.

Swindells, M. B. 1995a. A procedure for detecting structural domains in proteins.

Protein Sci. 4:103–112.

Swindells, M. B. 1995b. A procedure for the automatic determination of hydrophobic cores in protein structures. Protein Sci. 4:93–102.

Taylor, W. R. 1999. Protein structural domain identification. Protein Eng. 12:203– 216.

Todd, A. E., Orengo, C. A., and Thornton, J. M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307:1113–1143.

Veretnik, S., Bourne, P. E., Alexandrov, N. N., and Shindyalov, I. N. 2004. Toward consistent assignment of structural domains in proteins. J. Mol. Biol. 339:647– 678.

Wernisch, L., Hunting, M., and Wodak, S. J. 1999. Identification of structural domains in proteins by a graph heuristic. Proteins 35:338–352.

Wernisch, L., and Wodak, S. J. 2003. Identifying structural domains in proteins.

Methods Biochem. Anal. 44:365–385.

Wetlaufer, D. B. 1973. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl. Acad. Sci. USA 70:697–701.

Wetlaufer, D. B., and Ristow, S. 1973. Acquisition of three-dimensional structure of proteins. Annu. Rev. Biochem. 42:135–158.

Wodak, S. J., and Janin, J. 1981. Location of structural domains in protein. Biochemistry 20:6544–6552.

Xu, Y., Xu, D., and Gabow, H. N. 2000. Protein domain decomposition using a graph-theoretic approach. Bioinformatics 16:1091–1104.

Xuan, Z. Y., Ling, L. J., and Chen, R. S. 2000. A new method for protein domain recognition. Eur. Biophys. J. 29:7–16.