Добавил:

fench Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Сумский государственный университет

Предмет:

Биомеханика

Файл:

Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang

.pdf

Скачиваний:

Добавлен:

10.08.2013

Размер:

10.5 Mб

Скачать

☆

<<< < Предыдущая 4 5 6 7 8 9 10 11 12 13 14 1516 / 4116 17 18 19 20 21 22 23 24 25 26 27 28 > Следующая >>>

136

A.									B.
					http://home.netscape.com/bookmark/4_/b/pcyellow.html
				25
	23							40

Clusters		21			16
		21

						24		30
						24

			19				22
			19				22

												20
15		18					20			17		N
15		18					14			17		10
												10
Segment 1	2	3	4	5	6	7	8	9	10 11	1213
Residues	(19-30)	(31-57)	(58-69)	(70-99)	(100-119)	(120-144)	(145-151)	(152-162)	(163-185) (186-203)	(204-225)	(226-237)	0
(1-18)	(19-30)	(31-57)	(58-69)	(70-99)	(100-119)	(120-144)	(145-151)	(152-162)	(163-185) (186-203)	(204-225)	(226-237)
C.												D.

(Å)k	20		(Å)k	20
(Å)k		15	(Å)k	15
		15


	10			10

Stella Veretnik and Ilya Shindyalov

0.1	0.2	0.3	0.4
		r

5				5	200
100	200	300	400	100	200	300	400
	n				n

Fig. 4.1 Domain decomposition method by Crippen. (A) Binary tree representation of hierarchical clustering of basic and intermediate unit for concanavalin A. (B) Histogram of number of occurrences N of clusters formed with contact density (data for 25 proteins). (C) Radius of gyration as a function of number of residues for 25 whole proteins. (D) Radius of gyration as a function of number of residues for each cluster for 25 proteins. Solid curves is the same as in (C), dotted curve is 1.2 times the solid curve.

size of the two units. The newly constructed cluster becomes a new unit and the process of pairing continues until only one single unit remains. Such clustering can be formalized using a binary tree, with basic units as leaves, intermediate nodes as clusters formed by two units at the level immediately below (Fig. 4.1A).

The root of the tree is the cluster that includes all of the residues. Intermediate clusters can be formed by joining groups of residues that are contiguous in sequence or alternatively by bringing two noncontiguous stretches together into a single cluster. A break fraction for a tree is deﬁned as follows: zero indicates that all clusters in the tree consist of adjacent segments, while a break fraction of one indicates that all clusters in the tree are built from sequentially nonadjacent segments. In the real proteins break fraction is somewhere between 0 and 1, in myoglobin it is 0.262. Theoretically domains can be assigned at any level of clustering; the decision whether a given cluster in a tree constitutes a domain is deﬁned by two parameters: contact density and radius of gyration k. Both parameters are determined by inspecting properties of the 25 protein structures. From a histogram of contact density/residue2

4. Domain Partitioning of Protein Structures

137

constructed for all 25 proteins, contact density between ﬁrst and last assembled clusters varies signiﬁcantly; however, the range of contact density is similar for all proteins. The threshold for contact density is chosen to be < 0.1 contact/residue2, which includes top one-third of strongest interacting clusters of the entire 300 clusters of 25 proteins (Fig. 4.1B). Similarly, the radius of gyration was calculated for each of the 25 complete proteins as well as for 300 clusters produced and plotted against the number of residues in the unit. Two curves—one formed by radius of gyration of complete proteins (Fig. 4.1C) and the other (produced by multiplying values of the ﬁrst curve by 1.2)—bracket the range of gyration radii that are acceptable for domains (Fig. 4.1D). The domains are delineated by starting at the level of small clusters (close to the bottom level of the tree) and moving upwards. The status of each cluster is checked using parameters gnd k: domain status is achieved if both parameters are within an acceptable range. The resulting domain clusters correspond well to the commonly accepted concept of spatially compact structures.

The DomainParser method (Guo et al., 2003; Xu et al., 2000) is among the most recent methods published; it uses a top-down graph-theoretical approach for domain decomposition and an extensive postprocessing step. During the training stage of the algorithm multiple parameters are tuned; the method is then validated on a SCOP data set. Domain decomposition is addressed by modeling the protein structure as a network consisting of nodes (residues) and edges (connections between residues). A connection between any two residues is drawn when they are adjacent in the sequence or alternatively are in physical proximity in the structure. The strength of the interaction between two residues is expressed as the capacity of the edge to connect the two nodes. This edge capacity is a function of (a) the number of atom–atom contacts between residues, (b) the number of backbone contacts between residues, (c) the existence of backbone interactions across a -sheet, and (d) whether both residues belong to the same -strand. The values for all parameters involved in edge capacity are optimized during the training stage of the algorithm.

The partitioning of the network into two parts is then equivalent to decomposing a given structure into two domains. Ideally, partitioning should be done using the edges with least capacity, which will result in partitioning structure along least dense interactions among residues. The problem of partitioning the network is solved using maximum ﬂow/minimum cut theorem by Ford-Fulkerson and implemented by Edmond and Karp. The gist of the approach is as follows: artiﬁcial source and sink node are added to the network (Fig. 4.2A). A “bottleneck”—a set of critical edges in the network ﬂow—is found by gradually increasing the ﬂow of all edges in a network. Removing the set of critical edges from the network prevents ﬂow from the source to the sink. At this point nodes that are connected to the source represent one interconnected part of the network, while nodes connected to the sink are the second interconnected part of the network. Since the node capacity is increased gradually it is expected that nodes with least capacity (least residue–residue contacts) will be the ones contributing to the bottleneck. The process of subdividing the network into two parts is repeated multiple times by connecting the source and sink to a different part of the network; a set of minimal cuts is collected and evaluated during a postprocessing

138	Stella Veretnik and Ilya Shindyalov

A.	1		5	5	9	9			B.
A.	1			5		9			B.
		2	3					12							zero-order
8		2				2		12		80					zero-order
8				2		2				80					second-order
				2	7					60					second-order (misfolded)
	2		4	6	7	10	4								second-order (misfolded)

										40
	9									40
s	9	1		4		1	11	t		20
s		1	2	4	2	1	11	t	H(d)	20
			2							0
6	3			7
						11					0	5	10	15	20	25	30	35
			3							-20
					3					-20
8		1		4	3			3		-40						d (Angstrome)
8						2		3

			5							-60
	4		5	8	4	12				-60
					4					-80
			6		3

	C.

		25
Percentage		20
Percentage		5
		15
		10
		0
		0.2
	F.
	100
Percentage	80
Percentage	60
	60
	40

Correct cut overcut

Percentage

0.4	0.6	0.8	1.0	1.2
	Cd

Correct cut overcut

Frequency




1	2	3	4	5
Number of Segments

25
		Correct cut
20		overcut
15
10
5

0.0 0.2 0.4 0.6 0.8 1.0 1.2

		Id
	G.
25
20			Correct cut
20			overcut
15
10
5

0100 200 300 400 500 600 Domain Size

Percentage

	E.

18
16						Correct cut
14						overcut
12
10
8
6
4
2
0		0.2	0.4	0.6	0.8		1.0	1.2
0.0		0.2	0.4	0.6	0.8		1.0	1.2

Id
input	hidden	output [0-1]

Fig. 4.2 Domain decomposition using DomainParser algorithm. (A) Schematic representation of protein structure a directed ﬂow network. Value on each edge represents the edge’s capacity. Artiﬁcial nodes “start” and “sink” are denoted by s and t, respectively. (B–G) Examples of parameters used by the algorithms: (B) zeroand second-order spherical moment proﬁle of structure 2ilb;

(C) compactness of structure; (D) size of domain interface relative to domain’s volume; (E) measure of relative motion between domains; (F) distribution of the number of segments per domain;

(G) distribution of domain sizes; (H) neural network architecture for evaluation of decomposed individual domains.

step. The entire procedure is then repeated in each of the resulting domains, until either domain’s size drops below 80 residues or the partitioning produces domains that do not meet necessary conditions of domain deﬁnitions.

The stopping criteria are multifaceted and deﬁned by (1) domain size: no less than 35 residues, (2) -sheets kept intact, (3) compactness of domain above threshold gm , (4) size of the domain–domain interface below threshold fm , (5) ratio of number of residues and number of segments in domain is above threshold ls . The values for gm , fm , ls , and minimum domain size are determined during the training stage of the algorithm.

4. Domain Partitioning of Protein Structures

139

A suite of additional parameters exists; these are involved in a post-processing step of the algorithm, in which an assessment is made whether the substructure meets the additional criteria of a structural domain. These parameters are (1) hydrophobic moment (Fig. 4.2B), (2) the number of segments in the partitioned domain (Fig. 4.2F), (3) compactness (Fig. 4.2C), (4) the size of domain interface relative to domain’s volume (Fig. 4.2D), (5) relative motion between compact domains (Fig. 4.2E). Distribution for each of the parameters for true versus false domains is collected during the training stage using 633 correctly partitioned domains and 928 incorrectly partitioned domains. Multiple neural networks are then investigated; the best one has 9 input nodes, 6 nodes in the hidden layer, and 1 output node (Fig. 4.2F). Performance of the DomainParser method is then evaluated using set of 1317 protein chains in which domains are deﬁned by SCOP.

4.5Evaluating Automatic Methods with Manual Consensus Benchmark

Why are there so many different automatic methods for domain decomposition? A chief reason is the complexity of the problem itself: it is nearly impossible to capture succinctly the principles of domain decomposition and apply them successfully to the entire universe of protein structures. Thus, every new method strives to reach a bit further beyond existing methods in its ability to decompose complex structures. With so many different methods available, it is essential to be able to compare the performance of the algorithms and to determine what fraction of known structures any given method predicts correctly. It is equally important to know what are the strengths and weaknesses of each algorithm, in particular what types of structures are difﬁcult for a given method. Evaluation of automatic methods is an essential part of the algorithm development process; in fact, the performance of each method is typically reported along with the algorithm. However, each method uses its own data set for evaluation of the algorithm, thus it is nearly impossible to compare the performance of the algorithms to each other. An exception to this is a set of 55 chains (Jones et al., 1998) which is frequently used to test the performance. However, this is a very small data set, thus it is likely that its resolution will be insufﬁcient to detect differences between methods. This situation was partially rectiﬁed recently, with construction of a large comprehensive benchmark data set developed specifically for evaluation of domain decomposition algorithms (Veretnik et al., 2004). The data set is assembled using a principle of consensus approach among expert methods: it includes proteins for which three expert methods (CATH, SCOP, and AUTHORS) produce similar domain decomposition. There are a total of 374 proteins in this benchmark, which was used to evaluate three recent automatic methods: PUU, DomainParser, and PDP (Fig. 4.3A).

The evaluation includes information about the success rate of each algorithm, analysis of errors in terms of predicting fewer domains (undercut) or too many domains (overcut). The analysis further looks into the tendencies of the methods

Fig. 4.3 Benchmarking of automatic domain assignment methods. (A) Performance of DomainParser, PDP, and PUU on consensus-based benchmark of 374 structures. (B) Evaluating tendency to partion domains into noncontiguous fragments.

4. Domain Partitioning of Protein Structures

141

Fig. 4.4 Examples of incorrectly assigned structures by DomainParser, PDP, and PUU.

to fragment domains into noncontiguous stretches of polypeptide chain (Fig. 4.3B). A very important part of this evaluation process is the systematic analysis of what types of structures are difﬁcult for each automatic method and what types of errors are typical for each algorithm. This analysis reveals that the PUU method tends to produce very compact domains which consist of many discontinuous fragments. Two methods, PUU and PDP, tend to overcut protein structures by continuing to partition actual domains; PUU exhibits more frequent and severe cases of overcuts (Fig. 4.4). The DomainParser method, on the other hand, tends to undercut structures—it produces fewer domains than human experts. This appears to be related to splitting structures with -sheets close to the domain interface—a tricky

142	Stella Veretnik and Ilya Shindyalov

Fig. 4.5 Distribution of singleand multi-domain proteins in PDB. Domains are deﬁned using CATH method. The distribution by domain number is given separately for archaea, bacteria, and eukaryotes as well as to all structures in PDB.

issue (there are several parameters dealing with -structure in the DomainParser algorithm). A similar problem also exists for the PUU method, which frequently fails to cut -structures, even though it tends to cut structures too much in other cases (PUU too has a parameter dealing with -structure partitioning).

Another serious issue that hurts development of new methods is a severe bias of proteins in PDB toward one-domain proteins: nearly 70% of structures are single domains (Fig. 4.5). This overrepresentation of simple structures in PDB is due to the difﬁculties associated with obtaining the crystal structures of complex multidomain chains. This poverty of complex structures cripples the ability of computational methods to infer principles of decomposition of multidomain proteins. A new more comprehensive benchmarking data set, which covers many more architectures and topologies and includes a larger fraction of multidomain structures, has been recently published (Holland et al., 2006).

4.6 Future Goals

While benchmarking data sets are invaluable in cross-comparing methods as well as aiding in understanding the weaknesses of current methods, the very principles underlying benchmark design requires consensus among human experts. Thus, the most difﬁcult and contentious cases of architectures are not even addressed by the above benchmark. The very existence of the architectures for which multiple

4. Domain Partitioning of Protein Structures

143

plausible domain decompositions exist refutes our simpleminded tendency to ﬁt one approach for partitioning to all structures. As more protein structures are solved, the fraction of such “controversial” proteins is likely to increase. The best way to address this inherent complexity of the protein structures might be to accept the possibility of alternative domain decompositions and implement this feature in new algorithms. One of the latest algorithms has such a capacity already (Berezovsky, 2003), which simply uses a series of thresholds instead of a single threshold during structure decomposition. In general it appears that the main difﬁculty the algorithms have is proceeding with partitioning too far or not far enough, rather than making partitions in incorrect regions of the structure. This situation can be possibly remedied by performing domain decomposition under multiple thresholds. Allowing multiple thresholds is likely to produce single solutions for simple structures and multiple solutions in the cases of complex architectures. The introduction of multiple thresholds into existing algorithms should be relatively simple. The future success of algorithms for domain decomposition may require a shift in our thinking about what constitutes a good solution for this complex problem; this is likely to involve considering alternative decomposition scenarios as an essential part of the solution.

4.7 Summary

Domain decomposition of 3D structures is an important and not completely solved problem. An astonishingly wide array of approaches had been implemented in an attempt to automate this process. Currently, the best algorithms resolve more than 80% of the structures for which human experts reach consensus for domain partitioning. As more complex structures are solved, we do not expect this success rate to increase dramatically. It would be timely to reevaluate our approach to the problem of domain decomposition and our expectation of reaching a single solution in every case. Rather it would be constructive to accept the fact that there are multiple legitimate decomposition schemes for complex architectures and adapt future algorithms to deal with such a possibility.

4.8 Suggested Further Reading

For early fundamental works on protein domain deﬁnitions there are seminal papers by Wetlaufer and Ristow (1973) and Richardson (1985). For a recent comprehensive review on structural domains the paper by Ponting and Russell (2002) is recommended. Evolution of domains is discussed well in Ponting et al. (2000) and Todd et al. (2001). The thorny topic of correspondence between protein domains and exons in genes is extensively discussed; a good start is the book Protein Evolution (Patthy, 1999). Finally, discussion of the protein universe through the lens of structural domains can be found in Holm and Sander (1996) and Orengo et al. (1994).

144	Stella Veretnik and Ilya Shindyalov

References

Alexandrov, N., and Shindyalov, I. 2003. PDP: Protein domain parser. Bioinformatics. 19:429–430.

Berezovsky, I. N. 2003. Discrete structure of van der Waals domains in globular proteins. Protein Eng. 16:161–167.

Crippen, G. M. 1978. The tree structural organization of proteins. J. Mol. Biol. 126:315–332.

Doolittle, R. F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287–314.

Guo, J. T., Xu, D., Kim, D., and Xu, Y. 2003. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res. 31:944–952.

Holland, T. A., Veretnik, S., Shindyalov, I. N., and Bourne, P. E. 2006. Partitioning protein structure into domains: Why is it so difﬁcult? J. Mol. Biol. 361:562– 590.

Holm, L., and Sander, C. 1994. Parser for protein folding units. Proteins 19, 256–268. Holm, L., and Sander, C. 1996. Mapping the protein universe. Science 273:595–603. Islam, S. A., Luo, J., and Sternberg, M. J. 1995. Identiﬁcation and analysis of domains

in proteins. Protein Eng. 8:513–525.

Jones, S., Stewart, M., Michie, A., Swindells, M. B., Orengo, C., and Thornton, J. M. 1998. Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 7:233–242.

Kundu, S., Sorensen, D. C., and Phillips, G. N., Jr. 2004. Automatic domain decomposition of proteins by a Gaussian network model. Proteins 57:725–733.

Levitt, M., and Greer, J. 1977. Automatic identiﬁcation of secondary structure in globular proteins. J. Mol. Biol. 114:181–239.

Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classiﬁcation of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536–540.

Orengo, C. A., Jones, D. T., and Thornton, J. M. 1994. Protein superfamilies and domain superfolds. Nature 372:631–634.

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., and Thornton, J. M. 1997. CATH—A hierarchic classiﬁcation of protein domain structures. Structure 5:1093–1108.

Patthy, L. 1999. Protein Evolution. Oxford, Blackwell Science.

Ponting, C. P., and Russell, R. R. 2002. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31:45–71.

Ponting, C. P., Schultz, J., Copley, R. R., Andrade, M. A., and Bork, P. 2000. Evolution of domain families. Adv. Protein. Chem. 54:185–244.

Richardson, J. S. 1985. Describing patterns of protein tertiary structure. Methods Enzymol. 115:341–358.

4. Domain Partitioning of Protein Structures

145

Rose, G. D. 1979. Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134:447–470.

Rossman, M. G., and Liljas, A. 1974. Letter: Recognition of structural domains in globular proteins. J. Mol. Biol. 85:177–181.

Siddiqui, A. S., and Barton, G. J. 1995. Continuous and discontinuous domains: An algorithm for the automatic generation of reliable protein domain deﬁnitions.

Protein Sci. 4:872–884.

Sowdhamini, R., and Blundell, T. L. 1995. An automatic method involving cluster analysis of secondary structures for the identiﬁcation of domains in proteins.

Protein Sci. 4:506–520.

Swindells, M. B. 1995a. A procedure for detecting structural domains in proteins.

Protein Sci. 4:103–112.

Swindells, M. B. 1995b. A procedure for the automatic determination of hydrophobic cores in protein structures. Protein Sci. 4:93–102.

Taylor, W. R. 1999. Protein structural domain identiﬁcation. Protein Eng. 12:203– 216.

Todd, A. E., Orengo, C. A., and Thornton, J. M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307:1113–1143.

Veretnik, S., Bourne, P. E., Alexandrov, N. N., and Shindyalov, I. N. 2004. Toward consistent assignment of structural domains in proteins. J. Mol. Biol. 339:647– 678.

Wernisch, L., Hunting, M., and Wodak, S. J. 1999. Identiﬁcation of structural domains in proteins by a graph heuristic. Proteins 35:338–352.

Wernisch, L., and Wodak, S. J. 2003. Identifying structural domains in proteins.

Methods Biochem. Anal. 44:365–385.

Wetlaufer, D. B. 1973. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl. Acad. Sci. USA 70:697–701.

Wetlaufer, D. B., and Ristow, S. 1973. Acquisition of three-dimensional structure of proteins. Annu. Rev. Biochem. 42:135–158.

Wodak, S. J., and Janin, J. 1981. Location of structural domains in protein. Biochemistry 20:6544–6552.

Xu, Y., Xu, D., and Gabow, H. N. 2000. Protein domain decomposition using a graph-theoretic approach. Bioinformatics 16:1091–1104.

Xuan, Z. Y., Ling, L. J., and Chen, R. S. 2000. A new method for protein domain recognition. Eur. Biophys. J. 29:7–16.

<<< < Предыдущая 4 5 6 7 8 9 10 11 12 13 14 1516 / 4116 17 18 19 20 21 22 23 24 25 26 27 28 > Следующая >>>

Соседние файлы в предмете Биомеханика

#
10.08.201314.05 Mб69Biomolecular Sensing Processing and Analysis - Rashid Bashir and Steve Wereley.pdf
#
10.08.201320.53 Mб86Bioreaction Engineering Principles - Jens Nielsen.pdf
#
10.08.201326.55 Mб119Bioregenerative Engineering Principles and Applications - Shu Q. Liu..pdf
#
10.08.20134.43 Mб404Biosignal and Biomedical Image Processing MATLAB based Applications - John L. Semmlow.pdf
#
10.08.20133.76 Mб65Biotechnology for Biomedical Engineers - Martin L. Yarmush et al.pdf
#
10.08.201310.5 Mб61Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang.pdf
#
10.08.201330.78 Mб43CRC Press - Biomedical Photonics Handbook.pdf
#
10.08.20134.33 Mб60Cytoskeletal Mechanics - Mofrad and Kamm.pdf
#
10.08.20133.42 Mб67E coli in Motion - Howard C. Berg.pdf
#
10.08.201316.8 Mб59Engineering and Manufacturing for Biotechnology - Marcel Hofman & Philippe Thonart.pdf
#
10.08.20137.9 Mб234Environmental Biotechnology - Jordening and Winter.pdf