Physics of biomolecules and cells


parameter $p_a$, given a finite sample drawn from the distribution, is not the most probable value $n_a/N$, which for instance can be 0.

To apply this to clustering, consider a large number, $S$, of sequences, each of the same length, obtained by sampling $M$ unknown frequency matrices $w_{ai}$, where $i$ indexes the columns and $\sum_{a=1}^{A} w_{ai} = 1$ (i.e., the entries in the matrix give, for each column $i$, the frequencies of the letters). The problem is then to group together the sequences drawn from a common weight matrix and to recover, within the errors imposed by the finite sample size, the set of $w_{ai}$. Consider a subset of $N$ sequences; the probability $P(C)$ that they were drawn from a single weight matrix is the product of I(1) over all the columns. A probability distribution can be defined over the entire set of $S$ sequences by allowing all possible partitions into clusters, each with a weight $\prod_i P(C_i)$. Thus there is a competition, one can show, between all the ways of partitioning $S$ things into subsets and the “energy”, which favors putting sequences from the same frequency matrix together. This weighting scheme can be used either in a one-pass phylogenetic clustering or, more correctly, with Monte Carlo sampling, which will generate soft clusters and allow an assignment of significance.
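As a rough illustration of this weighting, the sketch below assumes that the per-column factor I(1) (defined earlier in the text and not reproduced here) is the multinomial likelihood of the observed letters integrated over a uniform prior on the letter frequencies, which gives $(A-1)!\,\prod_a n_a!/(N+A-1)!$ for a column with counts $n_a$. That choice of prior, the four-letter alphabet, and all function names are assumptions made for the example, not the author's code.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet, so A = 4


def column_log_marginal(letters, alphabet=ALPHABET):
    """Log-probability of one column's letters, assuming a single unknown
    frequency vector integrated over a uniform prior (an assumption)."""
    counts = Counter(letters)
    n_total = len(letters)
    a_size = len(alphabet)
    # log of (A-1)! * prod_a n_a! / (N + A - 1)!, via log-gamma functions
    log_p = math.lgamma(a_size) - math.lgamma(n_total + a_size)
    for a in alphabet:
        log_p += math.lgamma(counts[a] + 1)
    return log_p


def cluster_log_prob(sequences, alphabet=ALPHABET):
    """log P(C): sum of the column terms for equal-length sequences."""
    return sum(column_log_marginal(col, alphabet) for col in zip(*sequences))


def partition_log_weight(clusters, alphabet=ALPHABET):
    """Log-weight of a partition: sum over clusters of log P(C_i)."""
    return sum(cluster_log_prob(c, alphabet) for c in clusters)
```

Under these assumptions, two identical sequences receive a higher weight when placed in one cluster than when split into two, while two sequences that disagree in every column are better kept apart. A Monte Carlo sampler would propose moves between partitions and accept them according to differences in `partition_log_weight`.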

Intuitively, for given S, there is a limit to how many frequency matrices can be resolved (which depends also on their degree of polarization). Discrimination obviously improves if more samples from the same matrix are supplied. Finally, there is a very interesting regime where it is possible to classify most sequences if the set of M frequency matrices is known, yet it is impossible to cluster these sequences if one knows nothing about the matrices. The former problem is the one faced by the cell, since it “knows” the proteins which do the site-recognition, whereas sequence-clustering is only a problem for the bioinformatician.

5 Gene regulation

The extraction of the sites active in transcription control from the genome is a more daunting task than gene identification, since the individual protein-binding sites are much smaller than typical exons, and their arrangement is not so choreographed as the promoter-exon-intron pattern of genes. Three types of data can be brought to bear on the problem, and all appear necessary. For a single genome, one can search for repetition between the regulatory regions of different genes. The repeats can be at the level of specific strings (perhaps with a few spelling errors) or groups of similar strings that occur in clusters. In all cases it is assumed that improbability under some model implies function, and for the calculations to be tractable there needs to be some vestige of the signal on scales short enough to be searched exhaustively. (The hard cases are those where the motif is long and mutated and where there is no statistically significant signal in just a few copies.)

E.D. Siggia: Some Physical Problems in Bioinformatics


The issue raised above, that the cell can function by merely classifying sites while there may not be enough copies to allow clustering, is clearly relevant here. The application of one genome-wide algorithm to yeast was discussed by H. Bussemaker.

The second source of data is comparative genomics, namely we exploit the fact that what is functional is more constrained and evolves less rapidly than what is not. The protein coding regions serve as landmarks for the regulatory regions to compare since they are much larger and evolve more slowly than the regulatory sites. In reality there are merely degrees of constraint and the scale in bp on which compensatory mutations (preserving fitness) can occur is also unknown. The ideal case is individual protein binding sites immersed in a sea of random sequence. In bacteria where the total regulatory region of a gene is a few hundred bases, the conserved domains are typically larger than a single binding site (N. Rajewsky). The current state of the art (McCue and C. Lawrence) in this area is to examine the regulatory regions for one gene from several species. One is then faced with the task of clustering sites for individual genes into families recognized, one hopes, by a single protein.

Finally there remains mRNA expression data. If the question being asked is how expression follows from sequence, there is little reason to first cluster genes based on similarity of expression, and then look for common sequence motifs. The clustering should follow from the sequence. Following the idea that the polymerase which makes mRNA is recruited to the promoter by equilibrium binding to certain sites (or other proteins attached to these sites), we have fit the log of the expression ratio, $R_g$ for gene $g$, to the sum of contributions $F_m$ for motif $m$ by minimizing $\sum_g \bigl(R_g - C - \sum_m F_m N_{g,m}\bigr)^2$ with respect to $F_m$ and $C$, where the integer $N_{g,m}$ is the number of copies of motif $m$ upstream of gene $g$ (H. Bussemaker). This scheme is sensitive to combinatorial control. Genes which do not respond, but carry a functional site, are informative about potential compensatory factors. All genes are fit and when the residuals are Gaussian it is easy to assign significance to the sequence motifs that correlate with expression.
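The fit described above is an ordinary linear least-squares problem in the unknowns $C$ and $F_m$. The following is a minimal sketch, assuming the log expression ratios and the motif counts are already available as arrays; the function name and the array layout are hypothetical choices for illustration, and the published method may differ in details such as motif selection.

```python
import numpy as np


def fit_motif_contributions(log_ratios, motif_counts):
    """Least-squares fit of R_g ~ C + sum_m F_m * N_{g,m}.

    log_ratios   : shape (G,)   -- R_g for each gene g
    motif_counts : shape (G, M) -- N_{g,m}, copies of motif m upstream of g
    Returns (C, F) with F of shape (M,).
    """
    R = np.asarray(log_ratios, dtype=float)
    N = np.asarray(motif_counts, dtype=float)
    # Design matrix: a column of ones for the constant C, then the counts.
    X = np.column_stack([np.ones(len(R)), N])
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)
    return coef[0], coef[1:]


def residuals(log_ratios, motif_counts, C, F):
    """Per-gene residuals R_g - C - sum_m F_m N_{g,m}."""
    R = np.asarray(log_ratios, dtype=float)
    N = np.asarray(motif_counts, dtype=float)
    return R - C - N @ F
```

With the residuals in hand, one can check how close they are to Gaussian before attaching significance to any fitted $F_m$.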

References

[1] http://www4.ncbi.nlm.nih.gov/PubMed/

[2] R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis (Cambridge Univ. Press, 1998).

[3] http://cmgm.stanford.edu/pbrown/

[4] http://www.affymetrix.com/index.shtml

[5] M.S. Waterman, Introduction to Computational Biology (Chapman & Hall, N.Y., 1995).

[6] D. Gusfield, Algorithms on Strings, Trees, and Sequences (Cambridge Univ. Press, N.Y., 1997).

[7] http://www.ncbi.nlm.nih.gov:80/BLAST/

[8] http://www.people.Virginia.EDU/wrp/pearson.html

[9] http://www-groups.dcs.st-andrews.ac.uk/history/Mathematicians/Bayes.html

COURSE 10

THREE LECTURES ON BIOLOGICAL NETWORKS

M.O. MAGNASCO

Center for Studies in Physics and Biology, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA

Contents

1 Enzymatic networks. Proofreading knots: How DNA topoisomerases disentangle DNA
1.1 Length scales and energy scales
1.2 DNA topology
1.3 Topoisomerases
1.4 Knots and supercoils
1.5 Topological equilibrium
1.6 Can topoisomerases recognize topology?
1.7 Proposal: Kinetic proofreading
1.8 How to do it twice
1.9 The care and proofreading of knots
1.10 Suppression of supercoils
1.11 Problems and outlook
1.12 Disquisition

2 Gene expression networks. Methods for analysis of DNA chip experiments
2.1 The regulation of gene expression
2.2 Gene expression arrays
2.3 Analysis of array data
2.4 Some simplifying assumptions
2.5 Probeset analysis
2.6 Discussion

3 Neural and gene expression networks: Song-induced gene expression in the canary brain
3.1 The study of songbirds
3.2 Canary song
3.3 ZENK
3.4 The blush
3.5 Histological analysis
3.6 Natural vs. artificial
3.7 The Blush II: gAP
3.8 Meditation


Abstract

This course comprised three lectures whose uniting thread was the study of some biological networks of interest to the physicist: enzymatic, gene transcription and neural.

Lecture 1: Enzymatic networks. Proofreading knots: A conjecture on how DNA topoisomerases disentangle DNA. It is vitally important to living cells to be able to manage the topology of their DNA. Topoisomerases are the enzymes in charge of handling knotting and supercoiling of DNA. It was believed for a long time that they did so by permitting random strand passage, rendering DNA effectively a ghost-like polymer. But it has been shown experimentally that this is not so: topoisomerases do substantially better than random strand passage. This raises the question of how an enzyme may survey the topology of a DNA strand thousands of times larger than itself. We discuss some possible mechanisms for this.

Lecture 2: Gene Expression Networks. Methods for analysis of DNA chip experiments. We outline and discuss the most important features behind “gene chips”, hybridization arrays in widespread use for gene expression. We concentrate on one of the two most popular technologies, the GeneChip arrays. We discuss various methods for reconstructing RNA concentrations from the measured fluorescence in the arrays.

Lecture 3: Neural and gene expression networks. Song-induced gene expression in the canary brain. We outline the basic features of immediate early gene expression following stimulation. We show how it has been used, via large-scale histological mapping, to dissect the rules of representation of song elements in canary brain.

© EDP Sciences, Springer-Verlag 2002

These lectures are about pieces of research with which I have been personally involved over the past few years. Their subject matter goes from enzymology, through bioinformatics, to neuroscience; and spans purely theoretical work, through data analysis, to experimental work. Yet, though apparently dissimilar, they do have an overarching theme: the study of biological networks, in their various incarnations and hugely diverse timescales and lengthscales. Networks are interesting in that their function does not reside in any given part, but in the way the whole is assembled; too often we think that biological function is carried out by a specific, single-purpose piece of hardware which does “just that”.

I opened with a theme with which physics and mathematical audiences would probably feel more at ease. Topoisomerases are enzymes which manage the topology of DNA, a vital function to living beings. Experimental evidence shows that topoisomerases are somehow able to unknot DNA better than random strand crossing; i.e., they obtain information about the topology of DNA. How an enzyme may survey the topology of an object several orders of magnitude larger is a clearly defined problem for a physicist. That it happens to be of some biological importance does not detract from the clarity of the definition. One could conceive of a machine for measuring topology. But one could also conceive that the measurement is accomplished by the enzymatic network of chemical reactions defined by the dynamics of the enzyme, and this is the possibility explored here.

The next subject was meant to exemplify the dire need for analytically minded people in the gene expression network area. Gene expression arrays, or gene chips, have become hugely popular tools to try to infer gene regulation interactions. I briefly describe the general ideas and then plunge into a problem we studied in detail, that of obtaining a measurement of concentration from the GeneChip arrays manufactured by Affymetrix.

I close with what I consider to be the frontier, in every sense. Neuroscience is, to my mind, the most deeply fascinating branch of science; also the most deeply disturbing. For a physicist, the extent to which the discussion is ill-defined is simply unsettling; yet the mystery is so deep that one cannot but feel excited and awed. I hope I have been able to illustrate this by means of our studies of representations of song fragments in the brain of canaries, a study which we carried out by looking at gene expression in the auditory nuclei.

1 Enzymatic networks. Proofreading knots: How DNA topoisomerases disentangle DNA

In many instances, biochemistry shows us little specific machines which undertake a particular job: they cut one specific bond in one specific configuration, or they take such an arrangement of atoms and rearrange it exactly thus. So it is usually thought that a specific job is carried out by the enzyme, just like a little clockwork thingie; and so the doctrine of one enzyme, one function evolved. On the other hand, we don’t think of thought as something done on a single neuron level: it is a task collectively carried out by a network of neurons. In this lecture I will tell the story of something in between: a very specific job, the untangling of DNA, which may be carried out, not by a specific machine doing one specific rearrangement, but by a network. There is, of course, an underlying machine, the topoisomerase, which is the one carrying out the strand rearrangements necessary to actually change the topology; but untangling does not entail strand crossings alone, but also the ability to discriminate the topology of the system being untangled. Such discrimination between the knotted and unknotted states could not have been done by one particular machine; we propose that it is, rather, a property of the network of chemical reactions the topoisomerase carries out. The network of chemical reactions is structurally similar, and quantitatively behaves similarly, to the kinetic proofreading reaction networks proposed by Hopfield and Ninio to understand the specificity of biological polymer replication; hence the title of this lecture.

1.1 Length scales and energy scales

DNA is a long and thin polymer, and living beings carry a whole lot of it. In bacteria, as a rough guide, DNA is about 1000 times longer and 1000 times thinner than a typical bacterium. So if E. coli were the size of a small classroom, about 5 m, its DNA payload would be about 5 kilometers’ worth of 5-mm-thick wire; or about as much ethernet wire as there is in a whole small building. A polymer in a fluid freely fluctuates under the action of thermal agitation, bending and writhing to the extent compatible with thermal energetics. A natural comparison between its bending stiffness and the thermal energy scale may be introduced by a lengthscale, called the persistence length, which intuitively is the “typical” radius of curvature of the polymer strand when agitated by thermal motion. It is defined, equivalently, as the correlation length of the tangent vector to the polymer, or as the length of polymer that can be bent in a circular arc of one radian with a cost of 1 kT worth of bending energy. The persistence length of DNA is less than a tenth of the size of a typical bacterium. Thus DNA is quite flexible on the scale of the bacterium, and it can easily be fit within one, as far as elastic energy is concerned. On the other hand, if left to its own devices, the bacterial genome would form a Gaussian loop of string about 10 times the size of the bacterium (10 000 persistence lengths’ worth of DNA would like to random walk around a heap of approximately 100 persistence lengths in diameter, or 10 times the size of the bacterium).
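The estimate in parentheses is the usual random-walk scaling: a chain of $n$ independent segments spreads over roughly $\sqrt{n}$ segment lengths. A small sketch of the arithmetic follows, working in units of the persistence length and ignoring order-one prefactors; the constant and function names are hypothetical, and the numbers are the rough ones quoted in the text.

```python
import math

# Rough figures from the text, in units of the persistence length.
DNA_LENGTH = 10_000   # contour length of the genome
BACTERIUM_SIZE = 10   # bacterium is about ten persistence lengths across


def coil_size(n_segments, segment_length=1.0):
    """Random-walk scaling estimate of the coil size: sqrt(n) * segment length.
    Order-one prefactors are deliberately dropped."""
    return math.sqrt(n_segments) * segment_length


def coil_to_cell_ratio(dna_length=DNA_LENGTH, cell_size=BACTERIUM_SIZE):
    """How many times larger than the cell the free coil would be."""
    return coil_size(dna_length) / cell_size
```

With the quoted figures, `coil_size(DNA_LENGTH)` is 100 persistence lengths and `coil_to_cell_ratio()` is 10, matching the statement that the free coil would be about ten times the size of the bacterium.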

Stuffing a lot of DNA within the small confines of a bacterium is a problem, as can be seen in electron micrographs showing punctured bacteria: DNA literally geysers out of them by the loopful. But as we have argued above, confining DNA to the inside of a bacterium does not pose an elastic energy problem, but an entropic problem. This unusual characteristic will stay with us through this lecture: pretty much everything which we shall discuss below carries the strange flavor that bending is not so much the issue as confining entropically to a small region.

1.2 DNA topology

So there’s a lot of DNA and it is bending and writhing under the kicking of thermal energy, which is a recipe for a topological nightmare. In addition to this, it should be remembered that DNA is a double helix. The individual strands are twisted around one another, about 1 turn per 10 base pairs, or half a million turns for E. coli. This means one strand may not be easily separated from the other. They could be, in principle, if the ends were free; but free ends of DNA are chemically fragile and hard to replicate, so bacteria either carry their DNA in a loop, or they anchor DNA free ends to the wall. (We have a sophisticated structure in place at the free ends of our DNA, the telomere, which deserves a much lengthier description.) In either case, the ends are rigidly held, so we can speak about the topology of the DNA: while a knot may be smoothly removed from a piece of string with free ends (which is the reason one should first locate the free ends in order to untangle a knot), a loop of string, or a piece of string tethered to a wall, has no free ends, and so a knot may not be smoothly removed from it. If, in addition, the string has an internal structure such as being made of two twisted cables, then the number of links between the cables is a topological invariant.

In order to reproduce, a bacterium needs to duplicate its DNA. To do so, it must make a copy of each of the two strands. But in order for each cell to go its separate way, the two strands need to be separated from one another, or unlinked, in topological terms. Thus it is necessary, just to be able to reproduce, to perform topological operations; at least a few hundred thousand per reproductive cycle, since there are about half a million links between E. coli strands. These operations must be performed with utmost care, as is anything affecting the integrity of the DNA backbone.
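The counting behind these figures can be sketched in a few lines. The genome size of roughly 4.6 million base pairs for E. coli is not given in the text and is assumed here, as is the option of removing two links per enzymatic event (the step size usually quoted for double-strand passage); the function names are hypothetical.

```python
E_COLI_BASE_PAIRS = 4_600_000  # assumed approximate genome size, not from the text
BP_PER_TURN = 10.0             # "about 1 turn per 10 base pairs", as in the text


def helical_turns(n_base_pairs, bp_per_turn=BP_PER_TURN):
    """Number of times the two strands wind around each other."""
    return n_base_pairs / bp_per_turn


def unlinking_events(n_links, links_per_event=1):
    """Minimum number of enzymatic events needed to remove n_links links,
    if each event removes links_per_event of them (an assumed parameter)."""
    return n_links / links_per_event
```

With these assumptions `helical_turns(E_COLI_BASE_PAIRS)` is 460 000, close to the half million quoted, and `unlinking_events(460_000, links_per_event=2)` is 230 000, consistent with "a few hundred thousand" operations per reproductive cycle.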

The enzymes charged with managing the topology of DNA in living cells are called topoisomerases. The name itself requires an explanation. In biochemistry talk, a widdigly-ase is an enzyme charged with catalyzing a reaction of widdigly into something else; i.e., widdigly is a substrate of the widdigly-ase. You don’t choose to name the enzyme after whatever is in common between the states before and after the catalysis, but after the most prominent difference, that which has something to do with whatever it is that the enzyme has changed, with the reaction that took place. Now, sometimes you have stuff that is identical except in some particular way: such things are called isomers. In particular, topoisomers are two or more structures which are identical except in topological terms. A 10 572 base pair loop of DNA in an untangled configuration, and the identical strand tangled in a knot, cannot be considered the same chemical substance, since there is no transformation, short of cutting the covalent bonds of the backbone, untangling and then soldering the covalent bonds again, that will change one into the other. They are thus called topoisomers, since they are identical except for their topology. An enzyme whose job is the catalytic transformation of one topoisomer into another is, thus, called a topoisomerase. Then another deep-seated tradition steps in, and topoisomerases become “topos”.

1.3 Topoisomerases

Topoisomerases are classified according to whether they do or do not use energy to perform their jobs, a very basic distinction between enzymes in the biological world. Enzymes which do not use energy are catalysts, passively accelerating the rates of reactions that otherwise would take place naturally. They can change the timescale of the reaction (say by binding the reactants in close-together spots on their surface that favor the reaction taking place); however, they cannot change the equilibrium probabilities of finding the reactants in this or that configuration, for these are given by the Boltzmann distribution. Once the Boltzmann equilibrium has been achieved, and in the absence of energy consumption, we are dealing with a single isothermal bath, and the Second Law forbids moving away from equilibrium. Powered enzymes, on the other hand, can perform chemical tasks which are thermodynamically “uphill” [8]. They do so by coupling the uphill reaction to a “downhill” reaction, the consumption of some form of fuel or energy currency, which makes the overall reaction downhill [9]. An example is a motor protein, which consumes energy by hydrolyzing ATP (the universal energy currency in cells) and may exert mechanical work by moving against an external force. This has been a major subject during the previous Lectures, and hence I’ll refer back to them. Here I shall point out that it is not always evident what the energy being spent is being used for, or in which way; I think this lecture will provide an example.
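The statement that a passive catalyst changes timescales but not equilibrium probabilities can be illustrated with a two-state toy model. The sketch below assumes Arrhenius-type rates over a single barrier, with energies in units of kT; the model and all names are illustrative assumptions, not taken from the lecture.

```python
import math


def arrhenius_rates(e_reactant, e_product, e_barrier, kT=1.0, attempt=1.0):
    """Forward and backward rates over a single barrier (toy model)."""
    k_forward = attempt * math.exp(-(e_barrier - e_reactant) / kT)
    k_backward = attempt * math.exp(-(e_barrier - e_product) / kT)
    return k_forward, k_backward


def equilibrium_ratio(k_forward, k_backward):
    """P(product) / P(reactant) at equilibrium, from detailed balance."""
    return k_forward / k_backward


def boltzmann_ratio(e_reactant, e_product, kT=1.0):
    """The same ratio computed directly from the Boltzmann distribution."""
    return math.exp(-(e_product - e_reactant) / kT)
```

Lowering `e_barrier` from, say, 20 to 10 multiplies both rates by the same factor, so the reaction runs faster while `equilibrium_ratio` stays equal to `boltzmann_ratio`. Shifting that ratio requires coupling to an energy source, which this passive model does not include.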

Energetically passive topoisomerases are called “class I”, while topos which couple to energy consumption are “class II”. Class I topoisomerases catalyze the following reaction: they bind to DNA; they temporarily nick, or cut, the covalent backbone in one strand of DNA; they allow the free ends to rotate for a while around the un-nicked covalent bond; then they solder them again. Thus they change the number of links between the two DNA chains, and they do so in the strictly energetically downhill direction: if the two strands had been overwound, a class I topo relaxes this back to the equilibrium torsion. Strictly speaking, this is all that would be needed to allow a circular
