
204 ERROR CONTROL CODES AND THE GENOME

processes (annealing, ligation, etc.) to compute a solution [62]. Researchers have proposed several applications of the DNA computing framework, including algorithms for breaking the Data Encryption Standard, DNA encryption methods, and techniques to investigate nature’s cellular computing processes [62–65]. In DNA computing, the information storage capability of DNA is combined with laboratory techniques that manipulate the DNA to perform computations [65]. A major challenge in DNA computing is encoding algorithms into the DNA medium. Kari and colleagues apply circular coding methods to the forward-encoding problem for DNA computing applications [65], the forward problem being how one can encode an algorithm in DNA such that undesirable folding is avoided. Kari et al. use coding theory to define heuristics for constructing codewords for DNA computing applications. The codewords must not form undesirable bonds with themselves or with other codewords used or produced during the computational process. Error control and correction in DNA computing are also being investigated [66, 67].
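The codeword constraints described above can be sketched as a screening heuristic: keep a candidate only if it stays far (in Hamming distance) from the reverse complement of itself and of every codeword already chosen, so strands cannot hybridize strongly. The distance criterion, threshold, and greedy construction are illustrative assumptions, not Kari et al.’s actual method:

```python
# Sketch of a codeword-screening heuristic for DNA computing, assuming a
# simple Hamming-distance criterion; the greedy strategy is illustrative.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_compatible(word, codebook, d_min):
    """Accept a candidate only if it stays at least d_min mismatches away
    from every chosen codeword and from every reverse complement
    (including its own, which rejects self-complementary words)."""
    for other in codebook + [word]:
        if hamming(word, reverse_complement(other)) < d_min:
            return False
    return all(hamming(word, other) >= d_min for other in codebook)

def greedy_codebook(candidates, d_min):
    """Greedily collect mutually compatible codewords."""
    codebook = []
    for word in candidates:
        if is_compatible(word, codebook, d_min):
            codebook.append(word)
    return codebook
```

Real constructions also model melting temperature and frame-shifted bindings; this sketch captures only the pairwise-distance idea.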

7.4.3. Error Control Coding Methods for Genomic Sequence and System Analysis

Application of coding theory to genetic data dates back to the late 1950s [68, 69] with the deciphering of the genetic code. Since then, ECC methods have been applied to genetic sequence analysis and classification, biological chip design, as well as analysis of genetic regulatory processes. Sengupta and Tompa approach the problem of oligo array design from a combinatorial design framework and use ECC methods to increase the fidelity of oligo array construction [70]. Reif and LaBean propose ECC-based methods for the development of error correction strands for repairing errors in DNA chips [71].

Based on the discriminating behavior of their ECC model, May et al. constructed four Bayesian classifiers to distinguish between valid and invalid ribosome binding sites [49, 72]. Their classification system uses an 11-base classification window, a relatively small decision window compared to other classification systems [73–75]. Similar to the biological model, the error-control-based classifiers use the redundancy, or extra information, present in the mRNA leader sequence to locate valid translation initiation sites. May et al.’s optimal classification system has a correct classification rate of 73.81%, with true positive and true negative rates of 67.90% and 79.72%, respectively. These results suggest that it is feasible to implement an ECC-based classification system for detecting, and possibly designing, prokaryotic translation initiation sites. Elucidating how genetic systems incorporate and use redundancy, and understanding the functional significance of genetic errors from a coding theory perspective, will provide insight into the fundamental rules that govern genetic regulatory systems.
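The reported 73.81% correct classification rate is consistent with averaging the per-class rates over equally sized classes; the small check below makes that arithmetic explicit (the class sizes are assumed, not taken from the study):

```python
# Overall accuracy as the count-weighted mean of the per-class rates;
# the example class sizes (1000/1000) are an assumption for illustration.

def correct_classification_rate(tp_rate, tn_rate, n_pos, n_neg):
    """Fraction of all examples classified correctly."""
    correct = tp_rate * n_pos + tn_rate * n_neg
    return correct / (n_pos + n_neg)

# With equal class sizes this is the simple average of 67.90% and 79.72%.
rate = correct_classification_rate(0.6790, 0.7972, 1000, 1000)
print(round(100 * rate, 2))  # → 73.81
```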

ACKNOWLEDGMENTS

The work described is a result of research performed in close collaboration with Drs. Mladen A. Vouk and Donald L. Bitzer of North Carolina State University’s Computer Science


Department, Dr. David I. Rosnick (while at North Carolina State University), and Mr. David C. Schmidt, a Department of Energy Computational Sciences Graduate Fellow at the University of Illinois—UC. The author would like to thank Drs. Anna M. Johnston (formerly with Sandia National Laboratories) and William E. Hart and Jean-Paul Watson of Sandia National Laboratories for their contributions to this work.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

1.R. Roman-Roldan, P. Bernaola-Galvan, and J. L. Oliver, “Application of information theory to DNA sequence analysis: A review,” Pattern Recognition, 29(7): 1187–1194, 1996.

2.R. Sarkar, A. B. Roy, and P. K. Sarkar, “Topological information content of genetic molecules—I,” Math. Biosci., 39: 299–312, 1978.

3.T. B. Fowler, “Computation as a thermodynamic process applied to biological systems,” Int. J. Bio-Med. Comput., 10(6): 477–489, 1979.

4.C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949.

5.K. Palaniappan and M. E. Jernigan, “Pattern analysis of biological sequences,” in Proceedings of the 1984 IEEE International Conference on Systems, Man, and Cybernetics, Halifax, Canada, 1984, pp. 421–426.

6.H. Almagor, “Nucleotide distribution and the recognition of coding regions in DNA sequences: An information theory approach,” J. Theor. Biol., 117: 127–136, 1985.

7.T. D. Schneider, “Theory of molecular machines. II. Energy dissipation from molecular machines,” J. Theor. Biol., 148: 125–137, 1991.

8.T. D. Schneider, “Theory of molecular machines. I. Channel capacity of molecular machines,” J. Theor. Biol., 148: 83–123, 1991.

9.S. F. Altschul, “Amino acid substitution matrices from an information theoretic perspective,” J. Mol. Biol., 219: 555–565, 1991.

10.P. Salamon and A. K. Konopka, “A maximum entropy principle for the distribution of local complexity in naturally occurring nucleotide sequences,” Comput. Chem., 16(2): 117–124, 1992.

11.J. L. Oliver, P. Bernaola-Galvan, J. Guerrero-Garcia, and R. Roman-Roldan, “Entropic profiles of DNA sequences through chaos-game-derived images,” J. Theor. Biol., 160: 457–470, 1993.

12.F. M. De La Vega, C. Cerpa, and G. Guarneros, “A mutual information analysis of tRNA sequence and modification patterns distinctive of species and phylogenetic domain,” in Pacific Symposium on Biocomputing, 1996, pp. 710–711.

13.T. D. Schneider and D. N. Mastronarde, “Fast multiple alignment of ungapped DNA sequences using information theory and a relaxation method,” Discrete Appl. Math., 71: 259–268, 1996.

14.B. J. Strait and T. Gregory Dewey, “The Shannon information entropy of protein sequences,” Biophys. J., 71: 148–155, 1996.


15.A. Pavesi, B. De Iaco, M. Ilde Granero, and A. Porati, “On the informational content of overlapping genes in prokaryotic and eukaryotic viruses,” J. Mol. Evol., 44(6): 625–631, 1997.

16.D. Loewenstern and P. N. Yianilos, “Significantly lower entropy estimates for natural DNA sequences,” in Proceedings of the Data Compression Conference, Snowbird, Utah, IEEE Press, Piscataway, NJ, 1997.

17.T. D. Schneider, “Information content of individual genetic sequences,” J. Theor. Biol., 189: 427–441, 1997.

18.T. D. Schneider, “Measuring molecular information,” J. Theor. Biol., 201: 87–92, 1999.

19.L. L. Gatlin, Information Theory and the Living System, Columbia University Press, New York, 1972.

20.H. Yockey, Information Theory and Molecular Biology, Cambridge University Press, New York, 1992.

21.M. Eigen, “The origin of genetic information: Viruses as models,” Gene, 135: 37–47, 1993.

22.P. Sweeney, Error Control Coding: An Introduction, Prentice-Hall, Englewood Cliffs, NJ, 1991.

23.J. B. Anderson and S. Mohan, Source and Channel Coding: An Algorithmic Approach, Kluwer Academic, Norwell, MA, 1991.

24.A. Dholakia, Introduction to Convolutional Codes with Applications, Kluwer Academic, Norwell, MA, 1994.

25.S. Lin and D. J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1983.

26.E. E. May, M. A. Vouk, D. L. Bitzer, and D. I. Rosnick, “An error-correcting code framework for genetic sequence analysis,” J. Franklin Inst., 341: 89–109, 2004.

27.A. Dholakia, D. L. Bitzer, and M. A. Vouk, “Table based decoding of rate one-half convolutional codes,” IEEE Trans. Commun., 43: 681–686, 1995.

28.D. L. Bitzer, A. Dholakia, H. Koorapaty, and M. A. Vouk, “On locally invertible rate-1/n convolutional encoders,” IEEE Trans. Inform. Theory, 44: 420–422, 1998.

29.E. E. May, “Towards a biological coding theory discipline,” New Thesis, January 2004.

30.G. Battail, “Does information theory explain biological evolution?” Europhys. Lett., 40(3): 343–348, November 1997.

31.E. May, M. Vouk, D. Bitzer, and D. Rosnick, “Coding theory based models for protein translation initiation in prokaryotic organisms,” BioSystems, 76: 249–260, 2004.

32.B. Lewin, Genes V, Oxford University Press, New York, 1995.

33.E. Eni May, Analysis of Coding Theory Based Models for Initiating Protein Translation in Prokaryotic Organisms, Ph.D. Thesis, North Carolina State University, Raleigh, NC, March 2002.

34.G. Rosen and J. Moore, “Investigation of coding structure in DNA,” in ICASSP 2003, Hong Kong, 2003.

35.J. Watson, N. Hopkins, J. Roberts, J. Steitz, and A. Weiner, Molecular Biology of the Gene, Benjamin Cummings Publishing Company, Menlo Park, CA, 1987.

36.E. Szathmary, “The origin of the genetic code: Amino acids as cofactors in an RNA world,” Trends Genet., 16(1): 17–19, 2000.

37.R. D. Knight and L. F. Landweber, “Guilt by association: The arginine case revisited,” RNA, 6(4): 499–510, 2000.


38.L. Gold and G. Stormo, “Translational initiation,” in F. C. Neidhardt (Ed.), Escherichia coli and Salmonella typhimurium, Cellular and Molecular Biology, ASM Press, Washington, D.C., 1987, pp. 1302–1307.

39.D. E. Draper, “Translation initiation,” in F. C. Neidhardt (Ed.), Escherichia coli and Salmonella, Cellular and Molecular Biology, ASM Press, Washington, D.C., 1996, pp. 902–908.

40.T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, “Information content of binding sites on nucleotide sequences,” J. Mol. Biol., 188: 415–431, 1986.

41.L. S. Liebovitch, Y. Tao, A. Todorov, and L. Levine, “Is there an error correcting code in DNA?” Biophys. J., 71: 1539–1544, 1996.

42.D. MacDonaill, “A parity code interpretation of nucleotide alphabet composition,” Chem. Commun., 18: 2062–2063, 2002.

43.T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.

44.D. C. Schmidt and E. E. May, “Visualizing ECC properties of E. coli K-12 translation initiation sites,” Paper presented at the Workshop on Genomic Signal Processing and Statistics, Baltimore, MD, 2004.

45.J. W. Drake, B. Charlesworth, D. Charlesworth, and J. F. Crow, “Rates of spontaneous mutation,” Genetics, 148: 1667–1686, 1998.

46.J. W. Drake, “A constant rate of spontaneous mutation in DNA-based microbes,” Proc. Natl. Acad. Sci., 88: 7160–7164, 1991.

47.A. Bebenek, G. T. Carver, H. Kloos Dressman, F. A. Kadyrov, J. K. Haseman, V. Petrov, W. H. Konigsberg, J. D. Karam, and J. W. Drake, “Dissecting the fidelity of bacteriophage RB69 DNA polymerase: Site-specific modulation of fidelity by polymerase accessory proteins,” Genetics, 162: 1003–1018, 2002.

48.E. E. May, M. A. Vouk, D. L. Bitzer, and D. I. Rosnick, “Coding model for translation in E. coli K-12,” Paper presented at the First Joint Conference of EMBS-BMES, Atlanta, GA, 1999.

49.E. E. May, M. A. Vouk, D. L. Bitzer, and D. I. Rosnick, “Coding theory based models for protein translation initiation in prokaryotic organisms,” BioSystems, 76: 249–260, 2004.

50.E. E. May, “Comparative analysis of information based models for initiating protein translation in Escherichia coli K-12,” M.S. Thesis, North Carolina State University, December 1998.

51.R. Kotrys and P. Remlein, “The genetic algorithm used in search of the good TCM codes,” Paper presented at the Fourth International Workshop on Systems, Signals and Image Processing, IWSSIP’97, Poznan, Poland, 1997, pp. 53–57.

52.D. L. Bitzer and M. A. Vouk, “A table-driven (feedback) decoder,” Paper presented at the Tenth Annual International Phoenix Conference on Computers and Communications, 1991, Phoenix, Arizona, pp. 385–392.

53.D. I. Rosnick, Free Energy Periodicity and Memory Model for E. coli Codings, Ph.D. Thesis, North Carolina State University, Raleigh, NC, 2001.

54.D. G. Arques and C. J. Michel, “A code in the protein coding genes,” BioSystems, 44: 107–134, 1997.

55.N. Stambuk, “On circular coding properties of gene and protein sequences,” Croatica Chem. Acta, 72(4): 999–1008, 1999.

56.N. Stambuk, “Symbolic Cantor Algorithm (SCA): A method for analysis of gene and protein coding,” Periodicum Biologorum, 101(4): 355–361, 1999.


57.E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill Book Company, New York, 1968.

58.D. R. Powell, D. L. Dowe, L. Allison, and T. I. Dix, “Discovering simple DNA sequences by compression,” Paper presented at the Pacific Symposium on Biocomputing, Maui, Hawaii, 1998, pp. 597–608.

59.D. M. Loewenstern, H. M. Berman, and H. Hirsh, “Maximum a posteriori classification of DNA structure from sequence information,” Paper presented at the Pacific Symposium on Biocomputing, Maui, Hawaii, 1998, pp. 669–680.

60.D. Loewenstern and P. N. Yianilos, “Significantly lower entropy estimates for natural DNA sequences,” J. Comput. Biol., 6(1): 125–142, 1999.

61.O. Delgrange, M. Dauchet, and E. Rivals, “Location of repetitive regions in sequences by optimizing a compression method,” Paper presented at the Pacific Symposium on Biocomputing, Big Island, Hawaii, 1999, pp. 254–265.

62.C. C. Maley, “DNA computation: Theory, practice, and prospects,” Evol. Comput., 6(3): 201–229, 1998.

63.L. Adleman, P. Rothemund, S. Roweis, and E. Winfree, “On applying molecular computation to the data encryption standard,” J. Comput. Biol., 6(1): 53–63, 1999.

64.L. F. Landweber and L. Kari, “The evolution of cellular computing: Nature’s solution to a computational problem,” BioSystems, 52: 3–13, 1999.

65.L. Kari, J. Kari, and L. F. Landweber, “Reversible molecular computation in ciliates,” in J. Karhumaki, H. Maurer, G. Paun, and G. Rozenberg (Eds.), Jewels are Forever, Springer-Verlag, Berlin, 1999, pp. 353–363.

66.D. Boneh, C. Dunworth, R. J. Lipton, and J. Sgall, “Making DNA computers error resistant,” in L. F. Landweber and E. B. Baum (Eds.), DNA Based Computers II, American Mathematical Society, Providence, RI, 44: 165–172, 1998.

67.D. H. Wood, “Applying error correcting codes to DNA computing,” Paper presented at the Fourth DIMACS Workshop on DNA Based Computers, Philadelphia, 1998.

68.B. Hayes, “The invention of the genetic code,” Am. Sci., 86(1): 8–14, 1998.

69.S. W. Golomb, “Efficient coding for the deoxyribonucleic channel,” Proc. Symp. Appl. Math., 14: 87–100, 1962.

70.R. Sengupta and M. Tompa, “Quality control in manufacturing oligo arrays: A combinatorial design approach,” J. Comput. Biol., 9(1): 1–22, 2002.

71.J. Reif and T. LaBean, “Computationally inspired biotechnologies: Improved DNA synthesis and associative search using error-correcting codes and vector-quantization,” Paper presented at DNA Computing: Sixth International Meeting on DNA-Based Computers, Leiden, Netherlands, 2000.

72.E. E. May, M. A. Vouk, D. L. Bitzer, and D. I. Rosnick, “Coding theory based maximum-likelihood classification of translation initiation regions in Escherichia coli K-12,” Paper presented at the 2000 Biomedical Engineering Society Annual Meeting, Seattle, WA, 2000.

73.J. Henderson, S. Salzberg, and K. H. Fasman, “Finding genes in DNA with a hidden Markov model,” J. Comput. Biol., 4(2): 127–141, 1997.

74.A. Krogh, I. Saira Mian, and D. Haussler, “A hidden Markov model that finds genes in E. coli DNA,” Nucleic Acids Res., 22(22): 4768–4778, 1994.

75.M. Borodovsky and J. McIninch, “GENMARK: Parallel gene recognition for both DNA strands,” Comput. Chem., 17(2): 123–133, 1993.

CHAPTER 8

Complex Life Science Multidatabase Queries

ZINA BEN MILED, NIANHUA LI, YUE HE, MALIKA MAHOUI, and OMRAN BUKHRES

8.1. INTRODUCTION

The confederation of widely distributed life science databases is a critical scientific problem which lies at the foundation of the nascent field of proteomics and of biological and biomedical phenotype analysis. Databases are the intermediaries between experimental observation and our ability to synthesize larger scale understanding of biological systems and the manifold interactions in which they participate and to which they respond. Unfortunately, the relevant databases have been organized around disciplinary interests, subject areas, or institutional convenience.

Collins et al. state, “The central information technology problems of the next decade will be the creation of a means through which to query a heterogeneous set of life science databases, generally via the Internet” [1, p. 688]. In order to query multiple life science databases, support for the interoperability among these databases should be provided. Facilitating the interoperability of heterogeneous, autonomous, geographically distributed, and/or semistructured Web databases (e.g., life science databases) is currently an emerging and active area of research [2–5]. The challenges arise from the large volume of the life science data, the rate at which these data are increasing in size, and the heterogeneity and complexity of the data format. There are also hundreds of life science databases which use different nomenclatures, file formats, and data access interfaces.

There are three different approaches to addressing the issue of interoperability among biological databases: data warehousing, link driven, and federated databases.

In data warehousing, remote databases are copied on a local server and a unique interface is built to allow multidatabase queries to be issued using this single interface. Data warehousing is based on a thorough understanding of the data schema of each individual database included in the data warehouse. The data schema is often not published for life science databases. Furthermore, as data continue to grow, data warehousing becomes impractical for biological studies that extend beyond a few databases. Several existing projects, such as the Genomics Unified Schema (GUS) [6], perform data integration by means of data warehousing.

Genomics and Proteomics Engineering in Medicine and Biology. Edited by Metin Akay. Copyright © 2007 the Institute of Electrical and Electronics Engineers, Inc.

The link-driven approach is based on providing static links between data or records in different databases. This approach often suffers from the fact that many cross references between various records in the databases are unidirectional. Additionally, the queries that can be issued by the users are limited to the scope predefined by the static links. The link-driven approach is used by several life science systems such as the Sequence Retrieval System (SRS) [7] and the National Center for Biotechnology Information (NCBI)/Entrez [8], which, for example, links sequence data from NCBI [9] to structure data from the Protein Data Bank (PDB) [10]. One of the major advantages of this approach is its ease of implementation.

Database federation has been studied in the context of relational and homogeneous databases [11] (e.g., business databases). More recently, TAMBIS [12], BioKleisli [13], and DiscoveryLink [14] have extended this approach to the life sciences. Federated databases provide a unified portal for the data in all the databases participating in the integration while preserving the local autonomy of these databases. This approach is based on the semantic understanding of the relationships between different objects in different databases. This feature is particularly important in the life sciences because of the complexity of the domain. Furthermore, there is no limit to the scope of the queries that can be issued against federated databases.

For each of the above-mentioned three approaches, a selected system that represents one of the leading efforts in the field is described next. The selected systems are GUS for data warehousing, SRS for link driven, and TAMBIS for federated databases. There are several other systems that aim at providing the users with a single-access interface to multiple biological databases. Some of these systems are discussed in Section 8.4.

GUS uses a data-warehousing approach to support the interoperability of life science databases. Sequences and their associated annotation are stored in GUS. These data are obtained from several sequence databases such as SwissProt and stored locally using a relational database model. The design of GUS addresses one of the major issues associated with data warehousing, which is the automatic update of the data warehouse. This process includes detecting when changes occur in the databases and subsequently how to initiate the necessary updates automatically in the data warehouse.
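The automatic-update process GUS addresses, detecting when a source has changed and then re-importing it, can be sketched with a version check. The function names and checksum strategy below are assumptions for illustration, not GUS internals:

```python
# Minimal sketch of warehouse refresh: poll each source for a version
# marker (a checksum here; release numbers or timestamps also work) and
# re-import only the sources whose marker changed.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def refresh(sources, fetch, last_seen, reimport):
    """fetch(name) returns the source's current raw dump; reimport(name, data)
    reloads it into the warehouse; last_seen maps name -> last checksum."""
    updated = []
    for name in sources:
        data = fetch(name)
        digest = checksum(data)
        if last_seen.get(name) != digest:  # new or changed source
            reimport(name, data)
            last_seen[name] = digest
            updated.append(name)
    return updated
```

In practice, change detection against autonomous Web databases is harder than this sketch suggests, since sources rarely expose stable version markers.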

SRS links several biological data sources and answers queries through a single front-end Web interface. The data sources are not transparent to the user in SRS. The user must specify which data sources should be used in order to answer a given query. This lack of transparency requires that the user be familiar with the various data sources integrated by SRS and their content. Data sources in SRS are semistructured flat files. Each of the data files is associated with a wrapper that extracts information in the form of objects from the corresponding data source. SRS also includes metadata that describe the various data sources.


TAMBIS integrates a specific set of life science databases that consists of protein and nucleic acid data sources. It is based on a global domain ontology, which consists of more than 1800 biological concepts. These concepts are used to express the multidatabase queries and to decompose these queries into subqueries that can be mapped to specific data sources. Unlike SRS, the integrated databases are transparent to the user.

TAMBIS uses the federated database approach to support the interoperability among biological databases. As mentioned earlier, database federation benefits from several enabling capabilities. However, these capabilities come at the expense of a complex design. One of these complexities arises from answering multidatabase queries given the limited query capabilities of the individual databases participating in the integration.

Consider, for example, the following query: What is the 3D structure of all alcohol dehydrogenases that belong to the enzyme family EC 1.1.1.1 and are located within the human chromosome region 4q21–4q23? This example query, among others, motivated the design of BACIIS [15], the Biological and Chemical Information Integration System discussed in this chapter. BACIIS uses a federated database approach similar to TAMBIS. The above example query is difficult because some of the integrated databases may not support enzyme family as an acceptable keyword, while others may not allow queries that use the logical AND operation to combine the two keywords enzyme family and human chromosome region. Additionally, some life science federated database systems, such as DiscoveryLink, require that the user enter all the databases that may be needed to answer a given multidatabase query. As will be discussed later in this chapter, the execution plans of complex life science queries may include intermediary databases that cannot directly accept the input or the output constraints of the user query. These intermediary databases are used to translate the attributes of one database to the attributes of another database. For example, SwissProt [16] can accept Online Mendelian Inheritance in Man (OMIM) [17] numbers and generate PDB IDs. Requiring that the user enter all the databases that may be involved in executing a query is not practical for life science databases, since the user may not be familiar with the numerous biological databases (more than 400) that are available. BACIIS uses an approach that allows the execution of this query and other complex queries in a completely transparent manner, hiding the complexities from the user while preserving the autonomy of each database participating in the integration.
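The intermediary-database idea above, with SwissProt translating OMIM numbers into PDB IDs so that a disease-oriented query can reach structure data, amounts to a path search over database capabilities. A minimal sketch follows, in which the capability table and attribute names are illustrative assumptions (only the SwissProt OMIM-to-PDB hop comes from the text):

```python
# Sketch of execution planning as breadth-first search: each database is
# an edge that translates one attribute into another, and a plan is a
# chain of hops from the query's input attribute to its output attribute.

from collections import deque

# (database, input attribute, output attribute) -- illustrative table
CAPABILITIES = [
    ("OMIM", "disease name", "OMIM number"),
    ("SwissProt", "OMIM number", "PDB ID"),
    ("PDB", "PDB ID", "3D structure"),
]

def plan(start_attr, goal_attr):
    """Return an ordered list of (database, in, out) hops, or None."""
    queue = deque([(start_attr, [])])
    seen = {start_attr}
    while queue:
        attr, path = queue.popleft()
        if attr == goal_attr:
            return path
        for db, a_in, a_out in CAPABILITIES:
            if a_in == attr and a_out not in seen:
                seen.add(a_out)
                queue.append((a_out, path + [(db, a_in, a_out)]))
    return None
```

A call such as `plan("disease name", "3D structure")` routes through SwissProt as an intermediary even though the user named neither SwissProt nor OMIM numbers, which is precisely the transparency BACIIS aims for.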

The aim of this chapter is to (1) introduce BACIIS and its functionality, (2) present an efficient query execution method that can be used to execute complex multidatabase queries while preserving the local autonomy of the geographically distributed life science Web databases and while hiding the complexity of the query execution plan from the user, and (3) expose the reader to other approaches for the integration of life science databases.

The general architecture of BACIIS is introduced in Section 8.2. The mapping engine and the query planner are the two main components of BACIIS that are involved in the execution of the multidatabase queries. These components are


discussed in Section 8.3. Related work is summarized in Section 8.4. Future trends are reviewed in Section 8.5.

8.2. ARCHITECTURE

Figure 8.1 shows the general three-tier architecture of BACIIS. This architecture hides the heterogeneity of the various databases from the user through a unified user interface. To accomplish the same results provided by BACIIS, a user would have to perform two tasks manually: (1) query the individual Web databases and (2) combine the returned information. In BACIIS, the wrappers (Fig. 8.1) act as intelligent proxy users that extract information from the individual Web databases. They perform the first task automatically on behalf of the user, while the information integration layer (Fig. 8.1) performs the second task, also automatically.

The various components of the information integration layer are shown in Figure 8.2. At the core of the information integration layer is a mediator. The mediator transforms the data from its format in the source database to a common format used for all the databases within the integration layer. The use of a mediator in the information integration system is not unique to BACIIS. Mediators have been used in various previous federated database systems [18–20]. The unique feature of BACIIS is that it uses a knowledge base to guide multidatabase query formulation and execution. In BACIIS, each remote database schema is mapped onto a domain knowledge base. This domain knowledge base is independent from the data schema of the remote databases. It will only change when the biological domain evolves to include new discoveries and concepts.
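The wrapper/mediator division of labor can be sketched as two small functions: a per-source wrapper that parses the native record format, and a mediator that re-keys the extracted fields onto shared ontology terms. The record layout, tags, and ontology term names below are invented for illustration:

```python
# Sketch of the wrapper/mediator split: the wrapper knows one source's
# native format; the mediator maps its fields onto a common vocabulary.

def swissprot_wrapper(raw):
    """Parse a (hypothetical) two-letter-tag flat-file record into fields."""
    fields = {}
    for line in raw.strip().splitlines():
        tag, _, value = line.partition("   ")
        fields.setdefault(tag, []).append(value.strip())
    return fields

# Source-specific tags mapped onto domain-ontology concepts (illustrative).
SWISSPROT_TO_ONTOLOGY = {"ID": "protein name", "OS": "organism", "SQ": "sequence"}

def mediate(fields, tag_map):
    """Re-key wrapper output so every source shares one vocabulary."""
    return {tag_map[t]: v for t, v in fields.items() if t in tag_map}

record = "ID   ADH1_HUMAN\nOS   Homo sapiens\nSQ   MSTAGKVIK"
print(mediate(swissprot_wrapper(record), SWISSPROT_TO_ONTOLOGY))
```

Because the ontology vocabulary is independent of any source schema, adding a new database requires only a new wrapper and tag map, not changes to the integration layer.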

FIGURE 8.1. BACIIS three-tier architecture.

FIGURE 8.2. Architecture of information integration layer.

The BACIIS knowledge base consists of an ontology [21] and the data source schema. The ontology is necessary for terminology consensus across the various databases within a domain. Furthermore, it provides the domain knowledge used for understanding concepts in the biological domain. Under the BACIIS knowledge base, specific data source models are defined for each database participating in the integration. These data source models describe the content of each database using the objects and the relations defined by the ontology.

The ontology is also used in the query generator (Fig. 8.2) module of the mediator to translate the user queries into domain-recognized terms from the ontology. Queries generated by the query generator module are forwarded to the query planning and execution module (Fig. 8.2) of the mediator. This module consists of the query planner, the mapping engine, and the execution engine. The query planner receives the queries generated by the query generator module and decomposes them into subqueries according to the ontology properties and relations. The query planner also defines the dependency among the subqueries. The mapping engine is used to formulate these subqueries and to determine which specific source databases contain the desired information. Based on the mapping found by