
Data Warehousing and OLAP

Table 2. A hybrid data cube/table output of a multidimensional query

| Country | Category           | 2005-1 | 2005-2 | 2005-3 | 2005-4 | 2006-1 | 2006-2 |
|---------|--------------------|--------|--------|--------|--------|--------|--------|
| Spain   | Parfums            |  17000 |  12000 |   3000 |  11000 |  18000 |  12000 |
|         | Pet meat           |  57000 |  52000 |  53000 |  56000 |  54000 |  54000 |
|         | Fresh vegetables   |  93000 | 125000 | 131000 | 103000 |  97000 | 133000 |
|         | Pet accessories    |  22000 |  32000 |  23000 |  19000 |  24000 |  35000 |
|         | Alcoholic drinks   |   5000 |   4200 |   3800 |   7300 |   5300 |   4100 |
|         | Garden & furniture |  12000 |  18000 |  19000 |  15000 |  14000 |  18000 |
| France  | Parfums            |  23000 |  17000 |   5000 |  17000 |  17000 |  13000 |
|         | Pet meat           |  77000 |  63000 |  72000 |  82000 |  64000 |  53000 |
|         | Fresh vegetables   | 102000 | 132000 | 126000 | 110000 |  99000 | 143000 |
| ...     | ...                |    ... |    ... |    ... |    ... |    ... |    ... |

…the queries). Data warehouses and OLAP tools are designed so that the aggregates need not be recalculated each time one of these operators is applied: the aggregations are precalculated at several levels of the hierarchies, making traversal of the dimension paths more efficient.
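As a rough illustration of this idea (a Python sketch with invented daily figures; the key shapes and helper names are not from the text), month- and quarter-level totals can be precomputed once at load time, so that a roll-up reads a small aggregate table instead of re-scanning all the base facts:

```python
from collections import defaultdict

# Hypothetical daily sales facts: (country, category, "YYYY-MM-DD") -> amount.
daily_sales = {
    ("Spain", "Parfums", "2005-01-15"): 900.0,
    ("Spain", "Parfums", "2005-02-03"): 400.0,
    ("France", "Pet meat", "2005-01-20"): 1200.0,
}

def precompute(level_key):
    """Aggregate the daily facts up to a coarser level of the time hierarchy."""
    agg = defaultdict(float)
    for (country, category, date), amount in daily_sales.items():
        agg[(country, category, level_key(date))] += amount
    return dict(agg)

# Precompute once at load time ...
by_month = precompute(lambda d: d[:7])                                   # "2005-01"
by_quarter = precompute(lambda d: f"{d[:4]}-Q{(int(d[5:7]) - 1) // 3 + 1}")

# ... so a roll-up from month to quarter reads a small precomputed table
# instead of re-aggregating every daily fact.
print(by_quarter[("Spain", "Parfums", "2005-Q1")])   # 1300.0
```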

Design and Implementation

One of the reasons to create a separate data repository for data analysis was to improve efficiency. The multidimensional model and the OLAP refinement operators are two powerful means towards this goal, but they must be complemented by an adequate data warehouse design and implementation.

There are, of course, many design methodologies for data warehouses (see, e.g., Gardner, 1998; Inmon, 2002; Kimball, 1996; Sapia, Blaschka, Höfling, & Dinter, 1998; Trujillo, Palomar, & Gómez, 2001; Tryfona, Busborg, & Christiansen, 1999). Some of them differ significantly from the rest, using entity-relationship extensions, UML extensions, or specific modeling paradigms. There are also international standards for this purpose, such as the OMG (Object Management Group) Common Warehouse Metamodel (CWM).

The following are some hints on the most important steps for designing a data warehouse (or, more precisely, each data mart) under the multidimensional model:

Choose a “process” or “domain” in the organization for which analytical processing is to be done, considering the data that complex reports, more general analyses, or data mining will require.

Choose the central fact and its granularity. Granularity should be fine enough to allow every interesting report and query, but no finer.

Identify the dimensions that characterize the domain, their levels and hierarchies, and the basic attributes of each level. Dimensions can vary from one domain to another, although they usually correspond to questions such as what, who, where, when, how, and so forth.

Determine and refine the measures for the facts (usually corresponding to questions such as how many or how much) and the attributes for the dimensions; a schematic sketch of such a design follows this list.
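As a schematic illustration of what these steps might produce (a Python sketch; every name here, including `sales_mart`, the level names, and the measures, is invented for illustration rather than taken from the text):

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str           # answers "what", "where", "when", ...
    levels: list[str]   # finest level first, e.g. day -> month -> quarter -> year

@dataclass
class FactSchema:
    fact: str                  # the central fact (step 2)
    grain: dict[str, str]      # dimension -> finest level kept (the granularity)
    measures: list[str]        # step 4: the "how many" / "how much" values
    dimensions: list[Dimension] = field(default_factory=list)

# A hypothetical design for a sales data mart:
sales_mart = FactSchema(
    fact="product_sale",
    grain={"time": "day", "product": "item", "geography": "store"},
    measures=["units_sold", "revenue"],
    dimensions=[
        Dimension("time", ["day", "month", "quarter", "year"]),    # when
        Dimension("product", ["item", "category", "department"]),  # what
        Dimension("geography", ["store", "city", "country"]),      # where
    ],
)
```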

Once the data warehouse is designed, it is time to start its implementation (Weir, Peng, & Kerridge, 2003). The first important decision for implementation is to determine the kind of physical environment that will hold the information. There are two main approaches:

Relational OLAP (ROLAP). Physically, the data warehouse will be implemented with a relational database.

Multidimensional OLAP (MOLAP). Physically, the data warehouse is constructed over special structures based on multidimensional matrices.

It can be argued that the MOLAP approach is the more natural fit for a data warehouse that uses a multidimensional model. However, there are pros and cons to both approaches. ROLAP has several advantages: there are many RDBMSs to choose from, with their typical tools (e.g., SQL, constraints, triggers, procedures), and the training and cost required to implement a ROLAP data warehouse are usually lower. MOLAP has a more direct correspondence between the logical level and the physical level, and it is usually more efficient.

At least in the inception of data warehouses, the ROLAP approach has been more successful. There are many methodologies to convert a multidimensional design into a relational database that implements it. The starflake structure (Kimball, 1996) is a well-known approach and is based on a central fact table, with foreign keys to base (fine-grained) dimension tables, which, in turn, refer to other, more coarse-grained dimension tables. These tables are called snowflake tables. Some additional tables (star tables) are added (redundantly) to improve efficiency.

Associated with the physical level, new technologies have been developed to make the data warehouse work effectively: tools to compute and instantiate the data cube, query optimization to select and refine queries on the data cube, specialized indexing techniques for improving aggregation, and many others.

Data Load and Refreshment

Finally, there is an important component to be considered in the implementation of a data warehouse. Once we have chosen the data we want to integrate into the data warehouse (whether internal or external), designed the schema that will organize all of this data, decided on a ROLAP or MOLAP implementation, and created the physical schema, we can start the hard, real implementation work. What is left? We have to extract and transform the data from the original sources (and from several schemas) and load it into the data warehouse (into the new schema).

Although this process may resemble the data migration processes that are common when one or more databases have to be moved to a new and different system, some specific issues make it considerably more complicated: the volume of data transferred is usually very large; the data has to be transformed deeply and comes from many different sources; some original databases cannot be disturbed by this process because they are needed for the daily transactional work; and, finally, the data has to be refreshed periodically (i.e., the data warehouse must be kept up to date, weekly or monthly, with the new data added to the transactional sources).

The complexity of this process and the fact that it must operate periodically motivate the construction of a dedicated system, called the extraction, transformation, and load (ETL) system.

The ETL system must perform many tasks:

Extraction of the transactional data: This must be done in such a way that the transactional database is not disturbed (e.g., at night).

Incorporation of external data: Since the data might be unstructured, it is necessary to use wrappers or scripts to convert data from text, spreadsheets, HTML, or other formats into the form required for the data warehouse.

Key creation and metadata creation: The data moved must be assigned new keys. The correspondence between the original data and the data in the warehouse must be traceable through the use of metadata.

Data integration, cleansing, transformation, and aggregation: The data must be crossed, compared, and integrated, and most of it is needed at some level of aggregation. It is not convenient to perform all of this on the transactional databases, because it would slow the transactional work.

Change identification: The system must determine how changes are propagated from the sources to the data warehouse (e.g., by delta files, time stamping, triggers, log files, or mixed techniques); a sketch of key creation and change identification follows this list.

Load planning and maintenance: Definition of the order and stages of the load process, load time windows (to minimize impact on the transactional sources), creation of indexes on the data warehouse, and so forth. Maintenance must include periodic assessments of the quality of the data stored in the data warehouse.
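A minimal sketch of two of these tasks, in Python: surrogate-key creation with a traceability mapping, and change identification by time stamping. The row shape, the `updated` column, and all names are assumptions made for illustration, not part of any particular ETL product:

```python
from datetime import datetime

# Source rows as extracted from a transactional database (illustrative shape).
source_rows = [
    {"src_id": "C-1042", "name": "parfums", "updated": "2006-02-01T02:10:00"},
    {"src_id": "C-2077", "name": "pet meat", "updated": "2006-02-03T01:05:00"},
]

surrogate_keys: dict[str, int] = {}   # source key -> warehouse key (metadata)
next_key = 1

def assign_surrogate_key(src_id: str) -> int:
    """Key creation: warehouse rows get their own keys, and the source-to-
    warehouse mapping is kept as metadata so lineage stays traceable."""
    global next_key
    if src_id not in surrogate_keys:
        surrogate_keys[src_id] = next_key
        next_key += 1
    return surrogate_keys[src_id]

def changed_since(rows, last_load: str):
    """Change identification by time stamping: pick up only rows touched
    after the previous refresh (delta files or log files are alternatives)."""
    cutoff = datetime.fromisoformat(last_load)
    return [r for r in rows if datetime.fromisoformat(r["updated"]) > cutoff]

for row in changed_since(source_rows, "2006-02-02T00:00:00"):
    wk = assign_surrogate_key(row["src_id"])
    clean = {"key": wk, "name": row["name"].strip().title()}  # light cleansing
    print("load into warehouse:", clean)
```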

Many ETL systems use an intermediate repository between the sources and the data warehouse in order to perform aggregation, integration, cleansing, and other processing (including metadata management) without disturbing either the original transactional databases or the destination data warehouse.

FUTURE TRENDS

Data warehousing technology is evolving quickly, and there are many research opportunities (Dinter, Sapia, Höfling, & Blaschka, 1999; Rizzi, 2003; Roddick, 1998; Roussopoulos, 1998; Samtani, Kumar, & Kambayashi, 1998; Widom, 1995) on many topics, such as data models, logical and physical design, implementation (the construction of the ETL system in particular), metadata standardization, and theoretical issues (e.g., materialized views, schema versioning, evolution), as well as other, more heterogeneous topics, such as data cleansing and transformation, XML metadata, OLAP query languages, the operability and friendliness of OLAP tools, and their distribution between client and server.


CONCLUSION

Data warehousing is an emerging technology that has spread across organizations and companies all over the world because it makes OLAP possible (i.e., real-time analytical processing of volumes of data that was simply unthinkable a decade ago). The use of separate data repositories, called data warehouses, organized under specific data models and specific physical implementations, together with new OLAP tools and operators, has boosted the friendliness and effectiveness of reporting and other analytical tools. Data warehouses are not only used as platforms for query and report tools oriented to “farming” reports; they also team very well with tools that have more “exploring” objectives, such as summarization and data mining.

REFERENCES

Dinter, B., Sapia, C., Höfling, G., & Blaschka, M. (1999). OLAP market and research: Initiating the cooperation. Journal of Computer Science and Information Management, 2(3), 23-37.

Gardner, S. R. (1998). Building the data warehouse. Communications of the ACM, 41(9), 52-60.

Golfarelli, M., Maio, D., & Rizzi, S. (1998). Dimensional fact model. International Journal of Human-Computer Studies, 43(5/6), 865-889.

Inmon, W. H. (1992). EIS and the data warehouse: A simple approach to building an effective foundation for EIS. Database Programming and Design, 5(11), 70-73.

Inmon, W. H. (2002). Building the data warehouse (3rd ed.). Chichester, UK: Wiley.

Kimball, R. (1996). The data warehouse toolkit: Practical techniques for building dimensional data warehouses. New York: Wiley.

Rizzi, S. (2003, September 8). Open problems in data warehousing: Eight years later (keynote slides). In H.-J. Lenz, P. Vassiliadis, M. Jeusfeld, & M. Staudt (Eds.), Design and management of data warehouses 2003: Proceedings of the 5th International Workshop, Berlin, Germany. Available online from http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-77/keynote.pdf

Roddick, J. F. (1998). Data warehousing and data mining: Are we working on the right things? In Y. Kambayashi, D. K. Lee, E.-P. Lim, Y. Masunaga, & M. Mohania (Eds.), Advances in database technologies (pp. 141-144). Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag.

Roussopoulos, N. (1998). Materialized views and data warehouses. SIGMOD Record, 27(1), 21-26.

Samtani, S., Kumar, V., & Kambayashi, Y. (1998). Recent advances and research problems in data warehousing. International Conference on Conceptual Modeling (ER), Singapore.

Sapia, C., Blaschka, M., Höfling, G., & Dinter, B. (1998). Extending the E/R model for the multidimensional paradigm. International Workshop on Data Warehouse and Data Mining (DWDM), Singapore.

Trujillo, J. C., Palomar, M., & Gómez, J. (2001). Designing data warehouses with OO conceptual models. IEEE Computer, 34(12), 66-75.

Tryfona, N., Busborg, F., & Christiansen, J. (1999). starER: A conceptual model for data warehouse design. International Workshop on Data Warehousing and OLAP (DOLAP '99), Kansas City, Missouri.

Weir, R., Peng, T., & Kerridge, J. (2003, September 8). Best practice for implementing a data warehouse: A review for strategic alignment. In H.-J. Lenz, P. Vassiliadis, M. Jeusfeld, & M. Staudt (Eds.), Design and management of data warehouses 2003: Proceedings of the 5th International Workshop, Berlin, Germany.

Widom, J. (1995). Research problems in data warehousing. International Conference on Information and Knowledge Management.

KEY TERMS

Data Cube: A preselection of aggregated data at several levels of detail according to several dimensions in a data mart. A data cube establishes the resolution and basic projection within which selections can be made. The aggregation detail of a data cube can be modified through OLAP operators such as drill-down and roll-up, and its dimensions can be modified by slice & dice and pivot.

Data Mart: The part of a data warehouse that gathers the information about a specific domain. Each data mart is usually viewed at the conceptual level as a multidimensional star or snowflake schema. Although each data mart is specialized in part of the organization's information, some dimensions can be redundantly replicated in several data marts. For instance, time is usually a dimension shared by all data marts.


ETL (Extraction, Transformation, Load) System: The system in charge of extracting the data from internal transactional databases and other sources (e.g., external data), transforming it to fit the data warehouse schema, loading the data initially, and refreshing it periodically. The design of the ETL system is generally the most time-consuming task in the construction of a data warehouse.

Farmers and Explorers: Two profiles of data warehouse users. Farmers are users of data warehouses and other analytical tools who generate periodic reports, such as sales by week, category, and department. Explorers are more ambitious users who try to understand the data better, looking for patterns in the data or generating new reports.

MOLAP (Multidimensional OLAP): Implementation of a data warehouse and associated OLAP tools over specific multidimensional data structures. This approach has a more direct mapping between the multidimensional conceptual schema and the multidimensional physical schema.

OLAP (On-Line Analytical Processing): The process of analyzing a database or data warehouse, which consists of heavy queries for constructing reports or showing the information in a highly aggregated or complex way. This kind of processing supports information retrieval and data analysis in the form of complex queries and reports.

OLTP (On-Line Transactional Processing): The usual processing in a transactional database, which consists of frequent queries and updates serving the applications of an organization. This kind of processing supports the daily operations of the software applications; it is read-and-write and is generally performed through SQL queries.

ROLAP (Relational OLAP): Implementation of a data warehouse and associated OLAP tools over a relational database and RDBMS. This approach must be based on an appropriate mapping between the multidimensional conceptual schema and the relational logical/physical schema. Its great advantage is that it inherits the existing technology and knowledge of classical relational database systems.

Snowflake Schema: A schema for organizing multidimensional data in which each dimension has many levels and there can be alternative paths for aggregation. This is the most usual schema for designing data marts.


Data Warehousing, Multi-Dimensional Data Models and OLAP

Prasad M. Deshpande

IBM Almaden Research Center, USA

Karthikeyan Ramasamy

Juniper Networks, USA

INTRODUCTION

Since the advent of information technology, businesses have been collecting vast amounts of data about their daily transactions. For example, a company keeps track of data regarding the sales of its various products at different stores over a period of time. Businesses can gain valuable insights by analyzing this data to spot trends and correlations in the data. Data warehousing, multidimensional analysis, and online analytical processing (OLAP) refer to a set of technologies that address the problem of business data analysis. Data warehousing has become the established paradigm for knowledge workers to sift through mountains of historical data in order to extract nuggets of business information. Data analysis tools have been used in various forms historically since the 1960s (Pendse, 2003). Recently, there has been a rapid growth in the industry, with the total worldwide OLAP market estimated at about $3.7 billion in 2003 (Pendse). This has also been an active area of research with many contributions in data warehouse design, storage, view selection, cube computation, indexing, and query evaluation.

BACKGROUND

A data warehouse is a copy of transaction data specifically structured for querying and analysis (Kimball, 1996). Businesses want to separate their operational data from the data used for analysis for various reasons such as performance, maintenance, and security. Operational data is optimized for online transaction processing (OLTP) transactions that include lots of updates and simple queries. Analysis queries, on the other hand, are typically read-only and need to aggregate large amounts of data. Data warehousing refers to the process of transforming the operational data, removing the extra fields that are not necessary for analysis, and storing it in a specialized store. In a typical scenario, the data in a data warehouse need not be current and is updated periodically with fresh data from the main store. This flexibility makes it possible to optimize the data warehouse for analytical queries rather than for updates. The data stored in a warehouse is typically multidimensional in nature and consists of a set of dimensions and metrics. In our sales example, the dimensions are products, stores, and time. The data records the sales value for different combinations of these dimensions. “Sales” is referred to as a metric in this case. Analysts might ask queries such as:

Q1. What are the total sales for each product?

Q2. What are the sales for the product “Skis” in the month of January over all stores?

To answer these queries, we would have to aggregate the data in different ways. For example, Q1 requires the addition of sales figures across all stores and over all time for each product. Multidimensional data analysis refers to this process of viewing and analyzing the data along different dimensions. A multidimensional data model is a multidimensional view of the data that identifies the various dimensions and metrics in it. Dimensions often have hierarchies on them along which they can be analyzed. For example, the time dimension has the hierarchy of time, date, month, quarter, and year. We could ask queries at different levels in the hierarchy, such as the monthly sales, quarterly sales, and yearly sales.
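As a toy illustration (the fact records below are invented), both queries reduce to aggregating over whichever dimensions the query does not mention:

```python
from collections import defaultdict

# Illustrative fact records: (product, store, month, sales).
facts = [
    ("Skis", "Store A", "Jan", 500), ("Skis", "Store B", "Jan", 300),
    ("Skis", "Store A", "Feb", 200), ("Boots", "Store A", "Jan", 400),
]

# Q1: total sales per product -- aggregate over all stores and all time.
q1 = defaultdict(int)
for product, _store, _month, sales in facts:
    q1[product] += sales
print(dict(q1))   # {'Skis': 1000, 'Boots': 400}

# Q2: sales of "Skis" in January over all stores -- slice, then aggregate.
q2 = sum(s for p, _, m, s in facts if p == "Skis" and m == "Jan")
print(q2)         # 800
```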

OLAP tools provide multidimensional analysis functionality such as aggregation, forecasting, ranking, rolling up, drilling down, etc. Though there is no standard definition of OLAP, it can be summarized in five keywords: fast analysis of shared multidimensional information (Pendse, 2003). The critical challenge for OLAP tools is to make the analysis fast so that it can be interactive. The result of the queries might be small, but the queries are quite expensive to compute since they aggregate a lot of data to get the results. For example, for Q1 we need to aggregate the sales data across all stores and over all time, which can be huge. OLAP systems use a wide variety of techniques such as specialized storage and indexing, pre-computation, and caching to provide interactive analysis.


CURRENT TECHNOLOGIES

Research in data warehousing and OLAP can be classified into the following broad areas.

Data Warehouse Design

A data warehouse is a collection of integrated information that has been designed for analytical querying. The source of this data could be one or more operational OLTP systems or other legacy systems. To get the data into a usable form, the source data needs to be processed to remove unnecessary attributes, aggregated, and mapped to the schema of the data warehouse. The major issues in building a data warehouse are schema design and maintenance issues such as data cleansing, initial loading, incremental updates, data purging, and metadata management. The warehouse schema is designed by first building a conceptual schema in terms of dimensions and metrics and then mapping it to a logical schema that might be relational or multidimensional. The methodologies proposed for schema integration and multidatabase systems can be applied to data cleansing and to mapping the source data to the destination schema. A data warehouse can be considered a set of materialized views on the source data. Updates in the source data need to be propagated to the data warehouse, which raises various issues such as consistency requirements, refresh timing, refresh frequency, refresh modes, and refresh techniques. OLAP systems typically don't have requirements as stringent as OLTP systems regarding the currency of data, so most systems update the data warehouse on a periodic basis. The updates could be done in an online fashion, with the warehouse still running, or in an off-line fashion, by shutting down the system during the refresh period. Updating the data also requires updating the various auxiliary structures, such as indices and precomputed aggregates, that might be built on it. There has been a lot of research on the problem of view maintenance; we cite only a representative sample (Labio, Yerneni, & Garcia-Molina, 1999; Mumick, Quass, & Mumick, 1997; Quass & Widom, 1997). Metadata management plays a key role in data warehouses for keeping track of what data is available, where it is available, mappings between the different data elements, transformations required, etc. Jarke, Lenzerini, Vassiliou, and Vassiliadis (2003) stress that the quality of a data warehouse can be assessed and improved using enriched, quality-focused metadata management and methodologies.
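As a minimal sketch of the idea behind incremental view maintenance (a simplification, not any specific algorithm from the cited papers; all names are invented), a materialized sum per product can be kept current by folding the source delta into the aggregate rather than recomputing it from the full fact data:

```python
from collections import defaultdict

# Materialized view: total sales per product, kept alongside the warehouse.
view = defaultdict(float)

def apply_delta(inserts, deletes):
    """Incremental maintenance: apply only the changed rows to the
    materialized aggregate instead of re-scanning all base facts."""
    for product, amount in inserts:
        view[product] += amount
    for product, amount in deletes:
        view[product] -= amount

apply_delta(inserts=[("Skis", 500.0), ("Boots", 200.0)], deletes=[])
apply_delta(inserts=[("Skis", 100.0)], deletes=[("Boots", 200.0)])
print(dict(view))   # {'Skis': 600.0, 'Boots': 0.0}
```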

Data Modeling and Storage


The data in a data warehouse is multidimensional in nature. Though this data could be modeled in traditional ways such as ER modeling or relational modeling, it is more intuitive to think of it in terms of dimensions and facts. Facts represent the entity being measured and are a function of the dimension values. In our sales example, Product, Store, and Time are dimensions, and Sales is a metric. Typically the values for a dimension can be grouped in a hierarchical tree structure. For example, the hierarchy for the time dimension is shown in Figure 1. The analyst can view data at different levels along the hierarchy. Different aggregation functions can be applied to the lower-level data in order to get data at higher levels. Some commonly used aggregation functions are sum, count, min, max, and average. For example, the analyst might want to see the total sales figure at a monthly level. In this case, the aggregate function “sum” will be applied to sales values within each month to get the monthly figure. Some dimensions can have multiple hierarchies defined on them. For example, another hierarchy on the time dimension is week, quarter, year.

The multidimensional data can be stored and organized in different ways. There are two contrasting approaches to this called relational OLAP (ROLAP) and multidimensional OLAP (MOLAP). In a ROLAP system, the data is stored in standard relational tables. The analytical engine is built on top of the relational system and uses standard SQL to access data in the tables. The multidimensional data model is most commonly mapped onto a star schema. A star schema consists of a set of fact tables and dimension tables. A fact table has measure attributes that record the facts and dimension attributes that form a foreign key to the dimension tables. The dimension tables store the dimension values, their attributes, and the hierarchies on them. The star schema for the sales example consists of a fact table called Sales and three dimension tables called Product, Time, and Store as shown in Figure 2.

Figure 1. Hierarchy in the Time dimension: Year (2004) → Quarter (Q1, Q2, Q3, Q4) → Month (Jan, Feb, Mar, …) → Day (01/01, 01/02, …)


Figure 2. Star schema: a central Sales fact table (ProductId, TimeId, StoreId, Sales, NumUnits) with foreign keys to three dimension tables — Product (Id, Sub-Category, Category), Time (Id, Date, Month, Quarter, Year), and Store (Id, Location, City, County, State, Country)


The dimension tables in a star schema are de-normalized in order to reduce the number of joins during query processing. The alternative approach is a snowflake schema, in which each dimension table is normalized into multiple tables. The advantages of ROLAP are that it uses tested relational technology with well-developed query processing and optimization, is highly scalable, can coexist with other data sources, and does not need any specialized storage mechanisms.
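The following runnable sketch builds a cut-down version of the Figure 2 star schema in SQLite (the Store dimension is omitted, and the table and column names are adapted for brevity; the rows are invented) and issues a typical star join that rolls sales up to category by quarter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: values, attributes, and the (de-normalized) hierarchy.
    CREATE TABLE product  (id INTEGER PRIMARY KEY, sub_category TEXT, category TEXT);
    CREATE TABLE time_dim (id INTEGER PRIMARY KEY, date TEXT, month TEXT,
                           quarter TEXT, year INTEGER);
    -- Fact table: measure attributes plus one foreign key per dimension.
    CREATE TABLE sales (product_id INTEGER REFERENCES product(id),
                        time_id    INTEGER REFERENCES time_dim(id),
                        sales REAL, num_units INTEGER);
""")
con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "Skis", "Sports"), (2, "Boots", "Sports")])
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                [(1, "2004-01-05", "Jan", "Q1", 2004),
                 (2, "2004-04-02", "Apr", "Q2", 2004)])
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, 1, 500.0, 5), (1, 2, 200.0, 2), (2, 1, 400.0, 4)])

# A typical star join: roll the sales measure up to category x quarter.
for row in con.execute("""
        SELECT p.category, t.quarter, SUM(s.sales)
        FROM sales s
        JOIN product  p ON s.product_id = p.id
        JOIN time_dim t ON s.time_id    = t.id
        GROUP BY p.category, t.quarter"""):
    print(row)   # ('Sports', 'Q1', 900.0) and ('Sports', 'Q2', 200.0)
```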

MOLAP systems, on the other hand, build a specialized data store that allows multidimensional access to the data. The data could be stored as multidimensional arrays, but specialized techniques are necessary to handle large arrays and sparsity of data. A lot of research has been done in developing specialized multidimensional storage and the related query processing (Cheung, Zhou, Kao, Kan, & Lee, 2001; Deshpande, Ramasamy, Shukla, & Naughton, 1998; Sarawagi & Stonebraker, 1994; Zhao, Deshpande, & Naughton, 1997). MOLAP systems typically have better performance due to specialized storage structures, but the downside is the lack of scalability, size blowup, and the need to re-implement storage and query processing.
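A bare-bones illustration of the MOLAP idea (invented data and sizes): once each dimension's members are given fixed positions, a cell is located by index arithmetic in a single linear array rather than by a join or search, with sparsity as the obvious cost:

```python
# MOLAP-style storage: fix an order for each dimension's members, then store
# the measure in one dense array addressed by arithmetic.
products = ["Skis", "Boots", "Poles"]
stores   = ["A", "B"]
months   = ["Jan", "Feb", "Mar"]

cells = [0.0] * (len(products) * len(stores) * len(months))

def offset(p: int, s: int, m: int) -> int:
    """Linearize (product, store, month) into a single array index."""
    return (p * len(stores) + s) * len(months) + m

cells[offset(0, 1, 0)] = 300.0          # Skis, store B, Jan

# Direct multidimensional access: no join, just index arithmetic.
print(cells[offset(0, 1, 0)])

# The downside is sparsity: most dimension combinations have no sale, so most
# of the array is zeros -- which is why real MOLAP engines chunk and compress.
density = sum(c != 0 for c in cells) / len(cells)
print(f"{density:.0%} of cells populated")   # 6% here
```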

View Selection and Size Estimation

In order to answer queries efficiently, many OLAP systems pre-compute aggregates/views along some dimensions and store them in the data warehouse. However, the number of possible aggregates is exponential in the number of dimensions, and due to space and time limitations, it is impossible to compute them all. Under these constraints, it is important to choose a set of aggregates to pre-compute so as to reduce the overall query execution time. Many algorithms have been proposed for this problem, including a greedy optimization algorithm (Harinarayan, Rajaraman, & Ullman, 1996), a constrained lattice solution (Baralis, Paraboschi, & Teniente, 1997), a simple size-based algorithm (Shukla, Deshpande, & Naughton, 1998), and a maintenance-time-constrained solution (Gupta & Mumick, 1999). Since this choice is based on the sizes of the aggregates, a related problem is to estimate those sizes, for which statistical and sampling-based methods have been proposed (Shukla, Deshpande, Naughton, & Ramasamy, 1996). The data warehouse needs to be aware of the materialized views and exploit them during query processing: when a view can be used, queries against the warehouse are rewritten to run against the pre-computed views rather than the base data (Halevy, 2001; Park, Kim, & Lee, 2001; Srivastava, Dar, Jagadish, & Levy, 1996).
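The sketch below is in the spirit of the greedy algorithm of Harinarayan et al. (1996), with invented lattice sizes and a simplified cost model: each round materializes the view whose presence most reduces the total cost of answering every group-by from its cheapest materialized ancestor.

```python
# Estimated row counts for each group-by in the cube lattice (invented numbers).
size = {
    frozenset({"product", "store", "time"}): 1_000_000,  # base data, always kept
    frozenset({"product", "store"}): 200_000,
    frozenset({"product", "time"}): 300_000,
    frozenset({"store", "time"}): 50_000,
    frozenset({"product"}): 1_000,
    frozenset({"store"}): 100,
    frozenset({"time"}): 365,
    frozenset(): 1,
}

def cost(view, materialized):
    """A query at `view` is answered from its cheapest materialized ancestor,
    i.e., any materialized group-by whose attributes contain view's."""
    return min(size[m] for m in materialized if view <= m)

def greedy_select(k):
    """Pick k views to materialize, each round taking the candidate that
    most reduces the total cost of answering all group-bys."""
    chosen = {frozenset({"product", "store", "time"})}
    for _ in range(k):
        def benefit(cand):
            return sum(cost(v, chosen) - cost(v, chosen | {cand}) for v in size)
        chosen.add(max((v for v in size if v not in chosen), key=benefit))
    return chosen

print(greedy_select(2))
```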

Query Processing

OLAP operations correspond to exploring the multidimensional data space by moving up the dimension hierarchy (roll up), moving down (drill down), restricting to a dimension value (slice), selecting an aggregated subspace (dice), and cross-tabulating (pivot). The basic requirement for answering all these queries is aggregation, so OLAP systems need specialized query processing and indexing techniques to handle complex aggregate queries efficiently. Depending on the architecture (ROLAP or MOLAP), different techniques can be used. For relational systems, the SQL language has been extended to be able to specify OLAP kinds of operations. One of the earliest extensions was the CUBE operator, proposed by Gray et al. (1997) in order to specify a set of group-bys that are computed together. SQL extensions involving array-based calculations for complex modeling and spreadsheets have also been proposed recently (Witkowski et al., 2003). In the multidimensional world, there have been efforts to define a query language specialized for multidimensional data, such as the MDX language (Whitehorn, Zare, & Pasumansky, 2003).
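For intuition, the set of group-bys that a CUBE over n attributes requests is simply every subset of those attributes, from the full grouping down to the grand total; a few lines of Python enumerate them (the dimension names are illustrative):

```python
from itertools import combinations

def cube_groupbys(dims):
    """The group-bys that CUBE(dims) asks for: every subset of the
    grouping attributes, from the full grouping down to the grand total."""
    for r in range(len(dims), -1, -1):
        yield from combinations(dims, r)

for g in cube_groupbys(("store", "product")):
    print(g or ("<grand total>",))
# ('store', 'product'), ('store',), ('product',), ('<grand total>',)
```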

One of the important problems in OLAP is computing and organizing the cube. The cube is the set of all possible aggregates over the dimensions. For example, a Sales cube on the Store and Product dimensions will have the following group-bys: Sales by (Store, Product), Sales by (Store), Sales by (Product), and the total Sales. Most techniques for computing the cube try to share the common computations among the different group-bys in order to reduce the overall time (Agarwal et al., 1996; Ross & Srivastava, 1997; Zhao et al., 1997). Since a cube can be very big, various ways to restrict it have been proposed, such as the iceberg cube, which considers only aggregate values above a certain threshold (Beyer & Ramakrishnan, 1999; Fang, Shivakumar, Garcia-Molina, Motwani, & Ullman, 1998; Xin, Han, Li, & Wah, 2003). Some systems reduce the size by computing compressed cubes by approximation, building specialized indices, or otherwise condensing the cube (Barbara & Sullivan, 1997; Lakshmanan, Pei, & Han, 2002; Sismanis, Deligiannakis, Roussopoulos, & Kotidis, 2002). Another interesting class of OLAP queries is top-k queries, which return the k tuples having the highest or lowest values for some attribute, expression, or function (Carey & Kossmann, 1997). In ROLAP systems, the join between the fact table and the dimension tables is a frequent operation that can be quite expensive; join indices and bitmap indices have been proposed to make this operation more efficient (Chan & Ioannidis, 1998). Many OLAP systems build a mid-tier cache to exploit the locality in OLAP queries and reuse results. Several architectures for a multidimensional cache specialized for OLAP have been proposed in the literature (Dar, Franklin, Jonsson, Srivastava, & Tan, 1996; Deshpande et al., 1998; Scheuermann, Shim, & Vingralek, 1996).
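A naive sketch of the iceberg condition follows (the facts are invented, and real iceberg algorithms such as BUC prune during computation rather than filtering afterwards, which is the whole point of their efficiency): compute each group-by of the cube but keep only cells whose aggregate clears the threshold.

```python
from collections import defaultdict
from itertools import combinations

facts = [  # illustrative (store, product, sales) rows
    ("A", "Skis", 500), ("A", "Boots", 40), ("B", "Skis", 300), ("B", "Poles", 20),
]

def iceberg_cube(rows, dims=("store", "product"), threshold=100):
    """Compute every group-by of the cube, keeping only the cells whose
    aggregate clears the threshold (the 'iceberg' condition)."""
    kept = {}
    for r in range(len(dims) + 1):
        for group in combinations(range(len(dims)), r):   # attribute positions
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                agg[key] += row[-1]                        # sum the measure
            kept.update({(group, k): v for k, v in agg.items() if v >= threshold})
    return kept

for cell, total in iceberg_cube(facts).items():
    print(cell, total)   # e.g. ((0, 1), ('A', 'Skis')) 500, but not Boots/Poles
```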

REFERENCES


Agarwal, S., Agrawal, R., Deshpande, P. M., Gupta, A., Naughton, J. F., Ramakrishnan, R., et al. (1996). On the computation of multidimensional aggregates. Proceedings of the 22nd International Conference on VLDB (pp. 506-521).

Baralis, E., Paraboschi, S., & Teniente, E. (1997). Materialized view selection in a multidimensional database. Proceedings of the 23rd International Conference on VLDB (pp. 156-165).

Barbara, D., & Sullivan, M. (1997). Quasi-cube: Exploiting approximation in multidimensional databases. SIGMOD Record, 26, 12-17.

Beyer, K., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg cubes. Proceedings of the 1999 ACM SIGMOD Conference (pp. 359-370).

FUTURE TRENDS

Typically, OLAP works on data that is periodically refreshed. However, in some cases, currency of data might be a critical requirement. Providing real-time decision support in such a case is still a challenge. Also, traditionally, OLAP has dealt with structured data. However, there is a lot of data that is unstructured, in the form of text fields and documents. Analyzing this text data to infer meaning and sentiment opens up new problems. There is work in probabilistic OLAP aimed at providing an OLAP kind of analysis on a combination of structured and unstructured data.

CONCLUSION

Data warehousing and OLAP are becoming increasingly important for business intelligence in order to extract key insights from the vast amounts of data being collected. Fast analysis is the goal for these systems so that users can interactively analyze the data. There are a lot of challenges in building such a system, including data modeling, schema design, loading, maintenance, query processing, etc. After a late start compared to the industry, there has been a lot of active research in this area contributing key new technologies. Among the OLAP vendors there are two main camps: those that build a specialized multidimensional engine and those trying to push a lot of OLAP functionality into relational databases. Both approaches have their merits, and the end result might just be a hybrid system in which a multidimensional middle tier sits on a relational back-end.

Carey, M. J., & Kossmann, D. (1997). Processing top N and bottom N queries. IEEE Data Engineering Bulletin, 20, 12-19.

Chan, C. Y., & Ioannidis, Y. E. (1998). Bitmap index design and evaluation. Proceedings of the 1998 ACM SIGMOD Conference (pp. 355-366).

Cheung, D. W., Zhou, B., Kao, B., Kan, H., & Lee, S. D. (2001). Towards the building of a dense-region-based OLAP system. Data and Knowledge Engineering, 36(1), 1-27.

Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., & Tan, M. (1996). Semantic data caching and replacement. Proceedings of the 22nd International Conference on VLDB (pp. 330-341).

Deshpande, P. M., Ramasamy, K., Shukla, A., & Naughton, J. F. (1998). Caching multidimensional queries using chunks. Proceedings of the 1998 ACM SIGMOD Conference (pp. 259-270).

Fang, M., Shivakumar, N., Garcia-Molina, H., Motwani, R., & Ullman, J. (1998). Computing iceberg queries efficiently. Proceedings of the 24th International Conference on VLDB (pp. 299-310).

Gupta, H., & Mumick, I. S. (1999). Selection of views to materialize under a maintenance cost constraint. ICDT ’99, Seventh International Conference (pp. 453-470).

Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., & Pirahesh, H. (1997). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1, 29-53.

137

TEAM LinG

Data Warehousing, Multi-Dimensional Data Models and OLAP

Halevy, A. Y. (2001). Answering queries using views: A survey. The VLDB Journal, 10, 270-294.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. Proceedings of the 1996 ACM SIGMOD Conference (pp. 205-216).

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2003). Fundamentals of data warehouses (2nd rev. ed.). New York: Springer-Verlag.

Kimball, R. (1996). The data warehouse toolkit: Practical techniques for building dimensional data warehouses. Hoboken, NJ: Wiley, John & Sons.

Labio, W. J., Yerneni, R., & Garcia-Molina, H. (1999). Shrinking the warehouse update window. Proceedings of the 1999 ACM SIGMOD Conference (pp. 383-394).

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cubes: How to summarize the semantics of a data cube. Proceedings of the 28th International Conference on VLDB (pp. 778-789).

Mumick, I. S., Quass, D., & Mumick, B. S. (1997). Maintenance of data cubes and summary tables in a warehouse. Proceedings of the 1997 ACM SIGMOD Conference (pp. 100-111).

Park, C.-S., Kim, M. H., & Lee, Y.-J. (2001). Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. Proceedings of the 17th International Conference on Data Engineering (pp. 515-523).

Pendse, N. (2003). The OLAP report. Retrieved from http://www.olapreport.com/

Quass, D., & Widom, J. (1997). On-line warehouse view maintenance. Proceedings of the 1997 ACM SIGMOD Conference (pp. 393-404).

Ross, K., & Srivastava, D. (1997). Fast computation of sparse datacubes. Proceedings of the 23rd International Conference on VLDB (pp. 116-125).

Sarawagi, S., & Stonebraker, M. (1994). Efficient organization of large multidimensional arrays. Proceedings of the 10th International Conference on Data Engineering (pp. 328-336).

Scheuermann, P., Shim, J., & Vingralek, R. (1996). Watchman: A data warehouse intelligent cache manager. Proceedings of the 22nd International Conference on VLDB (pp. 51-52).

Shukla, A., Deshpande, P. M., Naughton, J. F., & Ramasamy, K. (1996). Storage estimation for multidimensional aggregates in the presence of hierarchies. Proceedings of the 22nd International Conference on VLDB (pp. 522-531).

Shukla, A., Deshpande, P. M., & Naughton, J. F. (1998). Materialized view selection for multidimensional datasets. Proceedings of the 24th International Conference on VLDB (pp. 488-499).

Sismanis, Y., Deligiannakis, A., Roussopoulos, N., & Kotidis, Y. (2002). Dwarf: Shrinking the petacube. Proceedings of the 2002 ACM SIGMOD Conference (pp. 464-475).

Srivastava, D., Dar, S., Jagadish, H. V., & Levy, A. Y. (1996). Answering queries with aggregation using views. Proceedings of the 22nd International Conference on VLDB (pp. 318-329).

Whitehorn, M., Zare, R., & Pasumansky, M. (2003). Fast track to MDX. New York: Springer-Verlag.

Witkowski, A., Bellamkonda, S., Bozkaya, T., Dorman, G., Folkert, N., Gupta, A., Sheng, L. & Subramanian, S. (2003). Spreadsheets in RDBMS for OLAP. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 9-12).

Wu, M. C., & Buchmann, A. P. (1997). Research issues in data warehousing. Datenbanksysteme in Büro, Technik und Wissenschaft (BTW), GI-Fachtagung, Ulm, 5.-7. März 1997, Proceedings (pp. 61-82).

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the 29th International Conference on VLDB (pp. 476-487).

Zhao, Y., Deshpande, P. M., & Naughton, J. F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the 1997 ACM SIGMOD Conference (pp. 159-170).

KEY TERMS

Aggregate Function: A function that is used to aggregate a set of values to get a single value.

Cube: A collection of data aggregated at all possible levels over the dimensions.

Dice: Selecting a range on multiple dimensions to select a sub-cube of the original space.

Dimension: This refers to an axis along which the facts are recorded.


Dimension Table: A database table that stores the different possible values for a dimension, the attributes for those values, and the hierarchical values to which it maps.

Dense Data: Data that has metric values for a substantial percentage of all possible combinations of the dimension values.

Drill Down: The process of moving down a level along a dimension hierarchy during analysis. Drill down results in detailed data at a finer level of granularity.

Fact Table: A database table that stores the base facts that consist of values of different metrics for different combinations of dimension values.

Hierarchy: A hierarchy on a dimension represents the different levels at which data can be aggregated and viewed along that dimension.

Hybrid OLAP: A hybrid architecture that stores the base data in a relational form, but also builds specialized persistent multidimensional structures on it to efficiently answer multidimensional queries.

Metric/Measure: A value that represents a certain fact that is being recorded in the transaction.

MOLAP: Architecture for OLAP systems that uses specialized multidimensional storage and access methods, such as arrays, for the data.

Pre-Computation: This refers to pre-computing commonly required aggregates that are expensive to compute on the fly.

ROLAP: Architecture for OLAP systems that uses a relational system as the storage mechanism for the data.

Roll Up: The process of moving up a level along a dimension hierarchy during analysis. Roll up results in more aggregated data at a coarser level of granularity.

Slice: Restricting the area of interest to a single value along a given dimension.

Snowflake Schema: A database schema with a fact table and a set of dimension tables. There are multiple tables per dimension with each table storing one level in the dimension hierarchy.

Sparse Data: Data that has metric values for a very small percentage of all possible combinations of the dimension values.

Star Schema: A database schema that consists of a single fact table and a single table per dimension. The dimension table stores the entire hierarchy for that dimension.
