DATA WAREHOUSING Database System Concepts-7th Edition 11.7 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 11.7 ©Silberschatz, Korth and Sudarshan th Edition DATA WAREHOUSING
Data Warehousing ■ Data sources often store only current data,not historical data Corporate decision making requires a unified view of all organizational data, including historical data A data warehouse is a repository(archive)of information gathered from multiple sources,stored under a unified schema,at a single site Greatly simplifies querying,permits study of historical trends 。 Shifts decision support query load away from transaction processing systems Database System Concepts-7th Edition 11.8 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 11.8 ©Silberschatz, Korth and Sudarshan th Edition Data Warehousing ▪ Data sources often store only current data, not historical data ▪ Corporate decision making requires a unified view of all organizational data, including historical data ▪ A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site • Greatly simplifies querying, permits study of historical trends • Shifts decision support query load away from transaction processing systems
Data Warehousing data source 1 data loaders data source 2 DBMS query and analysis tools data warehouse data source n Database System Concepts-7th Edition 11.9 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 11.9 ©Silberschatz, Korth and Sudarshan th Edition Data Warehousing
Design Issues When and how to gather data Source driven architecture:data sources transmit new information to warehouse either continuously or periodically (e.g.,at night) Destination driven architecture:warehouse periodically requests new information from data sources Synchronous vs asynchronous replication Keeping warehouse exactly synchronized with data sources(e.g., using two-phase commit)is often too expensive Usually OK to have slightly out-of-date data at warehouse -Data/updates are periodically downloaded form online transaction processing (OLTP)systems. What schema to use ·Schema integration Database System Concepts-7th Edition 11.10 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 11.10 ©Silberschatz, Korth and Sudarshan th Edition Design Issues ▪ When and how to gather data • Source driven architecture: data sources transmit new information to warehouse ▪ either continuously or periodically (e.g., at night) • Destination driven architecture: warehouse periodically requests new information from data sources • Synchronous vs asynchronous replication ▪ Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is often too expensive ▪ Usually OK to have slightly out-of-date data at warehouse ▪ Data/updates are periodically downloaded form online transaction processing (OLTP) systems. ▪ What schema to use • Schema integration
More Warehouse Design Issues Data transformation and data cleansing E.g.,correctmistakes in addresses(misspellings,zip code errors) Merge address lists from different sources and purge duplicates How to propagate updates Warehouse schema may be a(materialized)view of schema from data sources View maintenance What data to summarize Raw data may be too large to store on-line Aggregate values(totals/subtotals)often suffice Queries on raw data can often be transformed by query optimizer to use aggregate values Database System Concepts-7th Edition 11.11 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 11.11 ©Silberschatz, Korth and Sudarshan th Edition More Warehouse Design Issues ▪ Data transformation and data cleansing • E.g., correct mistakes in addresses (misspellings, zip code errors) • Merge address lists from different sources and purge duplicates ▪ How to propagate updates • Warehouse schema may be a (materialized) view of schema from data sources ▪ View maintenance ▪ What data to summarize • Raw data may be too large to store on-line • Aggregate values (totals/subtotals) often suffice • Queries on raw data can often be transformed by query optimizer to use aggregate values