Wednesday, August 4, 2010

The Dimensional Model Star Schema - a quick overview

Data organization - DW vs. OLTP

The tables in a DW are organized precisely opposite to how they are done in a regular OLTP database - highly de-normalized in a DW vs. highly normalized in an OLTP system. The reason is that their objectives are different. A DW system is read-only and built for fast queries on aggregated data, while an OLTP system is read-write and built for fast Create-Retrieve-Update-Delete (CRUD) operations on individual transactions.

Database developers are used to living and breathing normalized systems.  Therefore for them, when transitioning to BI design, it becomes most important to think in terms of de-normalizing everything!

Dimensional modelling - Fact and Dimension tables

Data in a DW is organized into tables that are called Fact and Dimension tables.  The fact tables contain the actual data (often called measures), which is typically numeric and can be aggregated, such as the quantity and unit price of line-items from a sale invoice.  Dimension tables, on the other hand, contain the attributes of that data - the date of sale, item sold, customer name and address, and so on.  Foreign keys on the fact tables link to the primary keys on the dimension tables.  A Data Warehouse can have as few as 1 fact table and 1 dimension table.

Fact tables run deep - millions or sometimes even billions of rows.  Dimension tables are very wide (remember de-normalization!) but not deep - lots of columns but generally maxing out at 1 or 2 million rows.

The arrangement of the fact and dimension tables is called the Star Schema.  To visualize the star, think of the fact table sitting at the center, and lines radiating from it that connect to each dimension.  The instant advantage of the star schema is that each fact table row is just 1 join away from any of its dimensions (though a snowflake dimension is an occasionally-used exception), resulting in super-fast query results.  A query asking for data on sales of Mountain Bikes in California during the last calendar year will likely look at only 3 dimension tables, with each being linked directly to the fact table.  But the same query in the OLTP system, because of normalization, is going to hit a much greater number of joins in relational tables (state - customer address - customer id - invoice header - invoice detail, to determine just the California sales alone).

Data in the DW is not real-time.  It is loaded on a regular schedule from the OLTP system (and even other non-relational sources such as flat files and spreadsheets etc), often nightly, and often by using an ETL tool such as SSIS which can be scheduled to run automatically.

No comments:

Post a Comment