
Wednesday, August 18, 2010

SSIS 2008 Transforms: Term Extraction and Term Lookup

SSIS 2008 includes a very good lineup of data flow transformation components.  In this article, we take a brief look at a pair of very interesting, related transforms that help us analyze unstructured data: Term Extraction and Term Lookup.

The Term Extraction transform
Have you ever needed to extract the nouns and noun phrases from a text column (or a text file)?  And while writing that long program to do it, did you wish there were a simpler way?  If the answers are "yes" and "yes", then the Term Extraction transform is your friend!  This transform analyzes free-flowing text in your data source and pulls out the nouns and noun phrases it finds, along with the frequencies with which they occur.
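For instance, fed a row of text like the one below, the transform's output might look something like this (illustrative only - the actual results depend on the transform's part-of-speech tagging and your settings):

    Input row:  "The license key failed after the SQL Server 2008 upgrade."
    Extracted:  "license key" (1), "SQL Server 2008" (1), "upgrade" (1)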

Advanced settings for the Term Extraction transform let you extract just single-word nouns ("SQL"), just noun phrases ("SQL Server 2008"), or both.  The score itself can be a simple count of the frequency of occurrence, or it can be computed using a formula called TFIDF, short for Term Frequency-Inverse Document Frequency.  The TFIDF calculation delivers a weighted result, based upon the frequency of the word or phrase, the number of rows in which it occurs, and the total number of rows in the data.  Try out both approaches to see which one best meets your needs.
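For reference, the documentation describes the calculation as roughly the following (the log base isn't spelled out there, so treat the worked number as illustrative):

    TFIDF(T) = (frequency of T) * log( (#rows in input) / (#rows containing T) )

    Example: a term occurring 50 times across 10 of 1,000 input rows scores
    50 * log(1000 / 10) = 50 * log(100), i.e. 100 if the log is base 10.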

The Term Lookup transform
Term Lookup also breaks incoming text into nouns and noun phrases; however, it then compares these against a set of user-defined terms and delivers frequency counts of only the matching ones, on a row-by-row basis.  So, once you have the list of terms from the earlier (Term Extraction) transform, you can review and edit that list to retain only the terms of interest, and then run the Term Lookup transform against it to get counts of the terms matched on each row.
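As a sketch, the curated reference that Term Lookup matches against can be as simple as this (the table and column names here are my own invention, not anything SSIS mandates):

    -- Hypothetical reference table holding the curated term list
    CREATE TABLE dbo.SupportTerms
    (
        Term nvarchar(128) NOT NULL PRIMARY KEY
    );

    INSERT INTO dbo.SupportTerms (Term)
    VALUES (N'license key'), (N'installation error'), (N'upgrade');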

Since the results of this transform are delivered at row-level detail, a given row may appear multiple times in the output if it contains more than one matched term.  If you need to group the results by the term being matched, you'll need to use the Aggregate transform in conjunction with the Term Lookup transform.
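In SQL terms, the Aggregate transform is doing roughly the following - assuming you have landed the Term Lookup output in a hypothetical table named dbo.TermLookupResults, carrying the Term and Frequency columns the transform adds:

    -- Roll the row-level matches up to one row per term
    SELECT Term, SUM(Frequency) AS TotalOccurrences
    FROM dbo.TermLookupResults
    GROUP BY Term
    ORDER BY TotalOccurrences DESC;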

Using these powerful transforms to analyze unstructured data
For an actual project, it can be of great value to run the Term Extraction transform initially and then periodically thereafter, and have a domain expert or business analyst review the results and prune the list.  That curated list is then used by the Term Lookup transform, which would run regularly, perhaps weekly.  As a practical example, take the case of the customer-support call-in system at a packaged-software manufacturer.  When logging a support call, the technician will typically select the product line and application module, perhaps pick a generic pre-programmed problem code from a dropdown, and then enter a few words of free-format text noting the problem as described by the customer, along with some comments.  Over time, the database may grow to millions of rows.  Using this pair of transforms on that free-format field, management can narrow the data down to the most frequently encountered problems within each product line, thereby getting a clearer view of the areas to focus on.  Powerful results - from unstructured data!
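To make the reporting side concrete, here is a sketch of the kind of query management might run over the landed results (table and column names are illustrative only):

    -- Top 5 most frequently matched problem terms within each product line
    SELECT ProductLine, Term, Hits
    FROM
    (
        SELECT ProductLine, Term, SUM(Frequency) AS Hits,
               ROW_NUMBER() OVER (PARTITION BY ProductLine
                                  ORDER BY SUM(Frequency) DESC) AS rn
        FROM dbo.TermLookupResults
        GROUP BY ProductLine, Term
    ) ranked
    WHERE rn <= 5
    ORDER BY ProductLine, Hits DESC;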

Wednesday, August 4, 2010

The Dimensional Model Star Schema - a quick overview

Data organization - DW vs. OLTP

The tables in a DW are organized precisely opposite to those in a regular OLTP database: highly de-normalized in the DW vs. highly normalized in the OLTP system.  The reason is that their objectives differ.  A DW is read-only (users query it; only the load process writes to it) and built for fast queries over aggregated data, while an OLTP system is read-write and built for fast Create-Retrieve-Update-Delete (CRUD) operations on individual transactions.

Database developers live and breathe normalized systems.  So when transitioning to BI design, the most important mental shift for them is to start thinking in terms of de-normalizing everything!

Dimensional modelling - Fact and Dimension tables

Data in a DW is organized into tables called Fact and Dimension tables.  Fact tables contain the actual data (often called measures), which is typically numeric and can be aggregated, such as the quantity and unit price of the line items on a sales invoice.  Dimension tables, on the other hand, contain the attributes of that data: the date of sale, the item sold, the customer's name and address, and so on.  Foreign keys on the fact table link to the primary keys of the dimension tables.  A Data Warehouse can have as few as one fact table and one dimension table.
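As a minimal sketch of the two table types (names and columns are illustrative, not a recommended design):

    -- Dimension table: one row per product, wide with descriptive attributes
    CREATE TABLE dbo.DimProduct
    (
        ProductKey  int IDENTITY(1,1) PRIMARY KEY,
        ProductName nvarchar(100) NOT NULL,
        Category    nvarchar(50)  NOT NULL,
        Color       nvarchar(20)  NULL
    );

    -- Fact table: numeric measures plus foreign keys pointing at the dimensions
    CREATE TABLE dbo.FactSales
    (
        ProductKey int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        DateKey    int   NOT NULL,  -- would reference a DimDate dimension
        Quantity   int   NOT NULL,
        UnitPrice  money NOT NULL
    );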

Fact tables run deep - millions or sometimes even billions of rows.  Dimension tables are very wide (remember de-normalization!) but not deep - lots of columns but generally maxing out at 1 or 2 million rows.

The arrangement of the fact and dimension tables is called the Star Schema.  To visualize the star, picture the fact table at the center, with lines radiating out to each dimension.  The immediate advantage of the star schema is that each fact table row is just one join away from any of its dimensions (a snowflaked dimension being an occasionally-used exception), which makes for super-fast query results.  A query asking for data on sales of Mountain Bikes in California during the last calendar year will likely touch only 3 dimension tables, each linked directly to the fact table.  But the same query in the OLTP system, because of normalization, has to traverse a much longer chain of joins (state - customer address - customer id - invoice header - invoice detail, just to isolate the California sales).
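Against the sketch schema above (extended with hypothetical customer and date dimensions), that query might look something like this - one join per dimension, straight off the fact table:

    -- Sales of Mountain Bikes in California for calendar year 2009
    SELECT SUM(f.Quantity * f.UnitPrice) AS SalesAmount
    FROM dbo.FactSales f
    JOIN dbo.DimProduct  p ON p.ProductKey  = f.ProductKey
    JOIN dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey
    JOIN dbo.DimDate     d ON d.DateKey     = f.DateKey
    WHERE p.Category      = N'Mountain Bikes'
      AND c.StateProvince = N'California'
      AND d.CalendarYear  = 2009;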

Data in the DW is not real-time.  It is loaded on a regular schedule from the OLTP system (and even from non-relational sources such as flat files and spreadsheets), often nightly, and often using an ETL tool such as SSIS, which can be scheduled to run automatically.