Wednesday, August 18, 2010

SSIS 2008 Transforms: Term Extraction and Term Lookup

SSIS 2008 includes a very good lineup of transformation task components.  In this article, we take a brief look at a pair of very interesting, related transforms that help us analyze unstructured data: Term Extraction and Term Lookup.

The Term Extraction transform
Have you ever had the need to extract the nouns and noun phrases from a text column (or a text file)?  And when writing that long program to do it, did you wish there was a simpler way?  If the answers are "yes" and "yes", then the Term Extraction transform is your friend!  This transform analyzes free-flowing text in your data source, and enables you to pull out nouns and noun phrases that are found in it, along with the frequencies with which they occur.

Advanced settings for the Term Extraction transform enable you to select just 1-word nouns ("SQL"), noun phrases ("SQL Server 2008"), or both nouns and noun phrases.  The count itself can be just a simple count of the frequency of occurrence, or computed using a formula called TFIDF, which stands for Term Frequency and Inverse Document Frequency.  The TFIDF calculation delivers a weighted result, based upon the frequency of the word or phrase, number of rows in which it occurs, and the total number of rows in the data.  Try out both approaches, to see which one best meets your needs.

The Term Lookup transform
Term Lookup also breaks up incoming text into nouns and noun phrases; however, it then compares these to a set of user-defined terms and delivers the frequency counts of only the matching ones, on a row by row basis.  So, once you have the list of terms from the earlier (Term Extraction) transform, you can review and edit that list and retain only the ones of interest, and then run the Term Lookup transform against that list to get counts of the terms matched on each row.

Since the results of this transform are displayed at row-level detail, a given row may occur multiple times in the results if it contains multiple flagged terms.  If you need to group the results by the term being matched, you'll need to use the Aggregate Transform in conjunction with the Term Lookup transform.

Using these powerful transforms to analyze unstructured data
For an actual project, it could be of great value to run the Term Extract transform initially and then once in a while, and have a domain expert or Business Analyst review the results and edit the list.  That list then is used by the Term Lookup transform, which would run regularly, perhaps weekly.  As a practical example, take the case of the Customer Support call-in system of a packaged software manufacturer.  When logging a support call, the technician will typically select the product line and application module, perhaps select a generic pre-programmed problem code from a dropdown, and then enter a few words of free-format text into a field, noting the problem as described by the customer as well as some comments etc.  Over time, the database may consist of millions of rows.  Using this pair of transforms on that free-format field, management will be able to narrow down the data to the most frequently encountered problems within each product line, thereby getting a clearer view of the areas to focus on.  Powerful results - from unstructured data!

No comments:

Post a Comment