Implement fuzzy grouping, fuzzy lookup with SSIS

A frequent problem that you may experience when you bring data from multiple sources that need to be unified and consolidated and/or needs to be deduplicated. In the case of an enterprise maintaining multiple customer or product sales information in for various business entities, e.g. retail and web sales, or possibly different CRM programs that are a result of a company merger, all of this information has to be brought under one roof within the data warehouse.

SQL Server provides two fuzzy transformations to help with such scenarios:
  • Fuzzy grouping
  • Fuzzy lookup

These transformations can be used independently or in concert to assist with unification and deduplication of your data. Both the fuzzy transformation algorithms create and utilize temporary tables created within SQL Server.

Fuzzy grouping
Fuzzy grouping is used primarily for deduplicating and standardizing values in column data. Fuzzy grouping has input parameters and this transformation algorithm adds output columns, which are included to its results.

Input parameters
Token delimiter Identifies which characters are used to break up the string into smaller units to be compared (e.g. spaces, tabs, new line, comma, etc.).
Similarity thresholds Value range is from 0 to 1 (1 being an exact match)

Output columns
_key_in Contains a unique identifier key for each row
_key_out Contains a group identifier key for duplicate rows; all duplicate rows have the same value for this column
_score This indicates the similarity to the row, which Fuzzy Grouping has identified as the canonical (correct) row. The score ranges from 0 to 1. Only the canonical row has a score value of 1.


















Fuzzy Lookup
Fuzzy Lookup is used for identity mapping to look up values in a reference table using a fuzzy match where the data may have different spellings for customer names, street names, or city.

Fuzzy lookup has input parameters and this transformation algorithm also adds output columns which are included to its results.

Input parameters
Maximum Number Of Matches To Output Per Lookup Set this value to any value >= 1
Token Delimiter Identifies which characters are used to break up the string into smaller units to be compared (e.g. spaces, tabs, new line, comma, etc.)
Similarity Thresholds Value range is from 0 to 1 (1 being an exact match)

Output columns
_Similarity This value indicates the similarity between input value and reference column value. The range is between 0 to 1 (1 being an exact match)
_Confidence This indicates the level of confidence or quality of the match between the input value and the reference column value

Comments

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Implement additive, semi-additive, and non-additive measures

Design an appropriate storage solution, including hardware, disk, and file layout