Implement fuzzy grouping, fuzzy lookup with SSIS

- December 26, 2017

A frequent problem that you may experience when you bring data from multiple sources that need to be unified and consolidated and/or needs to be deduplicated. In the case of an enterprise maintaining multiple customer or product sales information in for various business entities, e.g. retail and web sales, or possibly different CRM programs that are a result of a company merger, all of this information has to be brought under one roof within the data warehouse.

SQL Server provides two fuzzy transformations to help with such scenarios:

Fuzzy grouping
Fuzzy lookup

These transformations can be used independently or in concert to assist with unification and deduplication of your data. Both the fuzzy transformation algorithms create and utilize temporary tables created within SQL Server.

Fuzzy grouping

Fuzzy grouping is used primarily for deduplicating and standardizing values in column data. Fuzzy grouping has input parameters and this transformation algorithm adds output columns, which are included to its results.

Input parameters

Token delimiter Identifies which characters are used to break up the string into smaller units to be compared (e.g. spaces, tabs, new line, comma, etc.).

Similarity thresholds Value range is from 0 to 1 (1 being an exact match)

Output columns

_key_in Contains a unique identifier key for each row

_key_out Contains a group identifier key for duplicate rows; all duplicate rows have the same value for this column

_score This indicates the similarity to the row, which Fuzzy Grouping has identified as the canonical (correct) row. The score ranges from 0 to 1. Only the canonical row has a score value of 1.

Fuzzy Lookup

Fuzzy Lookup is used for identity mapping to look up values in a reference table using a fuzzy match where the data may have different spellings for customer names, street names, or city.

Fuzzy lookup has input parameters and this transformation algorithm also adds output columns which are included to its results.

Input parameters

Maximum Number Of Matches To Output Per Lookup Set this value to any value >= 1

Token Delimiter Identifies which characters are used to break up the string into smaller units to be compared (e.g. spaces, tabs, new line, comma, etc.)

Similarity Thresholds Value range is from 0 to 1 (1 being an exact match)

Output columns

_Similarity This value indicates the similarity between input value and reference column value. The range is between 0 to 1 (1 being an exact match)

_Confidence This indicates the level of confidence or quality of the match between the input value and the reference column value

Search This Blog

iBI

Implement fuzzy grouping, fuzzy lookup with SSIS

Comments

Post a Comment

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Use the Master Data Services Configuration Manager

Design and implement ETL control flow elements, including containers, tasks, and precedence constraints