Determine appropriate scenarios for Transact-SQL joins versus SSIS lookup

Some common design patterns for data warehouse loads invlove extracting data from one or more sources and performing business and dimension key column lookups and joins. In most instances, SSIS packages can be optimized to perform these lookups and joins efficiently by tuning package control and data flow properties. In addition, it might be necessary to ensure that database object maintenance activities are done in a regular basis including, index defragmentation and statistics update.

The SSIS Lookup transform comes very handy when doing lookup of dimension keys. For example, when loading a fact table with customer transactions, we lookup the customer’s business key against the business keys stored in the customer dimension. If we find a match, we retrieve the customer’s surrogate key and insert it in the fact table along with the corresponding transactions.

In some cases, the best alternative to overcome SSIS Lookup limitations is to use T-SQL joins. Joins peformed using T-SQL can perform much better in those instances in which the SSIS server cannot fit an entire table in its buffer. The SQL Query Optitmizer can make use of indexes and statistics to produce a more efficient query plan to process large datasets in less time than it would take SSIS to cache all the data. Partition elimination can further improve large query performance.

In addition to lookups and joins, merge operations can scale better using T-SQL than the SSIS Merge, Merge Join or SCD transformations. This is true when staging large volumes of data that need to be merged to identify what needs to be inserted, updated or deleted. Using the SSIS Merge, Merge Join or SCD transformations means that the data needs to be brought into the SSIS server’s memory buffer to apply merge and join operations. This can be very costly when processing large volumes of data or when more complex business rules need to be applied.

In some scenarios in which you are joining and merging data from sources other than SQL Server, SSIS may be the only viable solution if the data is not being staged in a SQL Server database or not being staged at all.

Other scenarios that may influence the decision when to use SSIS over T-SQL may include:
  • Complexity of business rules and data transformations
  • Number of lookups
  • Developer skillsets
  • Organization standards and regulations

Comments

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Implement additive, semi-additive, and non-additive measures

Implement an ETL solution that supports incremental data extraction