Implement an ETL solution that supports incremental data extraction

Design fact table patterns

Fact tables used in data warehouses can become the very large, very fast by order of magnitude. It’s important to implement best practices for fact table design at the onset, even if your fact table does not contain billions of rows. Here are some things to be aware and include in your fact table design:
  • Consider partitioning your fact tables when they are larger than 100 GB. The partitions can be by any column, which provides a fairly equal distribution of your data. Usually the Date columns (not DateTime columns) are good candidates.
    • This provides more granularity when creating indexes, loading, updating, and deleting data (by partitions)
    • Faster data access for queries by limiting the IO to specific partitions
    • Utilize partition switching to move data in and out almost instantaneously
    • Implement sliding window using partition splitting and partition merging
  • Build clustered indexes on the Date Key of the fact tables (not DataTime)
    • This allows for more efficient queries when accessing historical data and loading SSAS cubes.
    • When working within a maintenance window the following settings for the fact table clustered index will speed up the scan operations during query times by minimizing granular-level locking on the table:
      • ALLOW_ROW_LOCKS = OFF
      • ALLOW_PAGE_LOCKS = OFF
  • Build nonclustered indexes on each of the fact table foreign keys to allow for faster selection of records when joining again the corresponding dimension tables.
Once all of the data has been initially populated into the DW, probably over the weekend,
the incremental loads are set up as a repeating job that will run nightly, or throughout the day, depending on the business requirement for refreshing the data warehouse.

The incremental loading can be done using SSIS with two different methods:

CDC Control Task This is the Change Data Capture component within SSIS
Execute SQL Task This can be implemented via the T-SQL MERGE statement

Enable Change Data Capture
Change Data Capture (CDC) is a SQL Server built-in audit-tracking system that can automatically track and record any change activity to a table: INSERT, UPDATE, DELETE, which can be used for the purpose of incremental loading. In order to use CDC, someone with sysadmin level permissions must enable CDC for the database, which contains the tables that need to be tracked. Once it is enabled on the database level, someone with at least dbo_owner level permissions can enable CDC on the specific table that will be tracked for changes.

Create a SQL MERGE statement
Alternatively, if you do have a way to keep track of which records have changed at the source system, you can implement the simple, yet powerful T-SQL MERGE command. The MERGE command allows your table to be scanned once and checked for all changes: inserts, updates, and deletes. And you can choose if you want to affect changes for any one or all of these change types into your target table.



















In this example the StudentSource and StudentTarget tables are matched on ID, which is their key column. The first clause, WHEN MATCHED performs and update when the IDs match. The next clause, WHEN NOT MATCHED BY TARGET THEN, checks the TARGET table to see if any IDs don’t exist in the StudentTarget and inserts the new records, but in this case, there are no new ID values to insert. The last clause, WHEN NOT MATCHED BY SOURCE THEN, checks the SOURCE table to see if any IDs have been removed, and deletes those records form StudentTarget. The MERGE statement can check either the Source or Target for MATCHED or NOT MATCHED, but the changes are only made to the target table.

Comments

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Implement additive, semi-additive, and non-additive measures