Implement an ETL solution that supports incremental data loading
Design a control flow to load change data
In creating an incremental load Data Flow, you can create a dynamic query within the OLE DB Source Editor under Connection Manager page, go to the Data Access Mode drop-down and select SQL Command. This opens up a free text pane below labeled SQL Command Text. Here we can enter the query against the source table with a variable parameter in the WHERE clause. The parameter is identified by the “?” in the SQL text. “?” is the syntax used by the SSIS expression language for a parameter.
With the SQL query in place, click the Parameters button to map the SQL query parameter to the SSIS variable @LastModifiedDate, which will pass in the value of MAX(ModifiedDate) from the fact table. Make sure the Param Direction is set to Input because the SSIS variable value is being passed in to be used by the query.
Load data by using Transact-SQL Change Data Capture functions
In order to use Change Data Capture for automatic incremental load processing you need to make sure certain items are in place and specific tasks have been done. As mentioned earlier, CDC requires SQL Agent to function, if SQL Agent has been turned off or fails for any reason, the CDC process will be halted.
CDC needs to be enabled on the database level before it can be used. CDC also needs to be enabled on each table to be tracked. You can configure individual columns to be tracked, but by default all columns are select for tracking. CDC tracks the following changes:
- Rows inserted
- Rows updated (records column values before and after update)
- Rows deleted (records column values before delete)
- When the operation occurred
CDC works by creating a new CDC schema and new physical tables for each tracked table within the System Tables folder. These are the tables that contain the changes; there is a single row created for each INSERT and DELETE operation—and two rows created for each UPDATE operation: one row shows how the data looked before the update, and the second row shows the data after the UPDATE.
CDC is asynchronous, which means when a table data is changing, there is no contention with CDC, nor does it slow any transactions down. That is because CDC detects changes by reading the transaction log file, not the source tables. This is managed by the SQL Server Agent jobs. One of the jobs read the logs and records the changes, which means that when you query for changes, there may be a delay between when the data has changed and when it is available in CDC. Due to this delay, and depending on the speed of changes to your source data, you may not see all of the changes until the next time the CDC SQL Agent job runs. There is another SQL Server Agent job that removes the rows from the CDC tables that are older than the configured retention period. The default retention period is three days, which can be customized based on business retention requirements.
Comments
Post a Comment