Posts

Implement an ETL solution that supports incremental data loading

Image
Design a control flow to load change data In creating an incremental load Data Flow, you can create a dynamic query within the OLE DB  Source Editor under Connection Manager page, go to the Data Access Mode drop-down and  select SQL Command. This opens up a free text pane below labeled SQL Command Text. Here  we can enter the query against the source table with a variable parameter in the WHERE clause.  The parameter is identified by the “?” in the SQL text. “?” is the syntax used by the SSIS expression  language for a parameter. With the SQL query in place, click the Parameters button to map the SQL query parameter  to the SSIS variable @LastModifiedDate, which will pass in the value of MAX(ModifiedDate)  from the fact table. Make sure the Param Direction is set to Input because the SSIS  variable value is being passed in to be used by the query. Load data by using Transa...

Implement an ETL solution that supports incremental data extraction

Image
Design fact table patterns Fact tables used in data warehouses can become the very large, very fast by order of magnitude.  It’s important to implement best practices for fact table design at the onset, even if your  fact table does not contain billions of rows. Here are some things to be aware and include in  your fact table design: Consider partitioning your fact tables when they are larger than 100 GB. The partitions can be by any column, which provides a fairly equal distribution of your data. Usually the Date columns (not DateTime columns) are good candidates. This provides more granularity when creating indexes, loading, updating, and deleting data (by partitions) Faster data access for queries by limiting the IO to specific partitions Utilize partition switching to move data in and out almost instantaneously Implement sliding window using partition splitting and partition merging Build clustered indexes on the Date Key of the fact tables (not...

Determine appropriate scenarios for Transact-SQL joins versus SSIS lookup

Some common design patterns for data warehouse loads invlove extracting data from one or  more sources and performing business and dimension key column lookups and joins. In most  instances, SSIS packages can be optimized to perform these lookups and joins efficiently by  tuning package control and data flow properties. In addition, it might be necessary to ensure  that database object maintenance activities are done in a regular basis including, index defragmentation  and statistics update. The SSIS Lookup transform comes very handy when doing lookup of dimension keys. For  example, when loading a fact table with customer transactions, we lookup the customer’s  business key against the business keys stored in the customer dimension. If we find a match, we  retrieve the customer’s surrogate key and insert it in the fact table along with the corresponding  transactions. In some cases, the best alternative to overcome SSIS Lookup limit...

Non-blocking, semi-blocking, and blocking transforms in SSIS

Not all Data Flow transforms are created equal. In fact, some are much faster and require a  lot less memory (non-blocking transforms). Others are slower while consuming much more  memory (blocking transforms), and some are in between (semi-blocking transforms). We don’t  always have a choice if we need to use a certain transform to do a certain specified task, but  it is good to know the differences between these types of transforms so that we can better  understand performance and resource utilization issues. Synchronous and asynchronous There is another related aspect about Data Flow transforms, which is their ability to quickly  process a row as they are coming into the transform, independently of any other rows that  came before or after (synchronous transforms). The other type of transform needs to be  dependent on some or all of the rows that come before and after (asynchronous transforms). On the whole, non-blocking transforms a...

Implement fuzzy grouping, fuzzy lookup with SSIS

Image
A frequent problem that you may experience when you bring data from multiple sources  that need to be unified and consolidated and/or needs to be deduplicated. In the case of an  enterprise maintaining multiple customer or product sales information in for various business  entities, e.g. retail and web sales, or possibly different CRM programs that are a result of  a company merger, all of this information has to be brought under one roof within the data  warehouse. SQL Server provides two fuzzy transformations to help with such scenarios: Fuzzy grouping Fuzzy lookup These transformations can be used independently or in concert to assist with unification  and deduplication of your data. Both the fuzzy transformation algorithms create and utilize  temporary tables created within SQL Server. Fuzzy grouping Fuzzy grouping is used primarily for deduplicating and standardizing values in column data. Fuzzy grouping has input parameters a...

Implement slowly changing dimension with SSIS

Image
To implement Slowly Chaning Dimensions using the SSIS  SCD components, followe these steps. 1. In your Control Tab, add a Data Flow task 2. Edit the Data Flow task and add an OLE DB Source 3. Double-click the OLE DB Source and connect to your source table, either in the OLTP     environment, or a staging table within the DW 4. Add a Slowly Changing Transform and connect it to the precedence constraint from the     OLDE DB Source 5. Next, Double-click the Slowly Changing Dimension transform to start the Slowly Changing  Dimension Wizard, and click Next 6. On the Select A Dimension Table And Keys window, select Connection Manager to              connect  to the destination DW 7. Select Table Or View for the destination table 8. Map the Input Columns to the Dimension Columns 9. Make sure that only one column is marked as “Business Key”, which is the key from the ...

Implement data profiling, parallelism, transactions, logging, and security

Data profiling Allows you to evaluate the content, structure, and quality of the source dataset.  It cannot identify inaccurate or bad data, just metrics about the existing data to see how useful  or difficult it will be to use the data. This includes the frequency of the values, the datatypes,  NULLs, and length of data in the columns. SSIS has a Data Profiler Task that can assist with profiling your data. This task saves the  results of the data analysis in an XML file that can be accessed by another program, Data Profile  Viewer. Here are the types of information you can glean from it: Column Length Distribution Profile Column Null Ratio Profile Column Pattern Profile Column Statistics Profile Column Value Distribution Profile Candidate Key Profile Functional Dependency Profile Value Inclusion Profile What is parallelism in SSIS? Parallelism is simply running multiple SSIS tasks concurrently (together in parallel). So when  you...