Posts

Implement an ETL solution that supports incremental data loading

Design a control flow to load change data

In creating an incremental-load Data Flow, you can create a dynamic query within the OLE DB Source Editor: on the Connection Manager page, go to the Data Access Mode drop-down and select SQL Command. This opens a free-text pane below labeled SQL Command Text. Here we can enter the query against the source table with a variable parameter in the WHERE clause. The parameter is identified by the "?" in the SQL text; "?" is the placeholder syntax an OLE DB source query uses for a parameter.

With the SQL query in place, click the Parameters button to map the SQL query parameter to the SSIS variable @LastModifiedDate, which will pass in the value of MAX(ModifiedDate) from the fact table. Make sure the Param Direction is set to Input, because the SSIS variable value is being passed in to be used by the query. (Both queries, and a CDC alternative, are sketched below.)

Load data by using Transact-SQL Change Data Capture functions
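To make the moving parts concrete, here is a minimal sketch of the three queries involved. The table and column names (dbo.FactSales, Sales.SalesOrderHeader, the dbo_SalesOrderHeader capture instance) are illustrative assumptions, not from the post:

```sql
-- 1) Query behind an Execute SQL Task: seeds @[User::LastModifiedDate]
--    with the high-water mark already loaded into the fact table.
SELECT MAX(ModifiedDate) AS LastModifiedDate
FROM dbo.FactSales;

-- 2) SQL Command text in the OLE DB Source; "?" is mapped to
--    @[User::LastModifiedDate] on the Parameters page (Input direction).
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue, ModifiedDate
FROM Sales.SalesOrderHeader
WHERE ModifiedDate > ?;

-- 3) Alternative: read only changed rows with the CDC table-valued function
--    (assumes CDC is enabled with capture instance dbo_SalesOrderHeader).
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_SalesOrderHeader');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_SalesOrderHeader(@from_lsn, @to_lsn, N'all');
```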

Implement an ETL solution that supports incremental data extraction

Design fact table patterns

Fact tables used in data warehouses can become very large, very fast, by orders of magnitude. It's important to implement best practices for fact table design at the outset, even if your fact table does not contain billions of rows. Here are some things to be aware of and include in your fact table design:

- Consider partitioning your fact tables when they are larger than 100 GB. Partition by any column that provides a fairly equal distribution of your data; usually Date columns (not DateTime columns) are good candidates. Partitioning (sketched after this list):
  - Provides more granularity when creating indexes and when loading, updating, and deleting data (by partition)
  - Gives faster data access for queries by limiting the IO to specific partitions
  - Lets you utilize partition switching to move data in and out almost instantaneously
  - Lets you implement a sliding window using partition splitting and partition merging
- Build clustered indexes on the Date Key of the fact tables (not the DateTime). This allows…
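A minimal sketch of these patterns, assuming a hypothetical dbo.FactSales with monthly partitions on a date key:

```sql
-- Monthly partitions on a DATE key (names are illustrative assumptions).
CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    OrderDateKey date  NOT NULL,
    CustomerKey  int   NOT NULL,
    TotalDue     money NOT NULL
) ON psOrderDate (OrderDateKey);

-- Clustered index on the Date Key, aligned to the partition scheme.
CREATE CLUSTERED INDEX cixFactSales_OrderDateKey
ON dbo.FactSales (OrderDateKey) ON psOrderDate (OrderDateKey);

-- Sliding window: add a new boundary (split) and retire the oldest (merge).
ALTER PARTITION SCHEME psOrderDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfOrderDate() SPLIT RANGE ('2024-04-01');
ALTER PARTITION FUNCTION pfOrderDate() MERGE RANGE ('2024-01-01');
```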

Determine appropriate scenarios for Transact-SQL joins versus SSIS lookup

Some common design patterns for data warehouse loads involve extracting data from one or more sources and performing business and dimension key column lookups and joins. In most instances, SSIS packages can be optimized to perform these lookups and joins efficiently by tuning package control flow and data flow properties. In addition, it might be necessary to ensure that database object maintenance activities are done on a regular basis, including index defragmentation and statistics updates.

The SSIS Lookup transform comes in very handy when looking up dimension keys. For example, when loading a fact table with customer transactions, we look up the customer's business key against the business keys stored in the customer dimension. If we find a match, we retrieve the customer's surrogate key and insert it into the fact table along with the corresponding transactions.

In some cases, the best alternative to overcome SSIS Lookup limitations is to use T-SQL joins (sketched below). Joins perform…
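A minimal sketch of the join alternative, assuming hypothetical staging and dimension tables:

```sql
-- T-SQL alternative to the SSIS Lookup (names are illustrative assumptions):
-- resolve each transaction's customer business key to the dimension's
-- surrogate key, then insert into the fact table.
INSERT INTO dbo.FactSales (OrderDateKey, CustomerKey, TotalDue)
SELECT CAST(s.OrderDate AS date),
       d.CustomerKey,                                 -- surrogate key
       s.TotalDue
FROM staging.SalesTransactions AS s
JOIN dbo.DimCustomer AS d
  ON d.CustomerBusinessKey = s.CustomerBusinessKey;   -- business key match
```

Because the join runs inside the database engine, it can take advantage of existing indexes and up-to-date statistics, which is one reason the maintenance activities mentioned above matter.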

Non-blocking, semi-blocking, and blocking transforms in SSIS

Not all Data Flow transforms are created equal. In fact, some are much faster and require a lot less memory (non-blocking transforms). Others are slower while consuming much more memory (blocking transforms), and some are in between (semi-blocking transforms). We don't always have a choice if we need to use a certain transform to do a specified task, but it is good to know the differences between these types of transforms so that we can better understand performance and resource-utilization issues.

Synchronous and asynchronous

There is another related aspect of Data Flow transforms: their ability to quickly process rows as they come into the transform, independently of any other rows that came before or after (synchronous transforms). The other type of transform depends on some or all of the rows that come before or after (asynchronous transforms). On the whole, non-blocking transforms are also synchronous, and thus are fast and light on memory…

Implement fuzzy grouping, fuzzy lookup with SSIS

A frequent problem you may experience is bringing in data from multiple sources that needs to be unified and consolidated, and/or deduplicated. Consider an enterprise maintaining customer or product sales information for various business entities, e.g. retail and web sales, or for different CRM programs that are the result of a company merger: all of this information has to be brought under one roof within the data warehouse. SQL Server provides two fuzzy transformations to help with such scenarios:

- Fuzzy grouping
- Fuzzy lookup

These transformations can be used independently or in concert to assist with the unification and deduplication of your data. Both fuzzy transformation algorithms create and utilize temporary tables within SQL Server.

Fuzzy grouping

Fuzzy grouping is used primarily for deduplicating and standardizing values in column data (a rough T-SQL analogue follows below). Fuzzy grouping has input parameters, and this transformation algorithm…
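The fuzzy algorithms themselves are only exposed through the SSIS transforms, but a crude stand-in using the built-in DIFFERENCE function (which scores SOUNDEX phonetic similarity from 0 to 4) gives the flavor of finding near-duplicates; table and column names are assumptions:

```sql
-- Crude near-duplicate detection with built-in T-SQL functions; this is
-- NOT the SSIS fuzzy-grouping algorithm, just an intuition-builder.
SELECT a.CustomerID,
       a.CustomerName,
       b.CustomerID   AS PossibleDuplicateID,
       b.CustomerName AS PossibleDuplicateName
FROM staging.Customers AS a
JOIN staging.Customers AS b
  ON a.CustomerID < b.CustomerID                        -- avoid self/mirror pairs
 AND DIFFERENCE(a.CustomerName, b.CustomerName) = 4;    -- strongest phonetic match
```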

Implement slowly changing dimension with SSIS

To implement Slowly Changing Dimensions using the SSIS SCD components, follow these steps (a T-SQL sketch of the Type 2 behavior follows the list):

1. In your Control Flow tab, add a Data Flow task.
2. Edit the Data Flow task and add an OLE DB Source.
3. Double-click the OLE DB Source and connect to your source table, either in the OLTP environment or a staging table within the DW.
4. Add a Slowly Changing Dimension transform and connect it to the data flow path from the OLE DB Source.
5. Double-click the Slowly Changing Dimension transform to start the Slowly Changing Dimension Wizard, and click Next.
6. On the Select A Dimension Table And Keys window, select a Connection Manager to connect to the destination DW.
7. Select the Table Or View for the destination table.
8. Map the Input Columns to the Dimension Columns.
9. Make sure that only one column is marked as "Business Key", which is the key from the source table.
10. Ensure all of the other Key Type columns are marked Not A Key Column.
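The wizard generates the data flow for you, but the Type 2 (Historical Attribute) behavior it implements amounts to roughly the following T-SQL, sketched here with assumed names (dbo.DimCustomer, staging.Customers, with City and Phone as the historical attributes):

```sql
-- Stage the business keys whose historical attributes changed.
SELECT s.CustomerBusinessKey
INTO   #changed
FROM   staging.Customers AS s
JOIN   dbo.DimCustomer   AS d
  ON   d.CustomerBusinessKey = s.CustomerBusinessKey
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Phone <> d.Phone);

-- Expire the current versions of those members...
UPDATE d
SET    d.EndDate   = SYSDATETIME(),
       d.IsCurrent = 0
FROM   dbo.DimCustomer AS d
JOIN   #changed AS c
  ON   c.CustomerBusinessKey = d.CustomerBusinessKey
WHERE  d.IsCurrent = 1;

-- ...and insert the new versions as the current rows.
INSERT INTO dbo.DimCustomer
       (CustomerBusinessKey, City, Phone, StartDate, EndDate, IsCurrent)
SELECT  s.CustomerBusinessKey, s.City, s.Phone, SYSDATETIME(), NULL, 1
FROM    staging.Customers AS s
JOIN    #changed AS c
  ON    c.CustomerBusinessKey = s.CustomerBusinessKey;
```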

Implement data profiling, parallelism, transactions, logging, and security

Data profiling

Data profiling allows you to evaluate the content, structure, and quality of the source dataset. It cannot identify inaccurate or bad data; it just provides metrics about the existing data so you can see how useful or difficult it will be to use. This includes the frequency of the values, the data types, NULLs, and the length of data in the columns.

SSIS has a Data Profiling Task that can assist with profiling your data. This task saves the results of the data analysis in an XML file that can be accessed by another program, the Data Profile Viewer. Here are the types of information you can glean from it (two of these are approximated in T-SQL below):

- Column Length Distribution Profile
- Column Null Ratio Profile
- Column Pattern Profile
- Column Statistics Profile
- Column Value Distribution Profile
- Candidate Key Profile
- Functional Dependency Profile
- Value Inclusion Profile

What is parallelism in SSIS?

Parallelism is simply running multiple SSIS tasks concurrently (together, in parallel). So when you have multiple tables to be loaded…
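As noted above, a couple of those profiles can be approximated in plain T-SQL if you just want quick numbers; the real task writes richer results to XML for the Data Profile Viewer. Table and column names here are assumptions:

```sql
-- Rough T-SQL stand-ins for two Data Profiling Task profiles.

-- Column Null Ratio Profile for one column.
SELECT COUNT(*)                                        AS TotalRows,
       SUM(CASE WHEN Phone IS NULL THEN 1 ELSE 0 END) AS NullRows,
       1.0 * SUM(CASE WHEN Phone IS NULL THEN 1 ELSE 0 END)
           / NULLIF(COUNT(*), 0)                       AS NullRatio
FROM   staging.Customers;

-- Column Length Distribution Profile for one column.
SELECT LEN(Phone) AS ValueLength,
       COUNT(*)   AS Occurrences
FROM   staging.Customers
WHERE  Phone IS NOT NULL
GROUP  BY LEN(Phone)
ORDER  BY ValueLength;
```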