
Showing posts with the label ETL

Clean data by using the SSIS DQS task

SQL Server Integration Services (SSIS) includes a DQS Cleansing transformation to correct data from a connected data source. The transformation's Record Status output column is populated with one of five statuses: Correct, Corrected, Auto Suggest, New, or Invalid. You can use a Conditional Split transformation to separate the output records based on this status, apply different business logic to each stream, or insert the records into a table or file destination for manual processing. To clean data using the SSIS DQS task, follow these steps:

1. Open a new SSIS package in Visual Studio.
2. Create a new Data Flow Task.
3. Add a Source Component to extract the data to be cleansed.
4. Add a DQS Cleansing transformation and connect it to the Source Component output.
5. Edit the DQS Cleansing transformation and set the Data Quality connection manager and the Data Quality Knowledge Base to use for the cleansing activity.
6. Click the Mapping tab, map the Input Columns to the corresponding Knowledge Base Domains, then click OK.
7. Add a Conditional Split transformation to route the output rows by their Record Status (a T-SQL equivalent of that routing is sketched below).
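If you land the cleansed output in a staging table first, the same status-based routing can be done in the database. The following is a minimal sketch, assuming a hypothetical staging table Stage.CustomerCleansed whose Record_Status column holds the DQS status values:

    -- Rows DQS accepted or fixed go straight to the clean table.
    INSERT INTO dbo.CustomerClean (CustomerID, City)
    SELECT CustomerID, City
    FROM   Stage.CustomerCleansed
    WHERE  Record_Status IN ('Correct', 'Corrected');

    -- Rows DQS could not resolve go to a review table for manual processing.
    INSERT INTO dbo.CustomerReview (CustomerID, City, Record_Status)
    SELECT CustomerID, City, Record_Status
    FROM   Stage.CustomerCleansed
    WHERE  Record_Status IN ('Auto Suggest', 'New', 'Invalid');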

Deploy SSIS packages by using the deployment utility, SQL Server, and file systems

For project deployment, you must ensure that all the related objects needed for package execution are also part of the deployed project. For example, when a Connection Manager is created, it is specific to the package within which it was created, unless it is promoted to a project-level Connection Manager by right-clicking it and selecting Convert To Project Connection. Similarly, any parameter that you want available at the project level, such as ServerName, DatabaseName, or FolderName, can be created at the project level so it only needs to be changed in one place and is available to all the packages in the project. Once the project has been deployed to a different server, for example to the Development, QA, or Production server, the parameter value only needs to be changed in one place. The Project.params tab shows the project parameters, and under Connection Managers you see the project-level connections.
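Project parameters can also be set after deployment with the SSISDB catalog stored procedures, so promoting a project between environments becomes a one-line change. A sketch with hypothetical folder, project, and parameter names:

    -- Set a project-level parameter once; every package in the
    -- project picks up the new value at its next execution.
    EXEC SSISDB.catalog.set_object_parameter_value
        @object_type     = 20,            -- 20 = project-level parameter
        @folder_name     = N'ETL',
        @project_name    = N'DataWarehouseLoad',
        @parameter_name  = N'ServerName',
        @parameter_value = N'PRODSQL01';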

Create an SSIS catalog

SQL Server 2012 introduced a new paradigm for storing and deploying SSIS packages. Before that release, all packages were individually managed and deployed. The SSISDB catalog and its backing SSISDB database serve as a centralized place where deployed SSIS packages can be stored along with everything related to them, including:

- Project and package parameters
- Configured runtime values for packages
- Execution and troubleshooting of packages
- Management of Integration Services server operations

You find the SSISDB catalog in the folder labeled Integration Services Catalogs. This is where you find the Projects folder containing each project with all its packages, and the Environments folder containing the package configuration and environment values, which can be set before executing the packages. Each SQL Server instance gets only one SSISDB. You also find another SSISDB in the Databases folder in SQL Server (Figure 2-113). This is essentially a regular user database that stores the catalog's projects, package versions, and execution history.
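The catalog itself is created once per instance from SSMS (right-click Integration Services Catalogs > Create Catalog), but the folders and environments inside it can be scripted with the catalog stored procedures. A minimal sketch with hypothetical names:

    -- Create a folder in the SSISDB catalog and an environment inside it.
    EXEC SSISDB.catalog.create_folder
        @folder_name = N'ETL';

    EXEC SSISDB.catalog.create_environment
        @folder_name      = N'ETL',
        @environment_name = N'Production';

    -- Verify what the catalog now contains.
    SELECT name FROM SSISDB.catalog.folders;
    SELECT name FROM SSISDB.catalog.environments;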

Error handling at package level in SSIS

There are many causes of errors and failures when a package is running, whether in Debug Mode or in Production. One of the best and simplest ways to manage error handling is the Event Handlers tab in the SSIS Visual Studio design environment. In the Event Handler drop-down list, you can see all of the event types that can be triggered to fire off a notification or an email, or to insert the error details into a file or a database table. From this tab, you can choose any (or all) of the components within your whole package to contain an event handler. Although each component can have its own custom event handler, the most common error handling method is to put the error handling at the highest level, the package, so that all of the errors float up and can be processed from one centralized location. In every other respect, you create Event Handlers just like any package, with Control Flows and Data Flows. In Figure 2-110, we have added an OnError event handler at the package level.
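A common pattern inside an OnError event handler is an Execute SQL Task that writes the error details to an audit table. The following sketch assumes a hypothetical dbo.PackageErrorLog table; the ? placeholders are mapped on the task's Parameter Mapping page to the SSIS system variables System::PackageName, System::SourceName, System::ErrorCode, and System::ErrorDescription:

    -- Hypothetical audit table for package errors.
    CREATE TABLE dbo.PackageErrorLog (
        LogID            INT IDENTITY(1,1) PRIMARY KEY,
        PackageName      NVARCHAR(260),
        SourceName       NVARCHAR(260),
        ErrorCode        INT,
        ErrorDescription NVARCHAR(MAX),
        LoggedAt         DATETIME2 DEFAULT SYSDATETIME()
    );

    -- SQLStatement for the Execute SQL Task in the OnError handler.
    INSERT INTO dbo.PackageErrorLog
        (PackageName, SourceName, ErrorCode, ErrorDescription)
    VALUES (?, ?, ?, ?);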

Implement error handling for data types in SSIS

When data is being moved from a source to a destination, you may experience different types of issues related to data type errors and column data truncation. When column data coming from the source is bigger than the target column it is going into, SSIS fails the package. But the Error Output page of Data Flow components provides a way to tell SSIS how to handle errors or column truncation for each individual column. The three options are:

- Ignore Failure: Continues processing on an error or truncation and disregards the problem.
- Redirect Row: Sends the offending row out to the error output.
- Fail Component: Fails the task when an error or truncation occurs.

If you want to send the error or truncated records to an error file or table, change the setting to Redirect Row. In this way, you can set up an error output destination. In the case of column truncation, you have the additional option of handling truncation independently of errors for each column.
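When rows are redirected, SSIS appends ErrorCode and ErrorColumn columns to the error output. A hypothetical landing table for redirected rows might look like this:

    -- Landing table for rows redirected by the error output.
    -- Source columns are kept wide and NULLable so the bad rows always fit.
    CREATE TABLE dbo.CustomerLoadErrors (
        CustomerID  NVARCHAR(50)   NULL,
        City        NVARCHAR(4000) NULL,  -- wide enough for truncated data
        ErrorCode   INT            NULL,  -- added by SSIS on the error output
        ErrorColumn INT            NULL,  -- lineage ID of the failing column
        LoadedAt    DATETIME2 DEFAULT SYSDATETIME()
    );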

Enable logging for package execution in SSIS

Logging can be enabled and used at design time and at runtime. To begin logging, in the Visual Studio menu select SSIS > Logging to open the Configure SSIS Logs window. In the Containers section, select the Control Flow objects for logging: the overall package itself and/or each of the individual tasks within the package. Then in the Providers And Logs tab, under Add A New Log > Provider Type, select where you want the logging to be saved and click the Add button. After adding each Provider Type you want to log to, go to the Select The Logs To Use For The Container section to set up the Connection in the Configuration column. In the case of a text file, select a file to write to; for a SQL Server database as the target, simply select the target database. You are not limited to a single target location for the logging; multiple destinations can be added, and they will all be written to concurrently.
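When the SQL Server log provider is used, SSIS writes the events to a dbo.sysssislog table in the database you selected, so the log can be queried directly, for example:

    -- Recent errors and warnings captured by the SQL Server log provider.
    SELECT  source,     -- task or package that raised the event
            event,      -- OnError, OnWarning, OnPostExecute, ...
            starttime,
            message
    FROM    dbo.sysssislog
    WHERE   event IN ('OnError', 'OnWarning')
    ORDER BY starttime DESC;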

Fix performance, connectivity, execution, and failed logic issues by using the debugger in SSIS

Checking your Execution Results/Progress tab

When you are testing your packages and troubleshooting issues, the details of the execution results can be found in the last tab of the package design environment. A point of possible confusion is that SSIS changes the name of this tab depending on whether the package is executing or the execution has been stopped. The last tab is labeled as follows:

- Execution Results tab: When the package is not executing, you can see the results of the last run.
- Progress tab: When a package execution is in progress, you can see the real-time execution results. Once the package execution has been stopped, you can still see these results under the Execution Results tab.

Package execution reports in Management Studio

Once you have a package executing successfully, you want to monitor and track its progress. SQL Server Management Studio (SSMS) provides a very detailed, color-coded package execution report. This report can be accessed by right-clicking the SSISDB node under Integration Services Catalogs and selecting Reports > Standard Reports > All Executions.
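The data behind those reports is exposed through SSISDB catalog views, so execution history can also be queried directly:

    -- Most recent package executions; status 4 = failed, 7 = succeeded.
    SELECT TOP (20)
            execution_id, folder_name, project_name, package_name,
            status, start_time, end_time
    FROM    SSISDB.catalog.executions
    ORDER BY start_time DESC;

    -- Error messages for one execution (the id below is hypothetical).
    SELECT  message_time, message
    FROM    SSISDB.catalog.event_messages
    WHERE   operation_id = 12345        -- execution_id from the query above
      AND   message_type = 120;         -- 120 = error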

Implement an ETL solution that supports incremental data loading

Design a control flow to load change data

In creating an incremental-load Data Flow, you can create a dynamic query within the OLE DB Source Editor: on the Connection Manager page, go to the Data Access Mode drop-down and select SQL Command. This opens a free-text pane below, labeled SQL Command Text. Here you can enter the query against the source table with a variable parameter in the WHERE clause. The parameter is identified by the "?" in the SQL text; "?" is the parameter placeholder syntax used by the OLE DB provider. With the SQL query in place, click the Parameters button to map the SQL query parameter to the SSIS variable @LastModifiedDate, which passes in the value of MAX(ModifiedDate) from the fact table. Make sure the Param Direction is set to Input, because the SSIS variable value is being passed in to be used by the query.

Load data by using Transact-SQL Change Data Capture functions
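Both techniques boil down to short T-SQL. A sketch, assuming a hypothetical dbo.Customer source table, with CDC enabled using the default capture instance dbo_Customer and net changes support:

    -- 1) Parameterized source query for the OLE DB Source; the "?" is
    --    mapped to the SSIS variable User::LastModifiedDate.
    SELECT CustomerID, City, ModifiedDate
    FROM   dbo.Customer
    WHERE  ModifiedDate > ?;

    -- 2) Extracting changes with the CDC functions instead.
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    -- Net changes: one row per key, reflecting its final state in the range.
    SELECT  __$operation,   -- 1 = delete, 2 = insert, 4 = update
            CustomerID, City
    FROM    cdc.fn_cdc_get_net_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');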

Implement an ETL solution that supports incremental data extraction

Design fact table patterns

Fact tables used in data warehouses can become very large very fast, by orders of magnitude. It is important to implement best practices for fact table design at the outset, even if your fact table does not contain billions of rows. Here are some things to be aware of and include in your fact table design:

- Consider partitioning your fact tables when they are larger than 100 GB. The partitions can be on any column that provides a fairly equal distribution of your data; usually Date columns (not DateTime columns) are good candidates. Partitioning provides:
  - More granularity when creating indexes and when loading, updating, and deleting data (by partition)
  - Faster data access for queries, by limiting the IO to specific partitions
  - Partition switching to move data in and out almost instantaneously (see the sketch after this list)
  - Sliding windows implemented with partition splitting and partition merging
- Build clustered indexes on the Date Key of the fact tables (not DateTime). This allows the index to align with the date-based partitions and speeds up the date-range queries that are typical for fact tables.
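Partition switching is a metadata-only operation, which is what makes it nearly instantaneous. A sketch with hypothetical table names, partitioning monthly on an integer date key:

    -- Monthly partitions on an integer date key (YYYYMMDD).
    CREATE PARTITION FUNCTION pfOrderDate (INT)
        AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

    CREATE PARTITION SCHEME psOrderDate
        AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactOrders (
        OrderDateKey INT   NOT NULL,
        CustomerKey  INT   NOT NULL,
        Amount       MONEY NOT NULL
    ) ON psOrderDate (OrderDateKey);

    -- A staging table with an identical structure, on the same filegroup
    -- and with a CHECK constraint restricting OrderDateKey to the target
    -- partition's range, can be switched in as a metadata-only change.
    ALTER TABLE dbo.FactOrders_Stage
        SWITCH TO dbo.FactOrders PARTITION 2;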

Determine appropriate scenarios for Transact-SQL joins versus SSIS lookup

Some common design patterns for data warehouse loads involve extracting data from one or more sources and performing business and dimension key column lookups and joins. In most instances, SSIS packages can be optimized to perform these lookups and joins efficiently by tuning package control flow and data flow properties. In addition, it might be necessary to ensure that database object maintenance activities are done on a regular basis, including index defragmentation and statistics updates. The SSIS Lookup transform comes in very handy when looking up dimension keys. For example, when loading a fact table with customer transactions, we look up the customer's business key against the business keys stored in the customer dimension. If we find a match, we retrieve the customer's surrogate key and insert it into the fact table along with the corresponding transactions. In some cases, the best alternative to overcome SSIS Lookup limitations is to use T-SQL joins. Joins performed in the database engine are set-based and can take advantage of indexes and up-to-date statistics, which often makes them the better choice for large data volumes when both tables live in the same database.
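The same surrogate key lookup expressed as a set-based T-SQL join (all table and column names are hypothetical):

    -- Resolve the customer surrogate key with a join during the fact load.
    INSERT INTO dbo.FactSales (CustomerKey, OrderDateKey, Amount)
    SELECT  d.CustomerKey,
            s.OrderDateKey,
            s.Amount
    FROM    Stage.Sales AS s
    JOIN    dbo.DimCustomer AS d
            ON d.CustomerBusinessKey = s.CustomerBusinessKey;
    -- A LEFT JOIN variant, filtering on d.CustomerKey IS NULL, would
    -- surface source rows whose business key has no dimension match.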

Non-blocking, semi-blocking, and blocking transforms in SSIS

Not all Data Flow transforms are created equal. In fact, some are much faster and require a lot less memory (non-blocking transforms). Others are slower while consuming much more memory (blocking transforms), and some are in between (semi-blocking transforms). For example, Derived Column is non-blocking, Merge Join is semi-blocking, and Sort and Aggregate are blocking. We do not always have a choice if we need a certain transform to do a certain task, but it is good to know the differences between these types of transforms so that we can better understand performance and resource utilization issues.

Synchronous and asynchronous

There is another related aspect of Data Flow transforms: whether a transform can process each row as it arrives, independently of any rows that come before or after it (synchronous transforms), or whether it depends on some or all of the other rows (asynchronous transforms). On the whole, non-blocking transforms are also synchronous, and thus are fast and memory-efficient, while blocking transforms are asynchronous and must buffer their input before producing output.
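One common way to avoid the blocking Sort transform is to push the sort into the source query and then mark the source output as sorted (the IsSorted and SortKeyPosition properties in the source's Advanced Editor). The source query itself is plain T-SQL:

    -- Sorting in the database engine instead of the SSIS Sort transform:
    -- the engine can use an index on the key, and the Data Flow stays
    -- non-blocking. Table and column names are hypothetical.
    SELECT  CustomerBusinessKey,
            CustomerName,
            City
    FROM    dbo.Customer
    ORDER BY CustomerBusinessKey;  -- matches SortKeyPosition = 1 on the output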

Implement fuzzy grouping, fuzzy lookup with SSIS

A frequent problem you may experience is bringing in data from multiple sources that needs to be unified, consolidated, and/or deduplicated. In the case of an enterprise maintaining customer or product sales information for various business entities, e.g., retail and web sales, or different CRM programs that are the result of a company merger, all of this information has to be brought under one roof within the data warehouse. SQL Server provides two fuzzy transformations to help with such scenarios:

- Fuzzy Grouping
- Fuzzy Lookup

These transformations can be used independently or in concert to assist with unification and deduplication of your data. Both fuzzy transformation algorithms create and utilize temporary tables within SQL Server.

Fuzzy grouping

Fuzzy grouping is used primarily for deduplicating and standardizing values in column data. It takes input parameters, such as the columns to match on and a minimum similarity threshold, and the algorithm groups the rows that are likely duplicates, assigning each row a similarity score.
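Fuzzy Grouping adds _key_in, _key_out, and _score columns to its output, and marks each group's canonical row by setting _key_in equal to _key_out. If the output is landed in a staging table (the name below is hypothetical), the deduplicated set is a simple query:

    -- Canonical rows: the representative Fuzzy Grouping chose per group.
    SELECT  CustomerName, City, _score
    FROM    Stage.CustomerFuzzyGrouped
    WHERE   _key_in = _key_out;

    -- Duplicates mapped back to their canonical row.
    SELECT  d.CustomerName AS DuplicateName,
            c.CustomerName AS CanonicalName,
            d._score
    FROM    Stage.CustomerFuzzyGrouped AS d
    JOIN    Stage.CustomerFuzzyGrouped AS c
            ON c._key_in = d._key_out
    WHERE   d._key_in <> d._key_out;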