Posts

Showing posts with the label ETL

Clean data by using the SSIS DQS task

Image
SQL Server Integration Services (SSIS) includes a DQS Transformation to correct data from a  connected data source. The DQS Transformation Output column is populated with one of five  status as listed below. You can use a Conditional Split Transformation to separate  out the output records based on its status. You can apply different business logic to each  stream or insert them to a table of file destination for manual processing. To clean data using the SSIS DQS task follow these steps: Open a new SSIS package in Visual Studio Create a new Data Flow Task Add a Source Component to extract the data to be cleansed Add a DQS Cleansing transformation and connect it to the Source Component Output Edit the DQS Cleansing transformation and set the Data Quality connection manager and Data Quality Knowledge Base to use for the cleansing activity Click on the Mapping tab and map the Input Columns to the corresponding Knowledge Base Domain, then click OK Add a C...

Deploy SSIS packages by using the deployment utility, SQL Server, and file systems

Image
For project deployment, you  must ensure that all the related objects needed for the package execution are also part of the  package. For example, when a Connection Manager is created, it is specific for the package  within which it was created, unless it is promoted to a project-level Connection Manager by  right-clicking it and selecting Convert To Project Connection. Similarly, any parameter that you want to be included at the project level, such as, Server- Name, DatabaseName, FolderName, etc. can be created at the project level so it only needs to  be changed in one place and will be available for all the packages in the project. And once the  project has been deployed to a different server, for example to the Development, QA, or Production  server, the parameter value only needs to be changed in one place. The Project.params tab opened with project parameters, and  also under Connection Managers, you see the ...

Create an SSIS catalog

Image
SQL Server 2012 introduced a new paradigm in storing and deploying SSIS packages. Before  that release, all packages were individually managed and deployed. The SSIS DB catalog and SSISDB database serves as a centralized place where deployed SSIS  packages can be stored along with everything related to it, including: Project and package parameters Configuring runtime values for packages Executing and troubleshooting packages Managing Integration Services server operations You find the SSISDB catalog in the folder labeled Integration Services Catalogs. This is where  you find the Projects folder containing each project with all its packages. And the Environments  folder contains the package configuration and environment values, which can be set  before executing the packages. Each SQL Server instance gets only one SSISDB. You also find another SSISDB in the Databases folder in SQL Server (Figure 2-113)....

Error handling at package level in SSIS

Image
There are many causes of errors and failures when a package is running, either in Debug Mode  or in Production. One of the best and simplest ways to manage error handling is with the Event  Handler tab in the SSIS Visual Studio design environment. Under the Event Handler drop-down  list, you can see all of the event types that can be triggered to fire off a notification, email, or  insert the error details into a file, or a database table. From this tab, you can choose from any (or all) of the components within your whole package  to contain an event handler. Although each component can have its own custom event  handler, generally the most common error handling method is to put the error handling at  the highest level, the package, so that all of the errors float up and can be processed from one  centralized location. In every other respect, you create Event Handlers just like any package—with Control Flows  and ...

Implement error handling for data types in SSIS

Image
When data is being moved from a source to a destination, there you may experience different  types of issues related to the datatype errors and column data truncation. When column data  coming from the source is bigger than the target column it is going into, SSIS fails the package.  But the Error Output page in the For Data Flow tasks provides a way to tell SSIS how to handle  either errors or column truncation for each individual columns. The three options  are: Ignore Failure Continues processing on error or truncating and disregard any errors Redirect Row Sends the offending row out to the output Fail Component Fails the task when an error or truncation occurs If you want to just send the error or truncated records to an error file or table, you can  change the setting to Redirect Row. In this way, you can set up an error output destination. In the case of column truncation, you h...

Enable logging for package execution in SSIS

Image
Logging can be enabled and used at design time and runtime. To begin logging, in the Visual  Studio menu select SSIS > Logging to open the Configure SSIS Logs window. In  the Containers section, select the Control Flow objects for logging—the overall package itself  and/or each of the individual tasks within the package. Then in the Providers and Logs tab,  under the Add A New Log > Provider Type, select where you want the logging to be saved to,  and click the Add button: After adding each of the Provider Types selected you want to log to, go to the Select the  Logs to Use for the Container section to set up the Connection in the Configuration column.  In the case of a Text file, select a file to write to. For SQL Server DB as the target,  simply select the target database. You are not limited to selecting only one target location for  the logging, multiple destinations can be added and they will all be wri...

Fix performance, connectivity, execution, and failed logic issues by using the debugger in SSIS

Image
Checking your execution results progress tab When you are testing your packages, and troubleshooting issues, the details of the execution  results can be found in the last tab of the package design environment. A point of possible  confusion can be how SSIS changes the name of this one tab depending on if the package is  executing or the package execution has been stopped. The last tab is labeled as follows: Execution Results tab When the package is not executing you can see the results of  the last run. Progress tab When a package execution is in-progress you can see the real-time  execution results. Once the package execution has been stopped, you can still see these  results under the Execution Results tab. Package execution reports in Management Studio Once you have a package executing successfully, you want to monitor and track its progress.  SQL Server Management Studio (SSMS) provides a very detailed, color-coded package execution...

Implement an ETL solution that supports incremental data loading

Image
Design a control flow to load change data In creating an incremental load Data Flow, you can create a dynamic query within the OLE DB  Source Editor under Connection Manager page, go to the Data Access Mode drop-down and  select SQL Command. This opens up a free text pane below labeled SQL Command Text. Here  we can enter the query against the source table with a variable parameter in the WHERE clause.  The parameter is identified by the “?” in the SQL text. “?” is the syntax used by the SSIS expression  language for a parameter. With the SQL query in place, click the Parameters button to map the SQL query parameter  to the SSIS variable @LastModifiedDate, which will pass in the value of MAX(ModifiedDate)  from the fact table. Make sure the Param Direction is set to Input because the SSIS  variable value is being passed in to be used by the query. Load data by using Transa...

Implement an ETL solution that supports incremental data extraction

Image
Design fact table patterns Fact tables used in data warehouses can become the very large, very fast by order of magnitude.  It’s important to implement best practices for fact table design at the onset, even if your  fact table does not contain billions of rows. Here are some things to be aware and include in  your fact table design: Consider partitioning your fact tables when they are larger than 100 GB. The partitions can be by any column, which provides a fairly equal distribution of your data. Usually the Date columns (not DateTime columns) are good candidates. This provides more granularity when creating indexes, loading, updating, and deleting data (by partitions) Faster data access for queries by limiting the IO to specific partitions Utilize partition switching to move data in and out almost instantaneously Implement sliding window using partition splitting and partition merging Build clustered indexes on the Date Key of the fact tables (not...

Determine appropriate scenarios for Transact-SQL joins versus SSIS lookup

Some common design patterns for data warehouse loads invlove extracting data from one or  more sources and performing business and dimension key column lookups and joins. In most  instances, SSIS packages can be optimized to perform these lookups and joins efficiently by  tuning package control and data flow properties. In addition, it might be necessary to ensure  that database object maintenance activities are done in a regular basis including, index defragmentation  and statistics update. The SSIS Lookup transform comes very handy when doing lookup of dimension keys. For  example, when loading a fact table with customer transactions, we lookup the customer’s  business key against the business keys stored in the customer dimension. If we find a match, we  retrieve the customer’s surrogate key and insert it in the fact table along with the corresponding  transactions. In some cases, the best alternative to overcome SSIS Lookup limit...

Non-blocking, semi-blocking, and blocking transforms in SSIS

Not all Data Flow transforms are created equal. In fact, some are much faster and require a  lot less memory (non-blocking transforms). Others are slower while consuming much more  memory (blocking transforms), and some are in between (semi-blocking transforms). We don’t  always have a choice if we need to use a certain transform to do a certain specified task, but  it is good to know the differences between these types of transforms so that we can better  understand performance and resource utilization issues. Synchronous and asynchronous There is another related aspect about Data Flow transforms, which is their ability to quickly  process a row as they are coming into the transform, independently of any other rows that  came before or after (synchronous transforms). The other type of transform needs to be  dependent on some or all of the rows that come before and after (asynchronous transforms). On the whole, non-blocking transforms a...

Implement fuzzy grouping, fuzzy lookup with SSIS

Image
A frequent problem that you may experience when you bring data from multiple sources  that need to be unified and consolidated and/or needs to be deduplicated. In the case of an  enterprise maintaining multiple customer or product sales information in for various business  entities, e.g. retail and web sales, or possibly different CRM programs that are a result of  a company merger, all of this information has to be brought under one roof within the data  warehouse. SQL Server provides two fuzzy transformations to help with such scenarios: Fuzzy grouping Fuzzy lookup These transformations can be used independently or in concert to assist with unification  and deduplication of your data. Both the fuzzy transformation algorithms create and utilize  temporary tables created within SQL Server. Fuzzy grouping Fuzzy grouping is used primarily for deduplicating and standardizing values in column data. Fuzzy grouping has input parameters a...