Implement data profiling, parallelism, transactions, logging, and security

Data profiling
Allows you to evaluate the content, structure, and quality of the source dataset. It cannot identify inaccurate or bad data, just metrics about the existing data to see how useful or difficult it will be to use the data. This includes the frequency of the values, the datatypes, NULLs, and length of data in the columns.

SSIS has a Data Profiler Task that can assist with profiling your data. This task saves the results of the data analysis in an XML file that can be accessed by another program, Data Profile Viewer.

Here are the types of information you can glean from it:
  • Column Length Distribution Profile
  • Column Null Ratio Profile
  • Column Pattern Profile
  • Column Statistics Profile
  • Column Value Distribution Profile
  • Candidate Key Profile
  • Functional Dependency Profile
  • Value Inclusion Profile
What is parallelism in SSIS?

Parallelism is simply running multiple SSIS tasks concurrently (together in parallel). So when you have multiple tables to be loaded that don’t depend on each other for their data (Customers, Products, Cities), they can be loaded at the same time, instead of loading Products, then loading Cities, and finally Customers. 

Running SSIS packages in parallel can be a big time saver, but can also be resource intensive depending on the number of tasks running in parallel, the table sizes, complexity of the transforms, network bandwidth, CPU, and disk IO speeds. As with all development, when you create packages to run in parallel, you need to monitor your server resources to see if the parallelization is maxing out any of the resources, which could cause everything to run slower thereby defeating the purpose of parallelizing tasks in the first place.

Control flow properties for parallelism
The performance of the Control Flow of an SSIS package can be optimized by setting a very important property known as MaxConcurrentExectutables

MaxConcurrentExecutables sets the number of tasks that can run concurrently in the Control Flow. When you use the default value MaxConcurrentExecutables (-1) that tells SSIS to set the number of parallel tasks to equal the number of Logical Processors in your CPU + 2 (it overcommits a bit in default mode). If your server has 16 logical processors, the maximum number of parallel tasks is 18 (16 + 2). But if you set MaxConcurrentExecutables = 10, it only allows a maximum of 10 tasks to run at the same time. This allows you to leave some CPU resources for other services.

Data flow properties for parallelism
EngineThreads is a property of each data flow that tells how many threads the data flow engine can create and run in parallel. The default value for EngineThreads = 10, which sets the amount of threads for the source (10) and for the worker threads (10), so the default setting can spawn 20 threads. 

A good rule of thumb is not to run more threads in parallel than the number of available processors. Having excessive threads running in parallel can cause context-switching between threads. The maximum number of EngineThreads is twice the number of logical processors. So if your server has 16 logical processors, the maximum EngineThreads is 32 (16 * 2).

The control flow tasks run in parallel until it reaches the MaxCuncurrentExecutable number of control flow items (18 for your sever, 16+2). But if the DataFlow EngineThreads limit is reached by the Control Flow tasks (EngineThread = 10 & DataFlow1 uses 6 EngineThreads & DataFlow2 uses 4 EngineThreads), none of the other DataFlows can run until some of the EngineThreads get released. So the parallelization depends on the number of available DataFlow EngineThreads and the ControlFlow MaxCuncurrentExecutable value.

Security
SSIS packages can be secured with a password. In doing so, there are a few options in securing your packages.
  • DontSaveSensitive Removes sensitive information upon saving
  • EncryptSensitiveWithUserKey Encrypts sensitive information with user profile key
  • EncryptSensitiveWithPassword Encrypts sensitive information with a password
  • EncryptAllWithPassword Encrypts the whole package with a password
  • EncryptAllWithUserKey Encrypts the whole package with user profile key
When the package is password-protected, you have to enter the password when opening the package to view and edit it. By default, there is a password used to secure the package, which is fine for development. But packages being put into production can use password protection. This can be difficult when working with individual packages, but as we’ll see later in the package deployment section, you can use Project level deployment, which can include all of the project items to secure and deploy the whole project.

Comments

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Implement additive, semi-additive, and non-additive measures

Implement an ETL solution that supports incremental data extraction