
Perform domain management in DQS

A knowledge base consists of domains. Each domain represents the data in a data field, and each value in a data field or domain is known as a domain value. DQS provides the ability to validate, cleanse, match, and deduplicate values from any dataset against domain values in the DQS Knowledge Base. Domains are created by performing a domain management activity. To create a DQS domain for valid supplier names in the Suppliers Knowledge Base, follow these steps:

1. Open the Data Quality Services client and connect to the DQS instance.
2. Click the Open Knowledge Base button in the Knowledge Base Management section.
3. Select Suppliers from the Knowledge Base list.
4. Click Domain Management in the Select Activity list, then click Next.
5. Click the Create A Domain icon.
6. Type the following information in the Create Domain window:
   Domain Name: SupplierName
   Description: List of valid Supplier Names
7. Select String from the Data Type drop-down list.
8. Check the Use Leading Values option.
…

Determine appropriate use cases for a DQS Knowledge Base

Data Quality Services can be used for a wide variety of use cases to ensure that data is suited for business usage. Data can be cleansed manually or computer-assisted, by establishing matching and cleansing rules or by leveraging the cloud-based services of reference data providers. Incorrect or bad data is often a symptom of weak or missing data validation rules during data entry. Other forms of inconsistent data can arise from data integration processes and data corruption. Bad, incorrect, and inconsistent data create inefficiencies in business processes and can lead to loss of revenue and increased operational costs. The impact of bad data is often amplified during data analysis, reporting, data mining, and data warehousing. The main use cases for a DQS Knowledge Base involve the need for business users, information workers, or IT professionals to create, maintain, and execute data quality operations resulting from data incompleteness, inconsistency, inaccuracy, invalidity, duplication…

Create a Data Quality Services (DQS) knowledge base

Install DQS

A Data Quality Knowledge Base (DQKB) is a repository of knowledge and metadata that describes and defines the data domains and rules used to cleanse and match data. The DQKB is used in an organization to identify and correct invalid or incorrect data and to ensure that data is consistent and trustworthy. A DQKB stores domain values that have been validated, along with terms, spelling errors, business rules, and reference data that can be used to perform data quality cleansing and matching actions on the data. A Data Quality Services knowledge base can be created as a new, empty knowledge base or by importing an existing DQS knowledge base from a DQS (.dqs) file. In our example, we will be creating a Suppliers DQS Knowledge Base. To create a new, empty DQS knowledge base, follow these steps:

1. Open the Data Quality Services client and connect to the DQS instance.
2. Click the New Knowledge Base button in the Knowledge Base Management section.
3. Type the following…

Design a partition structure that supports the quick loading and scale-out of data

Scaling out data refers to the mechanisms put in place to spread the processing of data across as many servers as necessary to improve querying and data loading performance. It should not be confused with scaling up, which refers to increasing the resources of a single server, such as CPU and memory. Scaling up is usually the initial step toward scaling a database: as resource utilization demand increases, additional hardware, such as processing power and memory, can satisfy the need. There are several factors that can dictate what data scale-out solution can be implemented. The first factor is how often data changes. If data changes frequently, certain scale-out solutions that require data replication might not work because data contention might increase. Data that is inserted or updated frequently needs to be constantly replicated across multiple databases, and the cost of replicating those changes might be too high and degrade performance instead. A second factor…
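As a minimal sketch of how a partition structure can spread storage I/O for loading and scanning, the statements below map yearly partitions to separate filegroups, each with a file that would ideally sit on its own physical volume. The database, filegroup, file path, and object names (WideWorldImportersDW, FG_2014 through FG_2016, PF_SaleYear, PS_SaleYear) are illustrative assumptions, not names taken from this post.

-- Filegroups that will each hold one year of fact data.
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2014;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2015;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2016;

-- One file per filegroup, ideally each on a different drive or LUN.
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2014', FILENAME = 'E:\Data\Sale_2014.ndf') TO FILEGROUP FG_2014;
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2015', FILENAME = 'F:\Data\Sale_2015.ndf') TO FILEGROUP FG_2015;
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2016', FILENAME = 'G:\Data\Sale_2016.ndf') TO FILEGROUP FG_2016;

-- Yearly boundaries; the scheme places each year on its own filegroup.
CREATE PARTITION FUNCTION PF_SaleYear (date)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01', '2016-01-01');

CREATE PARTITION SCHEME PS_SaleYear
AS PARTITION PF_SaleYear TO ([PRIMARY], FG_2014, FG_2015, FG_2016);

With this layout, loads into the current year and queries against older years can be serviced by different drives, which helps spread both load and query I/O.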

Implement partition elimination

As previously described, partition-aligned indexes used in queries that execute frequently can yield tremendous performance gains due to partition elimination. During partition elimination, entire partitions are skipped from being read, and the query plan becomes much more efficient because only the rows needed by the query are read. A query lists the Top 10 Customers by Revenue for the year 2015. Notice that the WHERE clause aligns perfectly with the partition column and one of the boundary ranges of the partition function. Only the 2015 partition should be read, and the partitions for 2014 and 2016 should be eliminated from the query plan.

-- Top 10 Customers by Revenue for 2015
SELECT TOP 10 [Customer Key],
       SUM([Quantity] * [Unit Price]) AS Revenue
FROM [Fact].[SaleModified]
WHERE [Invoice Date Key] BETWEEN '2015-01-01' AND '2015-12-31'
  AND [Customer Key] <> 0
GROUP BY [Customer Key]
ORDER BY Revenue DESC;

The query lists the Top 10 Customers by…
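To confirm that elimination is actually happening for a query like this, check the Actual Partition Count property on the scan or seek operator in the actual execution plan (it should show 1 here), and map the filter values to partition numbers. The sketch below assumes the partition function is named PF_InvoiceDate and the table is [Fact].[SaleModified]; substitute your own names.

-- Which partition do the 2015 rows live in? (PF_InvoiceDate is an assumed name.)
SELECT $PARTITION.PF_InvoiceDate('2015-06-15') AS PartitionNumber;

-- Row counts per partition for the fact table, to sanity-check the boundaries.
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
JOIN sys.objects AS o ON o.object_id = p.object_id
WHERE o.name = 'SaleModified'
  AND p.index_id IN (0, 1)   -- heap or clustered index only
ORDER BY p.partition_number;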

Implement sliding windows

A sliding window partitioning strategy is a common data management practice in most data warehouse environments with a data retention policy that establishes how long data is retained in operational database tables. The retention policy can establish how often data is purged or archived and how far back data should be retained in the operational database. Highly regulated industries, such as healthcare and finance, have legal requirements to retain data for a certain period before it can be deleted. Organizations in these regulated industries with high volumes of data often put a data archival and deletion strategy in place that allows them to maintain compliance while keeping their operational databases streamlined with data that is actively accessed. A sliding window partition strategy provides a mechanism to move large volumes of data in and out of a table quickly and with minimal or zero disruption. For example, consider a sliding window partition strategy where…
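A minimal sketch of one slide of the window, assuming yearly RANGE RIGHT boundaries ('2014-01-01', '2015-01-01', '2016-01-01') on a partition function named PF_InvoiceDate, a scheme named PS_InvoiceDate, and an archive table [Fact].[SaleModified_Archive] with an identical structure created on the same filegroup as the partition being aged out; all of these names are assumptions used for illustration.

-- 1. Prepare a new partition at the leading edge for incoming 2017 data.
ALTER PARTITION SCHEME PS_InvoiceDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_InvoiceDate() SPLIT RANGE ('2017-01-01');

-- 2. Age out the oldest year. With RANGE RIGHT, partition 1 is kept empty
--    (values below the first boundary), so 2014 lives in partition 2.
--    SWITCH is a metadata-only operation, so the move is nearly instant.
ALTER TABLE [Fact].[SaleModified]
SWITCH PARTITION 2 TO [Fact].[SaleModified_Archive];

-- 3. Remove the now-empty trailing boundary so the window keeps three years.
ALTER PARTITION FUNCTION PF_InvoiceDate() MERGE RANGE ('2014-01-01');

The archive table can then be dropped, compressed, or moved to cheaper storage, depending on the retention policy.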

Design a partition structure to support a data warehouse

A table partition consists of a partition column, a partition function, and a partition scheme. The partition column is the column in the table that holds the values used as boundary values to group rows of data into the corresponding partitions. Only one column can be defined as the partition column. SQL Server uses the partition column to determine which partition new and existing rows are stored in. SQL Server also uses it for partition elimination, to determine which partitions need to be read based on the WHERE clause or JOIN condition of a query. The partition function specifies a set of range boundaries based on the values of the partition column. The boundaries can be defined as RANGE LEFT or RANGE RIGHT. A boundary defined as RANGE LEFT means that the boundary value belongs to its left partition; the boundary value is the last value in the left partition. A boundary defined as RANGE RIGHT means that the boundary value belongs to its right partition; the boundary value is the first value in the right partition…
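A minimal sketch of the three objects working together, using assumed names (PF_InvoiceDate, PS_InvoiceDate) and a simplified version of the fact table referenced elsewhere in these posts:

CREATE PARTITION FUNCTION PF_InvoiceDate (date)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01', '2016-01-01');
-- RANGE RIGHT: each boundary value is the first value of the partition to its
-- right, so '2015-01-01' lands in the 2015 partition. With RANGE LEFT it would
-- instead be the last value of the 2014 partition.

CREATE PARTITION SCHEME PS_InvoiceDate
AS PARTITION PF_InvoiceDate ALL TO ([PRIMARY]);

-- Simplified fact table; [Invoice Date Key] is the single partition column.
CREATE TABLE [Fact].[SaleModified]
(
    [Sale Key]         bigint        NOT NULL,
    [Customer Key]     int           NOT NULL,
    [Invoice Date Key] date          NOT NULL,
    [Quantity]         int           NOT NULL,
    [Unit Price]       decimal(18,2) NOT NULL
) ON PS_InvoiceDate ([Invoice Date Key]);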

Design an appropriate storage solution, including hardware, disk, and file layout

The data warehouse storage's performance can be measured by analyzing the time it takes for read and write operations to complete, known as latency, and by analyzing how much data volume it can read or write in a given time frame, known as throughput. High latency and low throughput can be an indication of an improper selection of the storage infrastructure or of incorrect hardware and software configurations. Performance of a storage subsystem can be estimated based on several criteria:

Volume of data
Rate of change and growth of the data
Data retention policy
Number of concurrent users
Typical workloads expected

Two important metrics associated with storage are throughput and input/output operations per second (IOPS). Throughput is a measure of the amount of data that can be transferred in a given period, typically measured in megabytes or gigabytes per second. IOPS is the number of I/O read or write operations that can complete in a second. It is important…
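The two metrics are related through the average I/O size: throughput ≈ IOPS × I/O size, so a volume sustaining 10,000 IOPS at 64 KB per operation delivers roughly 625 MB/s. To see what the storage is actually delivering to SQL Server, the cumulative file statistics DMV can be queried; this is a minimal sketch, and the figures are averages since the instance last started.

-- Observed latency and data volume per database file.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads + vfs.num_of_writes AS total_io,
    (vfs.num_of_bytes_read + vfs.num_of_bytes_written) / 1048576.0 AS total_mb_transferred,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms * 1.0 / vfs.num_of_reads END AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms * 1.0 / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;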

Select appropriate indexes

A query is considered selective if the execution plan utilizes a more efficient operation, such as an index seek, rather than a table or index scan. SQL Server uses an internal cost-based optimizer, known as the Query Optimizer, to determine the execution plan with the lowest cost while also taking into consideration the cost of finding candidate plans. The execution plan operation chosen by the Query Optimizer depends not just on the query's WHERE clause predicate but also on other factors such as indexes and column statistics. Examples of low-selectivity queries include queries that aggregate sales data by year or queries that average profit across products. Performance tuning of these non-selective queries can be challenging using traditional rowstore indexes. For these workloads a columnstore index is better suited, because its highly compressed, column-oriented storage and batch mode processing outperform disk-based rowstore index scans. For data warehouse workloads, a common index design approach is to implement a…
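For a fact table, a common pattern is a clustered columnstore index to serve the scan-heavy analytic queries, optionally complemented by a rowstore index for selective lookups. A minimal sketch, assuming the simplified [Fact].[SaleModified] table from the earlier examples and illustrative index names; on SQL Server 2016 and later, a nonclustered rowstore index can coexist with the clustered columnstore index.

-- Columnstore for the large, low-selectivity aggregations.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SaleModified
    ON [Fact].[SaleModified];

-- B-tree for the occasional selective lookup, e.g., a single customer.
CREATE NONCLUSTERED INDEX IX_SaleModified_CustomerKey
    ON [Fact].[SaleModified] ([Customer Key])
    INCLUDE ([Invoice Date Key], [Quantity], [Unit Price]);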