
Perform domain management in DQS

A knowledge base consists of domains. Each domain represents the data in a data field, and each value in a data field or domain is known as a domain value. DQS provides the ability to validate, cleanse, match, and deduplicate values from any dataset against domain values in the DQS Knowledge Base. Domains are created by performing a domain management activity. To create a DQS domain for valid supplier names in the Suppliers Knowledge Base, follow these steps:

1. Open the Data Quality Services client and connect to the DQS instance.
2. Click the Open Knowledge Base button in the Knowledge Base Management section.
3. Select Suppliers from the Knowledge Base list.
4. Click Domain Management in the Select Activity list, then click Next.
5. Click the Create A Domain icon.
6. Type the following information in the Create Domain window:
   Domain Name: SupplierName
   Description: List of valid Supplier Names
7. Select String from the Data Type drop-down list.
8. Check the Use Leading Values option.
…

Determine appropriate use cases for a DQS Knowledge Base

Data Quality Services can be used for a wide variety of use cases to ensure that data is suited for business usage. Data can be cleansed manually or computer-assisted, by establishing matching and cleansing rules or by leveraging the cloud-based services of reference data providers. Incorrect or bad data is often a symptom of weak or missing data validation rules during data entry. Other forms of inconsistent data can arise from data integration processes and data corruption. Bad, incorrect, and inconsistent data create inefficiencies in business processes and can lead to loss of revenue and increased operational costs. The impact of bad data is often amplified during data analysis, reporting, data mining, and data warehousing. The main use cases for a DQS Knowledge Base involve the need for business users, information workers, or IT professionals to create, maintain, and execute data quality operations resulting from data incompleteness, inconsistency, inaccuracy, invalidity, duplication…

Create a Data Quality Services (DQS) knowledge base

Install DQS

A Data Quality Knowledge Base (DQKB) is a repository of knowledge and metadata that describes and defines the data domains and rules used to cleanse and match data. The DQKB is used in an organization to identify and correct invalid or incorrect data and to ensure that data is consistent and trustworthy. A DQKB stores domain values that have been validated, along with terms, spelling errors, business rules, and reference data that can be used to perform data quality cleansing and matching actions on the data. A Data Quality Services knowledge base can be created as a new, empty knowledge base or by importing an existing DQS knowledge base from a DQS (.dqs) file. In our example, we will be creating a Suppliers DQS Knowledge Base. To create a new, empty DQS knowledge base, follow these steps:

1. Open the Data Quality Services client and connect to the DQS instance.
2. Click the New Knowledge Base button in the Knowledge Base Management section.
3. Type the following…

Design a partition structure that supports the quick loading and scale-out of data

Scaling out data refers to the mechanisms put in place to spread the processing of data across as many servers as necessary to improve querying and data loading performance. It should not be confused with scaling up, which refers to increasing the resources of a single server, such as CPU and memory. Scaling up is usually the initial step toward scaling a database: as resource utilization demand increases, additional hardware, such as processing power and memory, can satisfy the need. There are several factors that can dictate what data scale-out solution can be implemented. The first factor is how often data changes. If data changes frequently, certain scale-out solutions that require data replication might not work because data contention might increase. Data that is inserted or updated frequently needs to be constantly replicated across multiple databases, and the cost of replicating those changes might be too high and degrade performance instead. A second factor…
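As a minimal sketch of how a partition structure can spread storage I/O for loading and scanning, the statements below map yearly partitions to separate filegroups, each with a file that would ideally sit on its own physical volume. The database, filegroup, file path, and object names (WideWorldImportersDW, FG_2014 through FG_2016, PF_SaleYear, PS_SaleYear) are illustrative assumptions, not names taken from this post.

-- Filegroups that will each hold one year of fact data.
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2014;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2015;
ALTER DATABASE WideWorldImportersDW ADD FILEGROUP FG_2016;

-- One file per filegroup, ideally each on a different drive or LUN.
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2014', FILENAME = 'E:\Data\Sale_2014.ndf') TO FILEGROUP FG_2014;
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2015', FILENAME = 'F:\Data\Sale_2015.ndf') TO FILEGROUP FG_2015;
ALTER DATABASE WideWorldImportersDW ADD FILE (NAME = 'Sale_2016', FILENAME = 'G:\Data\Sale_2016.ndf') TO FILEGROUP FG_2016;

-- Yearly boundaries; the scheme places each year on its own filegroup.
CREATE PARTITION FUNCTION PF_SaleYear (date)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01', '2016-01-01');

CREATE PARTITION SCHEME PS_SaleYear
AS PARTITION PF_SaleYear TO ([PRIMARY], FG_2014, FG_2015, FG_2016);

With this layout, loads into the current year and queries against older years can be serviced by different drives, which helps spread both load and query I/O.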

Implement partition elimination

As previously described, partition-aligned indexes used in queries that execute frequently can yield tremendous performance gains due to partition elimination. During partition elimination, entire partitions are skipped from being read, and the query plan becomes much more efficient because only the rows needed by the query are read. A query lists the Top 10 Customers by Revenue for the year 2015. Notice that the WHERE clause aligns perfectly with the partition column and one of the boundary ranges of the partition function. Only the 2015 partition should be read, and the partitions for 2014 and 2016 should be eliminated from the query plan.

-- Top 10 Customers by Revenue for 2015
SELECT TOP 10 [Customer Key],
       SUM([Quantity] * [Unit Price]) AS Revenue
FROM [Fact].[SaleModified]
WHERE [Invoice Date Key] BETWEEN '2015-01-01' AND '2015-12-31'
  AND [Customer Key] <> 0
GROUP BY [Customer Key]
ORDER BY Revenue DESC;

The query lists the Top 10 Customers by…
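To confirm that elimination is actually happening for a query like this, check the Actual Partition Count property on the scan or seek operator in the actual execution plan (it should show 1 here), and map the filter values to partition numbers. The sketch below assumes the partition function is named PF_InvoiceDate and the table is [Fact].[SaleModified]; substitute your own names.

-- Which partition do the 2015 rows live in? (PF_InvoiceDate is an assumed name.)
SELECT $PARTITION.PF_InvoiceDate('2015-06-15') AS PartitionNumber;

-- Row counts per partition for the fact table, to sanity-check the boundaries.
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
JOIN sys.objects AS o ON o.object_id = p.object_id
WHERE o.name = 'SaleModified'
  AND p.index_id IN (0, 1)   -- heap or clustered index only
ORDER BY p.partition_number;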

Implement sliding windows

A sliding window partitioning strategy is a common data management practice in most data warehouse environments with a data retention policy that establishes how long data is retained in operational database tables. The retention policy can establish how often data is purged or archived and how far back data should be retained in the operational database. Highly regulated industries, such as healthcare and finance, have legal requirements to retain data for a certain period before it can be deleted. Organizations in these regulated industries with high volumes of data often put a data archival and deletion strategy in place that allows them to maintain compliance while keeping their operational databases streamlined with data that is actively accessed. A sliding window partition strategy provides a mechanism to move large volumes of data in and out of a table quickly and with minimal or zero disruption. For example, consider a sliding window partition strategy where…
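A minimal sketch of one slide of the window, assuming yearly RANGE RIGHT boundaries ('2014-01-01', '2015-01-01', '2016-01-01') on a partition function named PF_InvoiceDate, a scheme named PS_InvoiceDate, and an archive table [Fact].[SaleModified_Archive] with an identical structure created on the same filegroup as the partition being aged out; all of these names are assumptions used for illustration.

-- 1. Prepare a new partition at the leading edge for incoming 2017 data.
ALTER PARTITION SCHEME PS_InvoiceDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_InvoiceDate() SPLIT RANGE ('2017-01-01');

-- 2. Age out the oldest year. With RANGE RIGHT, partition 1 is kept empty
--    (values below the first boundary), so 2014 lives in partition 2.
--    SWITCH is a metadata-only operation, so the move is nearly instant.
ALTER TABLE [Fact].[SaleModified]
SWITCH PARTITION 2 TO [Fact].[SaleModified_Archive];

-- 3. Remove the now-empty trailing boundary so the window keeps three years.
ALTER PARTITION FUNCTION PF_InvoiceDate() MERGE RANGE ('2014-01-01');

The archive table can then be dropped, compressed, or moved to cheaper storage, depending on the retention policy.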

Design a partition structure to support a data warehouse

A table partition consists of a partition column, a partition function, and a partition scheme. The partition column is the column in the table that holds the values used as boundary values to group rows of data into the corresponding partitions. Only one column can be defined as the partition column. SQL Server uses the partition column to determine which partition new and existing rows are stored in. SQL Server also uses it for partition elimination, to determine which partitions need to be read based on the WHERE clause or JOIN condition of a query. The partition function specifies a set of range boundaries based on the values of the partition column. The boundaries can be defined as RANGE LEFT or RANGE RIGHT. A boundary defined as RANGE LEFT means that the boundary value belongs to its left partition; the boundary value is the last value in the left partition. A boundary defined as RANGE RIGHT means that the boundary value belongs to its right partition; the boundary value is the first value in the right partition…
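A minimal sketch of the three objects working together, using assumed names (PF_InvoiceDate, PS_InvoiceDate) and a simplified version of the fact table referenced elsewhere in these posts:

CREATE PARTITION FUNCTION PF_InvoiceDate (date)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01', '2016-01-01');
-- RANGE RIGHT: each boundary value is the first value of the partition to its
-- right, so '2015-01-01' lands in the 2015 partition. With RANGE LEFT it would
-- instead be the last value of the 2014 partition.

CREATE PARTITION SCHEME PS_InvoiceDate
AS PARTITION PF_InvoiceDate ALL TO ([PRIMARY]);

-- Simplified fact table; [Invoice Date Key] is the single partition column.
CREATE TABLE [Fact].[SaleModified]
(
    [Sale Key]         bigint        NOT NULL,
    [Customer Key]     int           NOT NULL,
    [Invoice Date Key] date          NOT NULL,
    [Quantity]         int           NOT NULL,
    [Unit Price]       decimal(18,2) NOT NULL
) ON PS_InvoiceDate ([Invoice Date Key]);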

Design an appropriate storage solution, including hardware, disk, and file layout

The data warehouse storage's performance can be measured by analyzing the time it takes for read and write operations to complete, known as latency, and by analyzing how much data volume it can read or write in a given time frame, known as throughput. High latency and low throughput can be an indication of an improper selection of the storage infrastructure or of incorrect hardware and software configurations. Performance of a storage subsystem can be estimated based on several criteria:

Volume of data
Rate of change and growth of the data
Data retention policy
Number of concurrent users
Typical workloads expected

Two important metrics associated with storage are throughput and input/output operations per second (IOPS). Throughput is a measure of the amount of data that can be transferred in a given period, typically measured in megabytes or gigabytes per second. IOPS is the number of I/O read or write operations that can complete in a second. It is important…
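The two metrics are related through the average I/O size: throughput ≈ IOPS × I/O size, so a volume sustaining 10,000 IOPS at 64 KB per operation delivers roughly 625 MB/s. To see what the storage is actually delivering to SQL Server, the cumulative file statistics DMV can be queried; this is a minimal sketch, and the figures are averages since the instance last started.

-- Observed latency and data volume per database file.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads + vfs.num_of_writes AS total_io,
    (vfs.num_of_bytes_read + vfs.num_of_bytes_written) / 1048576.0 AS total_mb_transferred,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms * 1.0 / vfs.num_of_reads END AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms * 1.0 / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;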

Select appropriate indexes

A query is considered selective if the execution plan utilizes a more efficient operation, such as an index seek, rather than a table or index scan. SQL Server uses an internal cost-based optimizer, known as the Query Optimizer, to determine the execution plan with the lowest cost while also taking into consideration the cost of finding candidate plans. The execution plan operation chosen by the Query Optimizer depends not just on the query's WHERE clause predicate but also on other factors such as indexes and column statistics. Examples of low-selectivity queries include queries that aggregate sales data by year or queries that average profit across products. Performance tuning of these non-selective queries can be challenging using traditional rowstore indexes. For these workloads a columnstore index is better suited, because its highly compressed, column-oriented storage and batch mode processing outperform disk-based rowstore index scans. For data warehouse workloads, a common index design approach is to implement a…
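For a fact table, a common pattern is a clustered columnstore index to serve the scan-heavy analytic queries, optionally complemented by a rowstore index for selective lookups. A minimal sketch, assuming the simplified [Fact].[SaleModified] table from the earlier examples and illustrative index names; on SQL Server 2016 and later, a nonclustered rowstore index can coexist with the clustered columnstore index.

-- Columnstore for the large, low-selectivity aggregations.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SaleModified
    ON [Fact].[SaleModified];

-- B-tree for the occasional selective lookup, e.g., a single customer.
CREATE NONCLUSTERED INDEX IX_SaleModified_CustomerKey
    ON [Fact].[SaleModified] ([Customer Key])
    INCLUDE ([Invoice Date Key], [Quantity], [Unit Price]);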