Design a partition structure that supports the quick loading and scale-out of data

Scaling-out the data refers to the mechanism put in place to spread the processing of data
across as many servers as necessary to improve querying and data loading performance. It should not be confused with scaling-up, which refers to increasing the resources of a server, such as CPU and memory resources. Scaling-up is usually the initial step towards scaling a database. As resource utilization demand increases, additional hardware, such as processing power and memory, can satisfy this need.

There are several factors that can dictate what data scale-out solution can be implemented. The first factor depends on how often data changes. If data changes frequently certain scale out solutions that require data replication might not work because data contention might increase. Data that is inserted or updated frequently needs to be constantly replicated across multiple databases. The cost of replicating changes might be too high and degrade performance instead.

A second factor to consider is the ability to implement a scale-out solution that is transparent to the user or reporting applications. Some scale-out solutions might require changes to table relationships, stored procedures, and the way user queries are written. Certain reporting applications might not be able to support a scale-out solution that requires executing distributed queries or querying linked servers.

A third factor is the ability to partition the data in a way that aligns to typical query patterns. An effective scale-out solution involves partitioning data across several databases located on separate servers to handle queries that require a portion of the data. For example, data that is partitioned by customer geography, where each database on each server contains only customers from a region such as North America or European regions, works if all queries filter on region. If the queries typically select customers and their transactions across regions, this partition strategy does not work because it requires that all databases across all servers be queried.

In this last case, no performance benefits can be observed from partition elimination because no databases are eliminated from the query execution plan. There might still be a performance benefit from this scale-out solution if the combined processing power of all servers exceeds the added processing overhead of assembling the results.

A fourth and last factor in the type of scale-out solution that can be implemented is the interdependence of the data. In this scale-out solution, data is split based on how data is typically related and used within areas of the database. For example, fact tables use foreign key relationship constraints to enforce referential integrity. A partition strategy that involves splitting dimension and fact tables on different databases does not work in this case because referential integrity enforcement across databases is not supported. A better partition approach is to group related tables together that only relate to each other and don’t have relationships to the rest of the data model. This group of related tables that are isolated from the rest are often referred to as data marts.

There are many options to scale-out the data, but the six more frequently implemented scale-out solutions involve the following mechanisms:

  1. Allowing multiple database engines to access a single copy of the database.
  2. Replicating the database to multiple database servers.
  3. Implementing linked servers and executing distributed queries.
  4. Implementing distributed partitioned views.
  5. Routing queries to the correct database using some middleware service.
  6. Implementing a stretch database with Azure.

Comments

Popular Posts

Non-blocking, semi-blocking, and blocking transforms in SSIS

Implement additive, semi-additive, and non-additive measures

Implement an ETL solution that supports incremental data extraction