Handling of Skew The distribution of tuples to disks may be skewed -that is,some disks have many tuples,while others may have fewer tuples. Types of skew: Attribute-value skew. Some values appear in the partitioning attributes of many tuples;all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning Partition skew. With range-partitioning,badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash-function is chosen. Database System Concepts-6th Edition 18.12 @Silberschatz,Korth and Sudarshan
Database System Concepts - 6 18.12 ©Silberschatz, Korth and Sudarshan th Edition Handling of Skew The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples. Types of skew: Attribute-value skew. Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning. Partition skew. With range-partitioning, badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash-function is chosen
Handling Skew in Range-Partitioning To create a balanced partitioning vector (assuming partitioning attribute forms a key of the relation): Sort the relation on the partitioning attribute. Construct the partition vector by scanning the relation in sorted order as follows. After every 1/nth of the relation has been read,the value of the partitioning attribute of the next tuple is added to the partition vector. n denotes the number of partitions to be constructed. Duplicate entries or imbalances can result if duplicates are present in partitioning attributes. Alternative technique based on histograms used in practice Database System Concepts-6th Edition 18.13 @Silberschatz,Korth and Sudarshan
Database System Concepts - 6 18.13 ©Silberschatz, Korth and Sudarshan th Edition Handling Skew in Range-Partitioning To create a balanced partitioning vector (assuming partitioning attribute forms a key of the relation): Sort the relation on the partitioning attribute. Construct the partition vector by scanning the relation in sorted order as follows. After every 1/n th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. n denotes the number of partitions to be constructed. Duplicate entries or imbalances can result if duplicates are present in partitioning attributes. Alternative technique based on histograms used in practice
Handling Skew using Histograms Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation,or sampling (blocks containing)tuples of the relation 50 40 Kouanbalj 30 20 10 1-5 6-10 11-1516-2021-25 value Database System Concepts-6th Edition 18.14 @Silberschatz,Korth and Sudarshan
Database System Concepts - 6 18.14 ©Silberschatz, Korth and Sudarshan th Edition Handling Skew using Histograms Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
Handling Skew Using Virtual Processor Partitioning Skew in range partitioning can be handled elegantly using virtual processor partitioning: create a large number of partitions(say 10 to 20 times the number of processors) Assign virtual processors to partitions either in round-robin fashion or based on estimated cost of processing each virtual partition Basic idea: If any normal partition would have been skewed,it is very likely the skew is spread over a number of virtual partitions Skewed virtual partitions get spread across a number of processors,so work gets distributed evenly! Database System Concepts-6th Edition 18.15 @Silberschatz,Korth and Sudarshan
Database System Concepts - 6 18.15 ©Silberschatz, Korth and Sudarshan th Edition Handling Skew Using Virtual Processor Partitioning Skew in range partitioning can be handled elegantly using virtual processor partitioning: create a large number of partitions (say 10 to 20 times the number of processors) Assign virtual processors to partitions either in round-robin fashion or based on estimated cost of processing each virtual partition Basic idea: If any normal partition would have been skewed, it is very likely the skew is spread over a number of virtual partitions Skewed virtual partitions get spread across a number of processors, so work gets distributed evenly!
Interquery Parallelism Queries/transactions execute in parallel with one another. Increases transaction throughput;used primarily to scale up a transaction processing system to support a larger number of transactions per second. Easiest form of parallelism to support,particularly in a shared-memory parallel database,because even sequential database systems support concurrent processing. More complicated to implement on shared-disk or shared-nothing architectures Locking and logging must be coordinated by passing messages between processors Data in a local buffer may have been updated at another processor. Cache-coherency has to be maintained-reads and writes of data in buffer must find latest version of data. Database System Concepts-6th Edition 18.16 @Silberschatz,Korth and Sudarshan
Database System Concepts - 6 18.16 ©Silberschatz, Korth and Sudarshan th Edition Interquery Parallelism Queries/transactions execute in parallel with one another. Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second. Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. More complicated to implement on shared-disk or shared-nothing architectures Locking and logging must be coordinated by passing messages between processors. Data in a local buffer may have been updated at another processor. Cache-coherency has to be maintained — reads and writes of data in buffer must find latest version of data