Handling of Skew The distribution of tuples to disks may be skewed-that is,some disks have many tuples,while others may have fewer tuples. Types of skew: Attribute-value skew. Some values appear in the partitioning attributes of many tuples;all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning. Partition skew. With range-partitioning,badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash-function is chosen. Database System Concepts-5th Edition,Aug 22,2005. 21.12 @Silberschatz,Korth and Sudarshan
Database System Concepts - 5 21.12 ©Silberschatz, Korth and Sudarshan th Edition, Aug 22, 2005. Handling of Skew The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples. Types of skew: Attribute-value skew. Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning. Partition skew. With range-partitioning, badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash-function is chosen
Handling Skew in Range-Partitioning To create a balanced partitioning vector(assuming partitioning attribute forms a key of the relation): Sort the relation on the partitioning attribute. Construct the partition vector by scanning the relation in sorted order as follows. After every 1/nth of the relation has been read,the value of the partitioning attribute of the next tuple is added to the partition vector. n denotes the number of partitions to be constructed. Duplicate entries or imbalances can result if duplicates are present in partitioning attributes. Alternative technique based on histograms used in practice Database System Concepts-5th Edition,Aug 22,2005. 21.13 @Silberschatz,Korth and Sudarshan
Database System Concepts - 5 21.13 ©Silberschatz, Korth and Sudarshan th Edition, Aug 22, 2005. Handling Skew in Range-Partitioning To create a balanced partitioning vector (assuming partitioning attribute forms a key of the relation): Sort the relation on the partitioning attribute. Construct the partition vector by scanning the relation in sorted order as follows. After every 1/n th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. n denotes the number of partitions to be constructed. Duplicate entries or imbalances can result if duplicates are present in partitioning attributes. Alternative technique based on histograms used in practice
Handling Skew using Histograms Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation,or sampling(blocks containing)tuples of the relation 50 40 30 20 10 1-5 6-10 11-1516-2021-25 value Database System Concepts-5th Edition,Aug 22,2005. 21.14 @Silberschatz,Korth and Sudarshan
Database System Concepts - 5 21.14 ©Silberschatz, Korth and Sudarshan th Edition, Aug 22, 2005. Handling Skew using Histograms Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
Handling Skew Using Virtual Processor Partitioning Skew in range partitioning can be handled elegantly using virtual processor partitioning: create a large number of partitions(say 10 to 20 times the number of processors) Assign virtual processors to partitions either in round-robin fashion or based on estimated cost of processing each virtual partition Basic idea: If any normal partition would have been skewed,it is very likely the skew is spread over a number of virtual partitions Skewed virtual partitions get spread across a number of processors,so work gets distributed evenly! Database System Concepts-5th Edition,Aug 22,2005. 21.15 @Silberschatz,Korth and Sudarshan
Database System Concepts - 5 21.15 ©Silberschatz, Korth and Sudarshan th Edition, Aug 22, 2005. Handling Skew Using Virtual Processor Partitioning Skew in range partitioning can be handled elegantly using virtual processor partitioning: create a large number of partitions (say 10 to 20 times the number of processors) Assign virtual processors to partitions either in round-robin fashion or based on estimated cost of processing each virtual partition Basic idea: If any normal partition would have been skewed, it is very likely the skew is spread over a number of virtual partitions Skewed virtual partitions get spread across a number of processors, so work gets distributed evenly!