of compares required by this "median-of-three" quicksort is described by the recurrence

$$C_N = N + 1 + \sum_{1\le k\le N}\frac{(N-k)(k-1)}{\binom{N}{3}}\,\bigl(C_{k-1}+C_{N-k}\bigr) \qquad \text{for } N>3 \tag{4}$$

where $\binom{N}{3}$ is the binomial coefficient that counts the number of ways to choose 3 out of $N$ items. This is true because the probability that the $k$th smallest element is the partitioning element is now $(N-k)(k-1)/\binom{N}{3}$ (as opposed to $1/N$ for regular quicksort). We would like to be able to solve recurrences of this nature so that we can determine how large a sample to use and when to switch to insertion sort. However, such recurrences require more sophisticated techniques than the simple ones used so far. In Chapters 2 and 3, we will see methods for developing precise solutions to such recurrences, which allow us to determine the best values for parameters such as the sample size and the cutoff for small subarrays. Extensive studies along these lines have led to the conclusion that median-of-three quicksort with a cutoff point in the range 10 to 20 achieves close to optimal performance for typical implementations.

Radix-exchange sort. Another variant of quicksort involves taking advantage of the fact that the keys may be viewed as binary strings. Rather than comparing against a key from the file for partitioning, we partition the file so that all keys with a leading 0 bit precede all those with a leading 1 bit. Then these subarrays can be independently subdivided in the same way using the second bit, and so forth. This variation is referred to as "radix-exchange sort" or "radix quicksort." How does this variation compare with the basic algorithm? To answer this question, we first have to note that a different mathematical model is required, since keys composed of random bits are essentially different from random permutations. The "random bitstring" model is perhaps more realistic, as it reflects the actual representation, but the models can be proved to be roughly equivalent. We will discuss this issue in more detail in Chapter 8. Using a similar argument to the one given above, we can show that the average number of bit compares required by this method is described by the recurrence

$$C_N = N + \frac{1}{2^N}\sum_{k}\binom{N}{k}\bigl(C_k + C_{N-k}\bigr) \qquad \text{for } N>1 \text{ with } C_0 = C_1 = 0.$$
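Before developing the analytic machinery of later chapters, we can at least evaluate such a recurrence numerically. The sketch below (not from the text; the function name is mine) iterates the bit-compare recurrence just given, taking the unlabeled sum to run over 0 ≤ k ≤ N, the natural reading since the left subfile can have any size from 0 to N. Because the k = 0 and k = N terms each contain $C_N$ itself, they are moved to the left-hand side before solving for $C_N$. The comparison with N lg N is included only as a rough point of reference.

```python
from math import comb, log2

def radix_exchange_compares(max_n):
    """Iterate C_N = N + 2^{-N} * sum_k binom(N,k)(C_k + C_{N-k}) with C_0 = C_1 = 0.

    The k = 0 and k = N terms of the sum each contribute C_N to the right-hand
    side, so they are moved to the left before solving; the remaining factor
    1 - 2^(1-N) is positive for N > 1.
    """
    c = [0.0] * (max_n + 1)            # c[0] = c[1] = 0 by definition
    for n in range(2, max_n + 1):
        s = sum(comb(n, k) * (c[k] + c[n - k]) for k in range(1, n))
        c[n] = (n + s / 2**n) / (1.0 - 2.0**(1 - n))
    return c

if __name__ == "__main__":
    c = radix_exchange_compares(100)
    for n in (10, 50, 100):
        print(n, round(c[n], 1), round(n * log2(n), 1))   # recurrence value vs. N lg N
```

Numerical iteration like this produces values, but not the concise characterization we are after; for that, the recurrence has to be solved analytically.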
This turns out to be a rather more difficult recurrence to solve than the one given earlier; we will see in Chapter 3 how generating functions can be used to transform the recurrence into an explicit formula for $C_N$, and in Chapters 4 and 8, we will see how to develop an approximate solution.

One limitation to the applicability of this kind of analysis is that all of the preceding recurrence relations depend on the "randomness preservation" property of the algorithm: if the original file is randomly ordered, it can be shown that the subarrays after partitioning are also randomly ordered. The implementor is not so restricted, and many widely used variants of the algorithm do not have this property. Such variants appear to be extremely difficult to analyze. Fortunately (from the point of view of the analyst), empirical studies show that they also perform poorly. Thus, though it has not been analytically quantified, the requirement for randomness preservation seems to produce more elegant and efficient quicksort implementations. More important, the versions that preserve randomness do admit to performance improvements that can be fully quantified mathematically, as described earlier.

Mathematical analysis has played an important role in the development of practical variants of quicksort, and we will see that there is no shortage of other problems to consider where detailed mathematical analysis is an important part of the algorithm design process.

1.6 Asymptotic Approximations. The derivation of the average running time of quicksort given earlier yields an exact result, but we also gave a more concise approximate expression in terms of well-known functions that still can be used to compute accurate numerical estimates. As we will see, it is often the case that an exact result is not available, or at least an approximation is far easier to derive and interpret. Ideally, our goal in the analysis of an algorithm should be to derive exact results; from a pragmatic point of view, it is perhaps more in line with our general goal of being able to make useful performance predictions to strive to derive concise but precise approximate answers.

To do so, we will need to use classical techniques for manipulating such approximations. In Chapter 4, we will examine the Euler-Maclaurin summation formula, which provides a way to estimate sums with integrals. Thus, we can approximate the harmonic numbers by the calculation

$$H_N = \sum_{1\le k\le N}\frac{1}{k} \;\approx\; \int_1^N \frac{dx}{x} \;=\; \ln N.$$
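As a quick numerical check (a sketch, not from the text), the next few lines compute $H_N$ directly and compare it with $\ln N$; the difference does not vanish but settles toward a constant near 0.5772, which the following paragraph identifies.

```python
from math import log

def harmonic(n):
    # H_n = 1 + 1/2 + ... + 1/n, summed directly
    return sum(1.0 / k for k in range(1, n + 1))

for n in (10, 100, 1000, 10000):
    h = harmonic(n)
    print(n, round(h, 6), round(log(n), 6), round(h - log(n), 6))
```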
But we can be much more precise about the meaning of $\approx$, and we can conclude (for example) that $H_N = \ln N + \gamma + 1/(2N) + O(1/N^2)$ where $\gamma = .57721\cdots$ is a constant known in analysis as Euler's constant. Though the constants implicit in the $O$-notation are not specified, this formula provides a way to estimate the value of $H_N$ with improving accuracy as $N$ increases. Moreover, if we want even better accuracy, we can derive a formula for $H_N$ that is accurate to within $O(N^{-3})$ or indeed to within $O(N^{-k})$ for any constant $k$. Such approximations, called asymptotic expansions, are at the heart of the analysis of algorithms, and are the subject of Chapter 4.

The use of asymptotic expansions may be viewed as a compromise between the ideal goal of providing an exact result and the practical requirement of providing a concise approximation. It turns out that we are normally in the situation of, on the one hand, having the ability to derive a more accurate expression if desired, but, on the other hand, not having the desire, because expansions with only a few terms (like the one for $H_N$ above) allow us to compute answers to within several decimal places. We typically drop back to using the $\approx$ notation to summarize results without naming irrational constants, as, for example, in Theorem 1.3.

Moreover, exact results and asymptotic approximations are both subject to inaccuracies inherent in the probabilistic model (usually an idealization of reality) and to stochastic fluctuations. Table 1.1 shows exact, approximate, and empirical values for the number of compares used by quicksort on random files of various sizes. The exact and approximate values are computed from the formulae given in Theorem 1.3; the "empirical" is a measured average, taken over 100 files consisting of random positive integers less than $10^6$; this tests not only the asymptotic approximation that we have discussed, but also the "approximation" inherent in our use of the random permutation model, ignoring equal keys. The analysis of quicksort when equal keys are present is treated in Sedgewick [28].

Exercise 1.20 How many keys in a file of $10^4$ random integers less than $10^6$ are likely to be equal to some other key in the file? Run simulations, or do a mathematical analysis (with the help of a system for mathematical calculations), or do both.

Exercise 1.21 Experiment with files consisting of random positive integers less than $M$ for $M$ = 10,000, 1000, 100, and other values. Compare the performance of quicksort on such files with its performance on random permutations of the same size. Characterize situations where the random permutation model is inaccurate.
Exercise 1.22 Discuss the idea of having a table similar to Table 1.1 for mergesort.

In the theory of algorithms, O-notation is used to suppress detail of all sorts: the statement that mergesort requires $O(N\log N)$ compares hides everything but the most fundamental characteristics of the algorithm, implementation, and computer. In the analysis of algorithms, asymptotic expansions provide us with a controlled way to suppress irrelevant details, while preserving the most important information, especially the constant factors involved. The most powerful and general analytic tools produce asymptotic expansions directly, thus often providing simple direct derivations of concise but accurate expressions describing properties of algorithms. We are sometimes able to use asymptotic estimates to provide more accurate descriptions of program performance than might otherwise be available.

  file size    exact solution    approximate    empirical

     10,000           175,771        175,746      176,354
     20,000           379,250        379,219      374,746
     30,000           593,188        593,157      583,473
     40,000           813,921        813,890      794,560
     50,000         1,039,713      1,039,677    1,010,657
     60,000         1,269,564      1,269,492    1,231,246
     70,000         1,502,729      1,502,655    1,451,576
     80,000         1,738,777      1,738,685    1,672,616
     90,000         1,977,300      1,977,221    1,901,726
    100,000         2,218,033      2,217,985    2,126,160

Table 1.1  Average number of compares used by quicksort
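To get a sense of how the "empirical" column of a table like Table 1.1 might be produced (and as a starting point for experiments like those in Exercises 1.20 and 1.21), here is a small simulation sketch. It is not from the text: the function names are mine, the partitioning code is one standard textbook-style scheme, and exact compare counts depend on such implementation details, so the numbers should be expected to track the table only roughly.

```python
import random

def quicksort_compares(a):
    """Sort list a in place and return the number of key compares performed.

    Uses plain quicksort with the first element as the partitioning element
    and two inward scans; this is one standard scheme, chosen for clarity.
    """
    compares = 0

    def sort(lo, hi):
        nonlocal compares
        if hi <= lo:
            return
        v = a[lo]                      # partitioning element
        i, j = lo, hi + 1
        while True:
            i += 1
            while i <= hi:             # scan right for an element >= v
                compares += 1
                if a[i] >= v:
                    break
                i += 1
            j -= 1
            while j > lo:              # scan left for an element <= v
                compares += 1
                if a[j] <= v:
                    break
                j -= 1
            if i >= j:
                break
            a[i], a[j] = a[j], a[i]
        a[lo], a[j] = a[j], a[lo]      # put the partitioning element into position
        sort(lo, j - 1)
        sort(j + 1, hi)

    sort(0, len(a) - 1)
    return compares

def average_compares(n, m, trials):
    # average compare count over `trials` files of n random integers less than m
    total = 0
    for _ in range(trials):
        total += quicksort_compares([random.randrange(m) for _ in range(n)])
    return total / trials

if __name__ == "__main__":
    for n in (10_000, 20_000):
        print(n, average_compares(n, 10**6, trials=20))
```

Increasing the number of trials and the file sizes smooths out the stochastic fluctuations mentioned above, at the cost of longer running times.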
1.7 Distributions. In general, probability theory tells us that other facts about the distribution $\Pi_{Nk}$ of costs are also relevant to our understanding of performance characteristics of an algorithm. Fortunately, for virtually all of the examples that we study in the analysis of algorithms, it turns out that knowing an asymptotic estimate for the average is enough to be able to make reliable predictions. We review a few basic ideas here. Readers not familiar with probability theory are referred to any standard text, for example [9].

The full distribution for the number of compares used by quicksort for small $N$ is shown in Figure 1.2. For each value of $N$, the points $C_{Nk}/N!$ are plotted: the proportion of the inputs for which quicksort uses $k$ compares. Each curve, being a full probability distribution, has area 1. The curves move to the right, since the average $2N\ln N + O(N)$ increases with $N$. A slightly different view of the same data is shown in Figure 1.3, where the horizontal axes for each curve are scaled to put the mean approximately at the center and shifted slightly to separate the curves. This illustrates that the distribution converges to a "limiting distribution."

For many of the problems that we study in this book, not only do limiting distributions like this exist, but also we are able to precisely characterize them. For many other problems, including quicksort, that is a significant challenge. However, it is very clear that the distribution is concentrated near

Figure 1.2  Distributions for compares in quicksort, 15 ≤ N ≤ 50
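Curves like those in Figure 1.2 can be reproduced numerically from the random permutation model itself: partitioning a file of size N costs N + 1 compares (as in the recurrences above) and leaves two independent random subfiles of sizes j − 1 and N − j, with each j from 1 to N equally likely. The sketch below is not from the text; the function name and the small-file convention (files of size 0 or 1 cost no compares) are my assumptions. It builds the full distribution of compare counts for modest N by convolving subfile distributions, giving the probabilities $C_{Nk}/N!$ that are plotted.

```python
from collections import defaultdict

def quicksort_compare_distribution(max_n):
    """dist[n][k] = probability that quicksort uses exactly k compares on a
    random permutation of size n, under the model: partitioning costs n + 1
    compares and leaves independent random subfiles of sizes j-1 and n-j,
    each j in 1..n equally likely; files of size 0 or 1 are assumed to cost
    no compares."""
    dist = [defaultdict(float) for _ in range(max_n + 1)]
    dist[0][0] = 1.0
    if max_n >= 1:
        dist[1][0] = 1.0
    for n in range(2, max_n + 1):
        for j in range(1, n + 1):          # rank of the partitioning element
            left, right = dist[j - 1], dist[n - j]
            for a, pa in left.items():
                for b, pb in right.items():
                    dist[n][n + 1 + a + b] += pa * pb / n
    return dist

if __name__ == "__main__":
    d = quicksort_compare_distribution(15)
    mean = sum(k * p for k, p in d[15].items())
    print("mean compares for N = 15:", round(mean, 3))
    for k in sorted(d[15])[:5]:            # a few points of the distribution
        print(k, round(d[15][k], 6))
```

Summing k times the probability over the computed distribution recovers the average number of compares, and plotting the distributions for several values of N shows the drift of the curves to the right as N grows.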