11.Implement a program to generate letter-level Markov text. 12.How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov)random text? 13.The program for generating word-level Markov text is at this book's web site.Try it on some of your documents. 14.How could you use hashing to speed up the Markov program? 15.Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text;implement that algorithm in a program.It gives a good approximation to the Markov frequencies,but not the exact form;explain why not.Implement a program that scans the entire string from scratch to generate each word (and thereby uses the true frequencies). 16.How would you use the techniques of this column to assemble a word list for a dictionary (the problem that Doug Mcllroy faced in Section 13.8)?How would you build a spelling checker without using a dictionary?How would you build a grammar checker without using rules of grammar? 17.Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and data compression. Next:Section 15.6.Further Reading. Copyright 1999 Lucent Technologies.All rights reserved.Wed 18 Oct 2000
11. Implement a program to generate letter-level Markov text. 12. How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov) random text? 13. The program for generating word-level Markov text is at this book's web site. Try it on some of your documents. 14. How could you use hashing to speed up the Markov program? 15. Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text; implement that algorithm in a program. It gives a good approximation to the Markov frequencies, but not the exact form; explain why not. Implement a program that scans the entire string from scratch to generate each word (and thereby uses the true frequencies). 16. How would you use the techniques of this column to assemble a word list for a dictionary (the problem that Doug McIlroy faced in Section 13.8)? How would you build a spelling checker without using a dictionary? How would you build a grammar checker without using rules of grammar? 17. Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and data compression. Next: Section 15.6. Further Reading. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000
Solutions (To Column 1 of Programming Programming Pearls) Pearls 1.This C program uses the Standard Library gsort to sort a file of integers. int intcomp(int *x,int *y) Second Edition { return *x-*y; nta[1000000]; int main(void) int i,n=0; Jon Bentley while(scanf("号d",&a[n])!=EoF) n++ qsort (a,n,sizeof(int),intcomp); for (i=0;i n;i++) printf("d\n",a[i]); return 0; This C++program uses the set container from the Standard Template Library for the same job. int main(void) { set S; int i; set::iterator j; while (cin >i) S.insert(i); for (j=S.begin();j!=S.end();++j) cout<<*j<<"\n": return 0; Solution 3 sketches the performance of both programs. 2.These functions use the constants to set,clear and test the value of a bit: #define BITSPERWORD 32 #define SHIFT 5 #define MASK 0x1F #define N 10000000 int a [1 N/BITSPERWORD]; void set(int i){ a[i>>SHIFT]=(1<<(i&MASK));)
Solutions (To Column 1 of Programming Pearls) 1. This C program uses the Standard Library qsort to sort a file of integers. int intcomp(int *x, int *y) { return *x - *y; } int a[1000000]; int main(void) { int i, n=0; while (scanf("%d", &a[n]) != EOF) n++; qsort(a, n, sizeof(int), intcomp); for (i = 0; i < n; i++) printf("%d\n", a[i]); return 0; } This C++ program uses the set container from the Standard Template Library for the same job. int main(void) { set S; int i; set::iterator j; while (cin >> i) S.insert(i); for (j = S.begin(); j != S.end(); ++j) cout << *j << "\n"; return 0; } Solution 3 sketches the performance of both programs. 2. These functions use the constants to set, clear and test the value of a bit: #define BITSPERWORD 32 #define SHIFT 5 #define MASK 0x1F #define N 10000000 int a[1 + N/BITSPERWORD]; void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i){ a[i>>SHIFT]&=~(1<<(i&MASK));) int test(int i){return a[i>>SHIFT](1<<(i&MASK)); 3.This C code implements the sorting algorithm,using the functions defined in Solution 2 int main(void) { int i; for (i=0;i<N;i++) clr(i); while (scanf("d",&i)!=EOF) set(i); for (i=0;i<N;i++) if (test(i)) printf("号d\n",i); return 0; I used the program in Solution 4 to generate a file of one million distinct positive integers,each less than ten million.This table reports the cost of sorting them with the system command-line sort,the C++and C programs in Solution 1,and the bitmap code: System Sort C++/STL C/qsort C/bitmaps Total Secs 89 38 12.610.7 Compute Secs 79 28 2.4 .5 Megabytes.8 70 4 1.25 The first line reports the total time,and the second line subtracts out the 10.2 seconds of input/output required to read and write the files.Even though the general C++program uses 50 times the memory and CPU time of the specialized C program,it requires just half the code and is much easier to extend to other problems. 4.See Column 12,especially Problem 12.8.This code assumes that randintl,u)returns a random integer in 1..u. for i [0,n) x[i]i for i=[0,k) swap(i,randint (i,n-1)) print x[i] The swap function exchanges the two elements in x.The randint function is discussed in detail in Section 12.1. 5.Representing all ten million numbers with a bitmap requires that many bits,or 1.25 million bytes.Employing the fact that no phone numbers begin with the digits zero or one reduces the memory to exactly one million bytes. Alternatively,a two-pass algorithm first sorts the integers 0 through 4,999,999 using 5,000,000/8=625,000 words of storage,then sorts 5,000,000 through 9,999,999 in a second pass.A k-pass algorithm sorts at most n nonrepeated positive integers less than n in time kn and space n/k. 6.If each integer appears at most ten times,then we can count its occurrences in a four-bit half-byte (or nybble)
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); } int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); } 3. This C code implements the sorting algorithm, using the functions defined in Solution 2. int main(void) { int i; for (i = 0; i < N; i++) clr(i); while (scanf("%d", &i) != EOF) set(i); for (i = 0; i < N; i++) if (test(i)) printf("%d\n", i); return 0; } I used the program in Solution 4 to generate a file of one million distinct positive integers, each less than ten million. This table reports the cost of sorting them with the system command-line sort, the C++ and C programs in Solution 1, and the bitmap code: System Sort C++/STL C/qsort C/bitmaps Total Secs 89 38 12.6 10.7 Compute Secs 79 28 2.4 .5 Megabytes .8 70 4 1.25 The first line reports the total time, and the second line subtracts out the 10.2 seconds of input/output required to read and write the files. Even though the general C++ program uses 50 times the memory and CPU time of the specialized C program, it requires just half the code and is much easier to extend to other problems. 4. See Column 12, especially Problem 12.8. This code assumes that randint(l, u) returns a random integer in l..u. for i = [0, n) x[i] = i for i = [0, k) swap(i, randint(i, n-1)) print x[i] The swap function exchanges the two elements in x. The randint function is discussed in detail in Section 12.1. 5. Representing all ten million numbers with a bitmap requires that many bits, or 1.25 million bytes. Employing the fact that no phone numbers begin with the digits zero or one reduces the memory to exactly one million bytes. Alternatively, a two-pass algorithm first sorts the integers 0 through 4,999,999 using 5,000,000/8 = 625,000 words of storage, then sorts 5,000,000 through 9,999,999 in a second pass. A k-pass algorithm sorts at most n nonrepeated positive integers less than n in time kn and space n/k. 6. If each integer appears at most ten times, then we can count its occurrences in a four-bit half-byte (or nybble)
Using the solution to Problem 5,we can sort the complete file in a single pass with 10,000,000/2 bytes,or in k passes with 10,000,000/2k bytes. 9.The effect of initializing the vector data[0..n-1]can be accomplished with a signature contained in two additional n-element vectors,from and to,and an integer top.If the element data[i]has been initialized,then from[i]top and to[from[i]]=i.Thus from is a simple signature,and to and top together make sure that from is not accidentally signed by the random contents of memory.Blank entries of data are uninitialized in this picture: data. 3 2 8 from: to top The variable top is initially zero;the array element i is first accessed by the code from[i]top to[top]i data[i]=0 top++ This problem and solution are from Exercise 2.12 of Aho,Hopcroft and Ullman's Design and Analysis of Computer Algorithms,published by Addison-Wesley in 1974.It combines key indexing and a wily signature scheme.It can be used for matrices as well as vectors. 10.The store placed the paper order forms in a 10x10 array of bins,using the last two digits of the customer's phone number as the hash index.When the customer telephoned an order,it was placed in the proper bin.When the customer arrived to retrieve the merchandise,the salesperson sequentially searched through the orders in the appropriate bin--this is classical'open hashing with collision resolution by sequential search".The last two digits of the phone number are quite close to random and therefore an excellent hash function,while the first two digits would be a horrible hash function--why?Some municipalities use a similar scheme to record deeds in sets of record books. 11.The computers at the two facilities were linked by microwave,but printing the drawings at the test base would have required a printer that was very expensive at the time.The team therefore drew the pictures at the main plant, photographed them,and sent 35mm film to the test station by carrier pigeon,where it was enlarged and printed photographically.The pigeon's 45-minute flight took half the time of the car,and cost only a few dollars per day. During the 16 months of the project the pigeons transmitted several hundred rolls of film,and only two were lost (hawks inhabit the area;no classified data was carried).Because of the low price of modern printers,a current solution to the problem would probably use the microwave link. 12.According to the urban legend,the Soviets solved their problem with a pencil,of course.For background on the
Using the solution to Problem 5, we can sort the complete file in a single pass with 10,000,000/2 bytes, or in k passes with 10,000,000/2k bytes. 9. The effect of initializing the vector data[0..n-1] can be accomplished with a signature contained in two additional n-element vectors, from and to, and an integer top. If the element data[i] has been initialized, then from[i] < top and to[from[i]]=i. Thus from is a simple signature, and to and top together make sure that from is not accidentally signed by the random contents of memory. Blank entries of data are uninitialized in this picture: The variable top is initially zero; the array element i is first accessed by the code from[i] = top to[top] = i data[i] = 0 top++ This problem and solution are from Exercise 2.12 of Aho, Hopcroft and Ullman's Design and Analysis of Computer Algorithms, published by Addison-Wesley in 1974. It combines key indexing and a wily signature scheme. It can be used for matrices as well as vectors. 10. The store placed the paper order forms in a 10x10 array of bins, using the last two digits of the customer's phone number as the hash index. When the customer telephoned an order, it was placed in the proper bin. When the customer arrived to retrieve the merchandise, the salesperson sequentially searched through the orders in the appropriate bin -- this is classical ``open hashing with collision resolution by sequential search''. The last two digits of the phone number are quite close to random and therefore an excellent hash function, while the first two digits would be a horrible hash function -- why? Some municipalities use a similar scheme to record deeds in sets of record books. 11. The computers at the two facilities were linked by microwave, but printing the drawings at the test base would have required a printer that was very expensive at the time. The team therefore drew the pictures at the main plant, photographed them, and sent 35mm film to the test station by carrier pigeon, where it was enlarged and printed photographically. The pigeon's 45-minute flight took half the time of the car, and cost only a few dollars per day. During the 16 months of the project the pigeons transmitted several hundred rolls of film, and only two were lost (hawks inhabit the area; no classified data was carried). Because of the low price of modern printers, a current solution to the problem would probably use the microwave link. 12. According to the urban legend, the Soviets solved their problem with a pencil, of course. For background on the
true story,learn about the Fisher Space Pen company.The company was founded in 1948,and its writing instruments have been used by the Russian Space Agency,underwater explorers,and Himalayan climbing expeditions. Copyright 1999 Lucent Technologies.All rights reserved.Thu 23 Sep 1999
true story, learn about the Fisher Space Pen company. The company was founded in 1948, and its writing instruments have been used by the Russian Space Agency, underwater explorers, and Himalayan climbing expeditions. Copyright © 1999 Lucent Technologies. All rights reserved. Thu 23 Sep 1999