Word Frequencies (Illustrating Section 15.1 of Programming Programming Pearls) Section 15.1 describes word frequencies and gives an example.Here are some more examples,generated by this program from the King James Bible(789,616 words),the Iliad(153,210 words),and Programming Pearls (79,525 words).This table shows the 40 most common words,and their Second Edition percentage in the document. BIBLE号 ILIAD 号 PEARLS the 7.86 the 6.25 the 5.64 Jon Bentley and 4.88 and 4.23 of2.76 of4.35 of3.64 a2.43 t01.69 to2.15 t02.18 And 1.61 his 1.62 and 2.01 that 1.57 he1.60 in1.96 in1.54 in1.42 is1.36 sha111.24 a1.17 that 1.20 he1.20 with 1.04 The 0.91 unto 1.13 that 0.94 for 0.83 I1.10 you 0.91 by0.67 his 1.06 for 0.89 it0.57 a1.01 him 0.84 this 0.56 for 0.90 I0.83 with 0.51 they 0.87 as0.79 an0.50 be0.84 was 0.77 program 0.48 is0.84 they 0.66 on0.48 with 0.75 on0.64 be0.47 not 0.74 from 0.64 are 0.47 a110.66 but 0.58 as0.45 thou 0.59 son 0.58 we0.43 thy 0.56 had 0.58 at0.39 was 0.56 not 0.56 can 0.39 it0.56 their 0.56 time 0.38 which 0.54 it0.56 I0.36 my0.52 is0.50 was 0.35 LORD 0.50 wi110.49 you 0.33 their 0.49 have 0.48 code 0.33 have 0.48 a110.47 from 0.28 wi110.47 by0.47 or0.26 from 0.45 them 0.46 its 0.26 ye0.45 were 0.43 one 0.25 them 0.44 at0.43 This 0.25 as0.41 who 0.42 array 0.25 him 0.40 my0.40 have 0.25 are 0.36 so0.40 two 0.25 upon 0.34 your 0.38 each 0.25 were 0.34 when 0.37 first 0.24 by0.32 be0.36 search 0.24 when 0.31 upon 0.35 function 0.23
Word Frequencies (Illustrating Section 15.1 of Programming Pearls) Section 15.1 describes word frequencies and gives an example. Here are some more examples, generated by this program from the King James Bible (789,616 words), the Iliad (153,210 words), and Programming Pearls (79,525 words). This table shows the 40 most common words, and their percentage in the document. BIBLE % ILIAD % PEARLS % the 7.86 the 6.25 the 5.64 and 4.88 and 4.23 of 2.76 of 4.35 of 3.64 a 2.43 to 1.69 to 2.15 to 2.18 And 1.61 his 1.62 and 2.01 that 1.57 he 1.60 in 1.96 in 1.54 in 1.42 is 1.36 shall 1.24 a 1.17 that 1.20 he 1.20 with 1.04 The 0.91 unto 1.13 that 0.94 for 0.83 I 1.10 you 0.91 by 0.67 his 1.06 for 0.89 it 0.57 a 1.01 him 0.84 this 0.56 for 0.90 I 0.83 with 0.51 they 0.87 as 0.79 an 0.50 be 0.84 was 0.77 program 0.48 is 0.84 they 0.66 on 0.48 with 0.75 on 0.64 be 0.47 not 0.74 from 0.64 are 0.47 all 0.66 but 0.58 as 0.45 thou 0.59 son 0.58 we 0.43 thy 0.56 had 0.58 at 0.39 was 0.56 not 0.56 can 0.39 it 0.56 their 0.56 time 0.38 which 0.54 it 0.56 I 0.36 my 0.52 is 0.50 was 0.35 LORD 0.50 will 0.49 you 0.33 their 0.49 have 0.48 code 0.33 have 0.48 all 0.47 from 0.28 will 0.47 by 0.47 or 0.26 from 0.45 them 0.46 its 0.26 ye 0.45 were 0.43 one 0.25 them 0.44 at 0.43 This 0.25 as 0.41 who 0.42 array 0.25 him 0.40 my 0.40 have 0.25 are 0.36 so 0.40 two 0.25 upon 0.34 your 0.38 each 0.25 were 0.34 when 0.37 first 0.24 by 0.32 be 0.36 search 0.24 when 0.31 upon 0.35 function 0.23
Copyright 1999 Lucent Technologies.All rights reserved.Mon 6 Nov 2000
Copyright © 1999 Lucent Technologies. All rights reserved. Mon 6 Nov 2000
Phrases (Section 15.2 of Programming Programming Pearls) Words are the basic component of documents,and many important problems can be solved by searching for words. Sometimes,however,I search long strings(my documents,or help files,or web pages,or even the entire web)for phrases, Second Edition such as'substring searching"or'implicit data structures". How would you search through a large body of text to find'a phrase of several words"?If you had never seen the body of text before,you would have no choice but to start at the beginning Jon Bentley and scan through the whole input;most algorithms texts describe many approaches to thissubstring searching problem". Suppose,though,that you had a chance to preprocess the body of text before performing searches.You could make a hash table(or search tree)to index every distinct word of the document,and store a list of every occurrence of each word.Such an''inverted index"allows a program to look up a given word quickly.One can look up phrases by intersecting the lists of the words they contain,but this is subtle to implement and potentially slow.(Some web search engines do,however,take exactly this approach.) We'll turn now to a powerful data structure and apply it to a small problem:given an input file of text,find the longest duplicated substring of characters in it.For instance,the longest repeated string in'Ask not what your country can do for you,but what you can do for your country"is'can do for you",with'your country"a close second place.How would you write a program to solve this problem? [Click for more examples of long repeated strings] This problem is reminiscent of the anagram problem that we saw in Section 2.4.If the input string is stored in cf0..n-1/,then we could start by comparing every pair of substrings using pseudocode like this maxlen =-1 for i=[0,n) for j=(i,n) if (thislen comlen(&c[i],&c[j]))>maxlen maxlen thislen maxi =i maxj =j The comlen function returns the length that its two parameter strings have in common,starting with their first characters: int comlen(char *p,char *q)
Phrases (Section 15.2 of Programming Pearls) Words are the basic component of documents, and many important problems can be solved by searching for words. Sometimes, however, I search long strings (my documents, or help files, or web pages, or even the entire web) for phrases, such as ``substring searching'' or ``implicit data structures''. How would you search through a large body of text to find ``a phrase of several words''? If you had never seen the body of text before, you would have no choice but to start at the beginning and scan through the whole input; most algorithms texts describe many approaches to this ``substring searching problem''. Suppose, though, that you had a chance to preprocess the body of text before performing searches. You could make a hash table (or search tree) to index every distinct word of the document, and store a list of every occurrence of each word. Such an ``inverted index'' allows a program to look up a given word quickly. One can look up phrases by intersecting the lists of the words they contain, but this is subtle to implement and potentially slow. (Some web search engines do, however, take exactly this approach.) We'll turn now to a powerful data structure and apply it to a small problem: given an input file of text, find the longest duplicated substring of characters in it. For instance, the longest repeated string in ``Ask not what your country can do for you, but what you can do for your country'' is `` can do for you'', with `` your country'' a close second place. How would you write a program to solve this problem? [Click for more examples of long repeated strings] This problem is reminiscent of the anagram problem that we saw in Section 2.4. If the input string is stored in c[0..n-1], then we could start by comparing every pair of substrings using pseudocode like this maxlen = -1 for i = [0, n) for j = (i, n) if (thislen = comlen(&c[i], &c[j])) > maxlen maxlen = thislen maxi = i maxj = j The comlen function returns the length that its two parameter strings have in common, starting with their first characters: int comlen(char *p, char *q)
1=0 while*p&&(*p++==*q++) i++ return i Because this algorithm looks at all pairs of substrings,it takes time proportional to n2,at least.We might be able to speed it up by using a hash table to search for words in the phrases,but we'll instead take an entirely new approach. Our program will process at most MAXN characters,which it stores in the array c: #define MAXN 5000000 char c[MAXN],*a[MAXN]; We'll use a simple data structure known as a'suffix array";the structure has been used at least since the 1970's, though the term was introduced in the 1990's.The structure is an array a of pointers to characters.As we read the input,we initialize a so that each element points to the corresponding character in the input string: while (ch getchar())!=EOF a[n]=&c[n] c[n++]ch c[n]=0 The final element of c contains a null character,which terminates all strings. The element a/0/points to the entire string;the next element points to the suffix of the array beginning with the second character,and so on.On the input string'banana",the array will represent these suffixes: a[0]:banana a[1]:anana a[2]:nana a[3]:ana a[4]:na a[5]:a The pointers in the array a together point to every suffix in the string,hence the name'suffix array". If a long string occurs twice in the array c,it appears in two different suffixes.We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4).The'banana"array sorts to a[0]:a a[1]:ana a[2]:anana a[3]:banana a[4]:na a[5]:nana
i = 0 while *p && (*p++ == *q++) i++ return i Because this algorithm looks at all pairs of substrings, it takes time proportional to n2, at least. We might be able to speed it up by using a hash table to search for words in the phrases, but we'll instead take an entirely new approach. Our program will process at most MAXN characters, which it stores in the array c: #define MAXN 5000000 char c[MAXN], *a[MAXN]; We'll use a simple data structure known as a ``suffix array''; the structure has been used at least since the 1970's, though the term was introduced in the 1990's. The structure is an array a of pointers to characters. As we read the input, we initialize a so that each element points to the corresponding character in the input string: while (ch = getchar()) != EOF a[n] = &c[n] c[n++] = ch c[n] = 0 The final element of c contains a null character, which terminates all strings. The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the second character, and so on. On the input string ``banana'', the array will represent these suffixes: a[0]: banana a[1]: anana a[2]: nana a[3]: ana a[4]: na a[5]: a The pointers in the array a together point to every suffix in the string, hence the name ``suffix array''. If a long string occurs twice in the array c, it appears in two different suffixes. We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4). The ``banana'' array sorts to a[0]: a a[1]: ana a[2]: anana a[3]: banana a[4]: na a[5]: nana
We can then scan through this array comparing adjacent elements to find the longest repeated string,which in this case is 'ana". We'll sort the suffix array with the gsort function: qsort (a,n,sizeof(char *)pstrcmp) The pstrcmp comparison function adds one level of indirection to the library strcmp function.This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common: for i=[0,n) if comlen(a[i],a[i+1])>maxlen maxlen comlen(a[i],a[i+1]) maxi i printf (".*s\n",maxlen,a[maxi]) The printf statement uses the'*"precision to print maxlen characters of the string. I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's translation of Homer's Iliad.The program took 4.8 seconds to locate this string: whose sake so many of the Achaeans have died at Troy,far from their homes?Go about at once among the host,and speak fairly to them,man by man,that they draw not their ships into the sea. The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks(Achaeans)from departing from Troy;it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses.On this and other typical text files of n characters,the algorithm runs in O(n log n)time,due to sorting. [Click for more examples of long repeated strings] Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. Problem 6 investigates how suffix arrays can be used to solve the substring searching problem.We'll turn now to a more subtle application of suffix arrays. Next:Section 15.3.Generating Random Text. Copyright1999 Lucent Technologies.All rights reserved.Wed 18 Oct 2000
We can then scan through this array comparing adjacent elements to find the longest repeated string, which in this case is ``ana''. We'll sort the suffix array with the qsort function: qsort(a, n, sizeof(char *), pstrcmp) The pstrcmp comparison function adds one level of indirection to the library strcmp function. This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common: for i = [0, n) if comlen(a[i], a[i+1]) > maxlen maxlen = comlen(a[i], a[i+1]) maxi = i printf("%.*s\n", maxlen, a[maxi]) The printf statement uses the ``*'' precision to print maxlen characters of the string. I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's translation of Homer's Iliad. The program took 4.8 seconds to locate this string: whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the host, and speak fairly to them, man by man, that they draw not their ships into the sea. The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses. On this and other typical text files of n characters, the algorithm runs in O(n log n) time, due to sorting. [Click for more examples of long repeated strings] Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. Problem 6 investigates how suffix arrays can be used to solve the substring searching problem. We'll turn now to a more subtle application of suffix arrays. Next: Section 15.3. Generating Random Text. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000