Heaps' Law (Fig 5.1, p. 81)
▪ For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit.
▪ Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
▪ Good empirical fit for Reuters RCV1!
▪ For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
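To make the fit concrete, here is a minimal Python sketch (not part of the slides) that plugs the fitted parameters into Heaps' law M = k·T^b for the first 1,000,020 tokens:

```python
# Heaps' law with the parameters fitted for Reuters RCV1 (k = 10**1.64 ≈ 44, b = 0.49).
k, b = 44, 0.49
T = 1_000_020                       # tokens considered

M_predicted = k * T**b
print(f"predicted vocabulary size ≈ {M_predicted:,.0f}")   # ≈ 38,322
# The slides report a prediction of 38,323 (the difference is rounding of k),
# versus 38,365 terms actually observed.
```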
Exercises
▪ Compute the vocabulary size M for this scenario:
▪ Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
▪ Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average.
▪ What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
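One way to work the exercise, as a minimal Python sketch: assume Heaps' law M = k·T^b and solve for k and b from the two given data points (the intermediate values in the comments are derived here, not stated in the slides):

```python
import math

# Two observations: (tokens T, vocabulary size M)
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

# Heaps' law M = k * T**b gives two equations in k and b.
b = math.log(M2 / M1) / math.log(T2 / T1)   # log(10) / log(100) = 0.5
k = M1 / T1**b                              # 3000 / 10000**0.5 = 30

# Total tokens in the indexed collection: 2e10 pages * 200 tokens per page.
T = 2e10 * 200                              # 4e12 tokens
M = k * T**b                                # 30 * 2e6 = 6e7

print(f"b = {b:.2f}, k = {k:.0f}, predicted vocabulary ≈ {M:,.0f}")   # ≈ 60,000,000
```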
Zipf's law
▪ Heaps' law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent terms and very many very rare terms.
▪ Zipf's law: the i-th most frequent term has frequency proportional to 1/i.
▪ cf_i ∝ 1/i, i.e., cf_i = K/i where K is a normalizing constant.
▪ cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
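As a small illustration (not from the slides), the following Python sketch computes collection frequencies with a Counter, ranks the terms, and compares the observed counts with the Zipf prediction K/i, taking K to be the count of the most frequent term; the token list is a made-up toy example:

```python
from collections import Counter

# Toy token stream standing in for a whole collection.
tokens = "the of and the a the of in the and to the of a the in and of".split()

cf = Counter(tokens)              # collection frequency cf_i of each term
ranked = cf.most_common()         # terms sorted by decreasing cf_i
K = ranked[0][1]                  # cf of the most frequent term

for i, (term, count) in enumerate(ranked, start=1):
    print(f"rank {i}: {term!r:6} observed cf = {count}, Zipf predicts K/i ≈ {K / i:.1f}")
```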
Zipf consequences
▪ If the most frequent term (the) occurs cf_1 times,
▪ then the second most frequent term (of) occurs cf_1/2 times,
▪ the third most frequent term (and) occurs cf_1/3 times, …
▪ Equivalent: cf_i = K/i where K is a normalizing factor, so
▪ log cf_i = log K − log i
▪ Linear relationship between log cf_i and log i
▪ Another power-law relationship
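Since log cf_i = log K − log i is a line of slope −1 in log-log space, the law can be checked by fitting a straight line to log rank vs. log frequency. A minimal sketch, assuming NumPy and some hypothetical collection-frequency counts (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical collection frequencies, sorted by decreasing frequency (rank 1 first).
cf = np.array([1_000_000, 480_000, 320_000, 260_000, 195_000, 170_000])
ranks = np.arange(1, len(cf) + 1)

# Fit log10 cf_i = log10 K + slope * log10 i; Zipf's law predicts slope ≈ -1.
slope, log_K = np.polyfit(np.log10(ranks), np.log10(cf), deg=1)

print(f"fitted slope = {slope:.2f} (Zipf predicts -1)")
print(f"fitted K ≈ {10**log_K:,.0f} (roughly the cf of the most frequent term)")
```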
Zipf's law for Reuters RCV1 (figure: log10 collection frequency vs. log10 rank)