Heaps' Law (Fig 5.1, p. 81)
▪ For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit.
▪ Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
▪ Good empirical fit for Reuters RCV1!
▪ For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
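To make the fit concrete, here is a minimal Python sketch (not part of the slides) that plugs the fitted parameters into Heaps' law M = k·T^b for the first 1,000,020 tokens:

```python
# Heaps' law with the parameters fitted for Reuters RCV1 (k = 10**1.64 ≈ 44, b = 0.49).
k, b = 44, 0.49
T = 1_000_020                       # tokens considered

M_predicted = k * T**b
print(f"predicted vocabulary size ≈ {M_predicted:,.0f}")   # ≈ 38,322
# The slides report a prediction of 38,323 (the difference is rounding of k),
# versus 38,365 terms actually observed.
```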
Exercises
▪ Compute the vocabulary size M for this scenario:
▪ Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
▪ Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average.
▪ What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
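One way to work the exercise, as a minimal Python sketch: assume Heaps' law M = k·T^b and solve for k and b from the two given data points (the intermediate values in the comments are derived here, not stated in the slides):

```python
import math

# Two observations: (tokens T, vocabulary size M)
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

# Heaps' law M = k * T**b gives two equations in k and b.
b = math.log(M2 / M1) / math.log(T2 / T1)   # log(10) / log(100) = 0.5
k = M1 / T1**b                              # 3000 / 10000**0.5 = 30

# Total tokens in the indexed collection: 2e10 pages * 200 tokens per page.
T = 2e10 * 200                              # 4e12 tokens
M = k * T**b                                # 30 * 2e6 = 6e7

print(f"b = {b:.2f}, k = {k:.0f}, predicted vocabulary ≈ {M:,.0f}")   # ≈ 60,000,000
```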
Zipf's law
▪ Heaps' law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent terms and very many very rare terms.
▪ Zipf's law: the i-th most frequent term has frequency proportional to 1/i.
▪ cf_i ∝ 1/i, i.e., cf_i = K/i where K is a normalizing constant.
▪ cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
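As a small illustration (not from the slides), the following Python sketch computes collection frequencies with a Counter, ranks the terms, and compares the observed counts with the Zipf prediction K/i, taking K to be the count of the most frequent term; the token list is a made-up toy example:

```python
from collections import Counter

# Toy token stream standing in for a whole collection.
tokens = "the of and the a the of in the and to the of a the in and of".split()

cf = Counter(tokens)              # collection frequency cf_i of each term
ranked = cf.most_common()         # terms sorted by decreasing cf_i
K = ranked[0][1]                  # cf of the most frequent term

for i, (term, count) in enumerate(ranked, start=1):
    print(f"rank {i}: {term!r:6} observed cf = {count}, Zipf predicts K/i ≈ {K / i:.1f}")
```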
Zipf consequences
▪ If the most frequent term (the) occurs cf_1 times,
▪ then the second most frequent term (of) occurs cf_1/2 times,
▪ the third most frequent term (and) occurs cf_1/3 times, …
▪ Equivalent: cf_i = K/i where K is a normalizing factor, so
▪ log cf_i = log K − log i
▪ Linear relationship between log cf_i and log i
▪ Another power-law relationship
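Since log cf_i = log K − log i is a line of slope −1 in log-log space, the law can be checked by fitting a straight line to log rank vs. log frequency. A minimal sketch, assuming NumPy and some hypothetical collection-frequency counts (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical collection frequencies, sorted by decreasing frequency (rank 1 first).
cf = np.array([1_000_000, 480_000, 320_000, 260_000, 195_000, 170_000])
ranks = np.arange(1, len(cf) + 1)

# Fit log10 cf_i = log10 K + slope * log10 i; Zipf's law predicts slope ≈ -1.
slope, log_K = np.polyfit(np.log10(ranks), np.log10(cf), deg=1)

print(f"fitted slope = {slope:.2f} (Zipf predicts -1)")
print(f"fitted K ≈ {10**log_K:,.0f} (roughly the cf of the most frequent term)")
```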
Zipf's law for Reuters RCV1 (figure: log10 collection frequency vs. log10 rank)