Shanghai Jiao Tong University: Mining Massive Datasets, course teaching resources (PPT lecture slides), Lecture 06: Search Engines

▪ Architecture of Search Engines ▪ Index Construction ▪ Boolean Retrieval ▪ Vector Space Model for Ranked Retrieval

Document contents (about 85 pages)

Architecture: Web Crawler
▪ Starts with a set of seeds, which are a set of URLs given to it as parameters
▪ Seeds are added to a URL request queue
▪ Crawler starts fetching pages from the request queue
▪ Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
▪ New URLs are added to the crawler’s request queue, or frontier
▪ Continues until there are no more new URLs or the disk is full
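
The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the names `LinkExtractor`, `crawl`, and `TOY_WEB` are hypothetical, the `fetch` callable stands in for a real HTTP client, and a `max_pages` limit stands in for the "disk full" stopping condition.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)   # the URL request queue (frontier)
    seen = set(seeds)         # URLs already queued, to avoid refetching
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be downloaded
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


# Hypothetical three-page "web" used instead of real HTTP requests.
TOY_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}

order = crawl(["http://example.com/"], TOY_WEB.get)
print(order)
```

A FIFO queue gives breadth-first order; a production crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.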

Architecture: Crawling picture
[Figure: the Web divided into seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web]

Architecture: Crawling the Web
[Figure: a crawler fetching pages such as /index.html and /news.html from several example web sites, including www.whitehouse.gov]

Architecture: Text Transformation
▪ Parser
  ▪ Processing the sequence of text tokens in the document to recognize structural elements
    ▪ e.g., titles, links, headings, etc.
  ▪ Tokenizer recognizes “words” in the text
    ▪ must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  ▪ Markup languages such as HTML, XML often used to specify structure
    ▪ Tags used to specify document elements
      ▪ E.g., <h2> Overview </h2>
    ▪ Document parser uses syntax of markup language (or other formatting) to identify structure
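
As a rough sketch of the parser and tokenizer roles above, the following Python uses the standard `html.parser` module to tag each run of text with the structural element it appears in. The names `tokenize`, `StructureParser`, and `FIELDS`, and the specific tokenization policy (lowercase everything, keep internal hyphens and straight apostrophes), are assumptions chosen for illustration; real engines make these choices differently.

```python
import re
from html.parser import HTMLParser

# Assumed policy: lowercase, keep hyphens/apostrophes inside a word,
# treat every other non-alphanumeric character as a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(text):
    """Return the list of word tokens found in `text`."""
    return TOKEN_RE.findall(text.lower())


class StructureParser(HTMLParser):
    """Records (element, tokens) pairs for a few structural tags."""

    FIELDS = {"title", "h1", "h2", "h3", "a"}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open structural tags
        self.fields = []   # list of (element name, token list)

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        tokens = tokenize(data)
        if tokens:
            field = self.stack[-1] if self.stack else "body"
            self.fields.append((field, tokens))


doc_parser = StructureParser()
doc_parser.feed("<h2> Overview </h2><p>State-of-the-art crawlers don't stop.</p>")
doc_parser.close()
print(doc_parser.fields)
```

Keeping the element name next to the tokens lets a later indexing stage weight heading or title terms differently from body text.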

Architecture: Text Transformation
▪ Stopping
  ▪ Remove common words (stop words)
    ▪ e.g., “and”, “or”, “the”, “in”
  ▪ Some impact on efficiency and effectiveness
  ▪ Can be a problem for some queries
▪ Stemming
  ▪ Group words derived from a common stem
    ▪ e.g., “computer”, “computers”, “computing”, “compute”
  ▪ Usually effective, but not for all queries
  ▪ Benefits vary for different languages
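
Stopping and stemming can be illustrated with a short Python sketch. The stop list below contains only the four words given on the slide, and `toy_stem` is a deliberately crude suffix stripper written for this example; it is not the Porter stemmer or any standard algorithm, and the suffix list and minimum stem length are assumptions.

```python
import re

# Stop words taken from the slide's examples only.
STOP_WORDS = {"and", "or", "the", "in"}

# Hypothetical suffix list, longest first so "ers" wins over "s".
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


stems = [toy_stem(w) for w in ("computer", "computers", "computing", "compute")]
print(stems)
print(preprocess("Computers and computing in the lab"))
print(preprocess("The The"))
```

All four example words map to the same stem, so a query for any of them matches documents containing the others. The last call shows why stopping can be a problem for some queries: a query made up entirely of stop words is reduced to nothing.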
