[Figure: sequential crawler loop — initialize frontier with seed URLs; dequeue URL from frontier; fetch page; extract URLs and add to frontier; store data in repository; repeat until done (stop criterion)]
Slides © 2007 Filippo Menczer, Indiana University School of Informatics
Bing Liu: Web Data Mining. Springer, 2007. Ch. 8 Web Crawling by Filippo Menczer

Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by frontier data structure
• Stop criterion can be anything
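The crawl loop described above can be sketched in Python. This is an illustrative sketch, not the slides' actual code: the dictionary `WEB` stands in for real fetching and link extraction, which the slides leave abstract.

```python
from collections import deque

# Toy in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch and parse pages over HTTP instead.
WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

def crawl(seeds, max_pages=10):
    """Sequential crawler: dequeue a URL, 'fetch' it, enqueue its links."""
    frontier = deque(seeds)      # the frontier data structure (a FIFO queue)
    visited = []                 # stands in for the page repository
    while frontier and len(visited) < max_pages:   # stop criterion
        url = frontier.popleft()                   # dequeue next URL
        if url in visited:                         # skip already-fetched pages
            continue
        page_links = WEB.get(url, [])              # "fetch" the page
        visited.append(url)                        # store it
        frontier.extend(page_links)                # extract URLs, add to frontier
    return visited

print(crawl(["a"]))  # ['a', 'b', 'c', 'd', 'e'] — breadth-first from seed "a"
```

Because the frontier here is a FIFO queue, pages come out in breadth-first order; swapping the frontier data structure changes the visit order, as the next slide discusses.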
Graph traversal (BFS or DFS?)
• Breadth First Search
  – Implemented with QUEUE (FIFO)
  – Finds pages along shortest paths
  – If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth First Search
  – Implemented with STACK (LIFO)
  – Wander away (“lost in cyberspace”)
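The queue-versus-stack contrast above can be seen concretely on a tiny graph: the only difference between the two traversals is which end of the frontier the next node is taken from. The graph `GRAPH` below is made up for illustration.

```python
from collections import deque

# A small link graph: "root" links to "a" and "b", which link deeper.
GRAPH = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
         "a1": [], "a2": [], "b1": []}

def traverse(start, order="bfs"):
    frontier = deque([start])
    visited = []
    while frontier:
        # BFS takes from the front (FIFO queue); DFS from the back (LIFO stack).
        node = frontier.popleft() if order == "bfs" else frontier.pop()
        if node in visited:
            continue
        visited.append(node)
        frontier.extend(GRAPH.get(node, []))
    return visited

print(traverse("root", "bfs"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1'] — layer by layer
print(traverse("root", "dfs"))  # ['root', 'b', 'b1', 'a', 'a2', 'a1'] — dives deep first
```

BFS stays near the seed (good when the seeds are good pages), while DFS follows one branch to its end before backtracking, which is how a crawler "wanders away".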
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
  my $next_link = shift @frontier;
  my $page = fetch($next_link);
  add_to_index($page);
  my @links = extract_links($page, $next_link);
  push @frontier, process(@links);
}
Implementation issues
• Don’t want to fetch same page twice!
  – Keep lookup table (hash) of visited pages
  – What if not visited but in frontier already?
• The frontier grows very fast!
  – May need to prioritize for large crawls
• Fetcher must be robust!
  – Don’t crash if download fails
  – Timeout mechanism
• Determine file type to skip unwanted files
  – Can try using extensions, but not reliable
  – Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests
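The "not visited but in frontier already" question above has a common answer: record a URL as seen when it is *enqueued*, not when it is fetched. The sketch below (class and method names are chosen here, not from the slides) shows how a single set then covers both visited pages and pages already waiting in the frontier, which also keeps duplicates out of a fast-growing frontier.

```python
from collections import deque

class Frontier:
    """Frontier that never holds or re-admits a URL it has already seen.

    Membership is checked at enqueue time rather than dequeue time, so
    the frontier itself stays duplicate-free.
    """
    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()        # visited OR already in frontier
        for url in seeds:
            self.enqueue(url)

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def dequeue(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)

f = Frontier(["a", "b", "a"])   # duplicate seed "a" is dropped
f.enqueue("b")                   # already in frontier: dropped
print(list(f.queue))             # ['a', 'b']
```

The trade-off is memory: `seen` grows with every URL ever discovered, which is one reason large crawls need more elaborate (e.g. disk-backed or fingerprint-based) lookup tables than a plain in-memory hash.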
More implementation issues
• Fetching
  – Get only the first 10-100 KB per page
  – Take care to detect and break redirection loops
  – Soft fail for timeout, server not responding, file not found, and other errors
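The fetching rules above can be sketched with Python's standard `urllib`: a truncated read enforces the size cap, a timeout bounds slow servers, and a broad except turns every failure into a soft one. This is a minimal sketch, not production fetching code.

```python
import urllib.request

MAX_BYTES = 100 * 1024   # fetch at most the first 100 KB per page

def fetch(url, timeout=5):
    """Fetch up to MAX_BYTES of a page; return None on any error (soft fail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(MAX_BYTES)   # truncated read: first 100 KB only
    except Exception:                      # timeout, DNS failure, 404, bad URL...
        return None                        # soft fail: the crawler keeps going

# A malformed URL must not crash the crawler -- it just yields None.
print(fetch("not-a-valid-url"))  # None
```

`urllib`'s default redirect handler also caps the number of redirects it will follow, which breaks redirection loops; a crawler rolling its own HTTP layer has to enforce such a limit itself.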