Server traps
• Guard against system failures
  - Pathological HTML files
    • e.g., some pages contain 68 kB of null characters
  - Web sites that mislead the crawler
    • CGI programs that generate an unlimited number of pages
    • Very deep paths created with soft directory links
      - www.troutbums.com/Flyfactory/hatchline/hatchline/hatchline/flyfactory/flyfactory/flyfactory/flyfactory/flyfactory/flyfactory/flyfactory/flyfactory/hatchline
    • Path remapping features in HTTP servers
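The trap cases on this slide can be screened with a few cheap checks before and after a fetch. The following is a minimal sketch, not a complete defense: the function names and the three thresholds (`MAX_DEPTH`, `MAX_REPEATS`, `MAX_PAGE_BYTES`) are illustrative assumptions, not values from the slides.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed thresholds; tune them for the crawl at hand.
MAX_DEPTH = 8               # maximum number of path segments
MAX_REPEATS = 3             # maximum times one segment may recur in a path
MAX_PAGE_BYTES = 1_000_000  # maximum accepted page size


def looks_like_trap(url: str) -> bool:
    """Return True if the URL path is suspiciously deep or repetitive."""
    path = urlsplit(url).path
    segments = [s.lower() for s in path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    counts = Counter(segments)
    return any(n > MAX_REPEATS for n in counts.values())


def is_pathological(body: bytes) -> bool:
    """Return True if the page is oversized or contains null bytes."""
    return len(body) > MAX_PAGE_BYTES or b"\x00" in body
```

A crawler would call `looks_like_trap` before queuing a URL and `is_pathological` before handing a downloaded page to the parser. Limits like these can also reject legitimate deep sites, so they are a trade-off rather than a guarantee.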
A Web crawler needs to be:
• Fast
  - Bottleneck? Network utilization
• Scalable
  - Parallel, distributed
• Polite
  - No DoS (Denial of Service attack) on servers; honor robots.txt
• Robust
  - Traps, errors, crash recovery
• Continuous
  - Batch or incremental
• Fresh
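The politeness requirement can be sketched as a small gate that checks robots.txt rules and enforces a minimum delay per host. This is an illustrative sketch: the class `PolitenessGate`, its method names, and the `min_delay` default are assumptions; only `urllib.robotparser.RobotFileParser` is a standard-library API. Hosts whose robots.txt has not been loaded are treated as allowed here, which is a simplification.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: robots.txt checks plus a per-host delay."""

    def __init__(self, user_agent, min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay   # assumed seconds between hits on one host
        self.clock = clock           # injectable so the delay logic is testable
        self.rules = {}              # host -> parsed robots.txt
        self.last_hit = {}           # host -> time of the last request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already downloaded for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """Return True if the loaded rules permit fetching this URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's host."""
        last = self.last_hit.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, url):
        """Remember when this URL's host was last contacted."""
        self.last_hit[urlsplit(url).netloc] = self.clock()
```

A fetcher thread would call `allowed`, sleep for `wait_time`, download the page, and then call `record_fetch`.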
High Performance Web Crawler
System block diagram
[Figure: a high-level view of a web crawler. Components: Frontier (requests URLs), PreProcessor, Fetcher (downloads the document), Extractor (finds URLs in the document), PostProcessor, Writer (adds new URLs to the database), and a URL database that is read and written.]
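The Frontier, Fetcher, Extractor, Writer loop in the diagram can be sketched in a few lines. This is a single-threaded illustration under stated assumptions: `crawl`, `LinkExtractor`, and the `max_pages` limit are hypothetical names, the `fetch` callable is supplied by the caller (so no network access is needed to run it), and an in-memory set stands in for the URL database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Extractor: collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Run the loop and return the URLs in the order they were fetched."""
    frontier = deque(seeds)   # Frontier: URLs waiting to be requested
    known = set(seeds)        # stands in for the URL database
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        page = fetch(url)     # Fetcher: download the document
        if page is None:      # fetch failed or the page does not exist
            continue
        fetched.append(url)
        extractor = LinkExtractor()
        extractor.feed(page)  # Extractor: find URLs in the document
        for href in extractor.links:
            link = urljoin(url, href)
            if link not in known:   # Writer: add only new URLs
                known.add(link)
                frontier.append(link)
    return fetched
```

Because `fetch` is injected, the same loop can be driven by a dictionary of canned pages in tests or by a real HTTP client in a crawl.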
One architecture of a large-scale Web crawler
[Figure: the slide labels four stages (Frontier, Fetcher, Extractor, Writer) over these components: caching DNS; async UDP DNS prefetch client; DNS resolver client (UDP); DNS cache (slack about expiry dates); per-server queues; page fetching context/thread (wait for DNS, wait until HTTP socket available, HTTP send and receive); isPageKnown?; hyperlink extractor and normalizer; isUrlVisited?; URL approval guard; crawl meta-data; load monitor and work-thread manager; persistent global work pool of URLs; text repository and index; text indexing and other analyses.]
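Three of the components named in this architecture (the URL normalizer, the isUrlVisited? check, and the per-server queues) can be illustrated together. The sketch below is an in-memory simplification under stated assumptions: `normalize` and `PerServerFrontier` are hypothetical names, the normalization rules are a small subset of what a real crawler applies, and a production work pool would be persistent rather than held in a Python set.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


class PerServerFrontier:
    """Hypothetical frontier: one FIFO queue per host, served round-robin."""

    def __init__(self):
        self.visited = set()   # answers isUrlVisited?
        self.queues = {}       # host -> deque of pending URLs
        self.hosts = deque()   # hosts that currently have pending URLs

    def add(self, url):
        """Queue a URL; return False if it was already seen."""
        url = normalize(url)
        if url in self.visited:
            return False
        self.visited.add(url)
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            self.hosts.append(host)
        self.queues[host].append(url)
        return True

    def next_url(self):
        """Return the next URL, rotating across hosts, or None if empty."""
        if not self.hosts:
            return None
        host = self.hosts.popleft()
        queue = self.queues[host]
        url = queue.popleft()
        if queue:
            self.hosts.append(host)   # host still has work: back of the line
        else:
            del self.queues[host]
        return url
```

Rotating across hosts spreads consecutive requests over different servers, which supports politeness and keeps one slow server from stalling the whole crawl.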