[Figure: sequential crawler loop — initialize frontier with seed URLs; dequeue URL from frontier; fetch page; extract URLs and add to frontier; store data in repository; repeat until done (stop criterion)]
Slides © 2007 Filippo Menczer, Indiana University School of Informatics
Bing Liu: Web Data Mining. Springer, 2007. Ch. 8 Web Crawling by Filippo Menczer

Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by frontier data structure
• Stop criterion can be anything
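The crawl loop described above can be sketched in Python. This is an illustrative sketch, not the slides' actual code: the dictionary `WEB` stands in for real fetching and link extraction, which the slides leave abstract.

```python
from collections import deque

# Toy in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch and parse pages over HTTP instead.
WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

def crawl(seeds, max_pages=10):
    """Sequential crawler: dequeue a URL, 'fetch' it, enqueue its links."""
    frontier = deque(seeds)      # the frontier data structure (a FIFO queue)
    visited = []                 # stands in for the page repository
    while frontier and len(visited) < max_pages:   # stop criterion
        url = frontier.popleft()                   # dequeue next URL
        if url in visited:                         # skip already-fetched pages
            continue
        page_links = WEB.get(url, [])              # "fetch" the page
        visited.append(url)                        # store it
        frontier.extend(page_links)                # extract URLs, add to frontier
    return visited

print(crawl(["a"]))  # ['a', 'b', 'c', 'd', 'e'] — breadth-first from seed "a"
```

Because the frontier here is a FIFO queue, pages come out in breadth-first order; swapping the frontier data structure changes the visit order, as the next slide discusses.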
Graph traversal (BFS or DFS?)
• Breadth First Search
  – Implemented with QUEUE (FIFO)
  – Finds pages along shortest paths
  – If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth First Search
  – Implemented with STACK (LIFO)
  – Wander away (“lost in cyberspace”)
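The queue-versus-stack contrast above can be seen concretely on a tiny graph: the only difference between the two traversals is which end of the frontier the next node is taken from. The graph `GRAPH` below is made up for illustration.

```python
from collections import deque

# A small link graph: "root" links to "a" and "b", which link deeper.
GRAPH = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
         "a1": [], "a2": [], "b1": []}

def traverse(start, order="bfs"):
    frontier = deque([start])
    visited = []
    while frontier:
        # BFS takes from the front (FIFO queue); DFS from the back (LIFO stack).
        node = frontier.popleft() if order == "bfs" else frontier.pop()
        if node in visited:
            continue
        visited.append(node)
        frontier.extend(GRAPH.get(node, []))
    return visited

print(traverse("root", "bfs"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1'] — layer by layer
print(traverse("root", "dfs"))  # ['root', 'b', 'b1', 'a', 'a2', 'a1'] — dives deep first
```

BFS stays near the seed (good when the seeds are good pages), while DFS follows one branch to its end before backtracking, which is how a crawler "wanders away".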
A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
  my $next_link = shift @frontier;
  my $page = fetch($next_link);
  add_to_index($page);
  my @links = extract_links($page, $next_link);
  push @frontier, process(@links);
}
Implementation issues
• Don’t want to fetch same page twice!
  – Keep lookup table (hash) of visited pages
  – What if not visited but in frontier already?
• The frontier grows very fast!
  – May need to prioritize for large crawls
• Fetcher must be robust!
  – Don’t crash if download fails
  – Timeout mechanism
• Determine file type to skip unwanted files
  – Can try using extensions, but not reliable
  – Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests
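The "not visited but in frontier already" question above has a common answer: record a URL as seen when it is *enqueued*, not when it is fetched. The sketch below (class and method names are chosen here, not from the slides) shows how a single set then covers both visited pages and pages already waiting in the frontier, which also keeps duplicates out of a fast-growing frontier.

```python
from collections import deque

class Frontier:
    """Frontier that never holds or re-admits a URL it has already seen.

    Membership is checked at enqueue time rather than dequeue time, so
    the frontier itself stays duplicate-free.
    """
    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()        # visited OR already in frontier
        for url in seeds:
            self.enqueue(url)

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def dequeue(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)

f = Frontier(["a", "b", "a"])   # duplicate seed "a" is dropped
f.enqueue("b")                   # already in frontier: dropped
print(list(f.queue))             # ['a', 'b']
```

The trade-off is memory: `seen` grows with every URL ever discovered, which is one reason large crawls need more elaborate (e.g. disk-backed or fingerprint-based) lookup tables than a plain in-memory hash.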
More implementation issues
• Fetching
  – Get only the first 10-100 KB per page
  – Take care to detect and break redirection loops
  – Soft fail for timeout, server not responding, file not found, and other errors
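The fetching rules above can be sketched with Python's standard `urllib`: a truncated read enforces the size cap, a timeout bounds slow servers, and a broad except turns every failure into a soft one. This is a minimal sketch, not production fetching code.

```python
import urllib.request

MAX_BYTES = 100 * 1024   # fetch at most the first 100 KB per page

def fetch(url, timeout=5):
    """Fetch up to MAX_BYTES of a page; return None on any error (soft fail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(MAX_BYTES)   # truncated read: first 100 KB only
    except Exception:                      # timeout, DNS failure, 404, bad URL...
        return None                        # soft fail: the crawler keeps going

# A malformed URL must not crash the crawler -- it just yields None.
print(fetch("not-a-valid-url"))  # None
```

`urllib`'s default redirect handler also caps the number of redirects it will follow, which breaks redirection loops; a crawler rolling its own HTTP layer has to enforce such a limit itself.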