Search Engines: Architecture

Web Crawler
▪ Starts with a set of seeds, which are a set of URLs given to it as parameters
▪ Seeds are added to a URL request queue
▪ Crawler starts fetching pages from the request queue
▪ Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
▪ New URLs are added to the crawler’s request queue, or frontier
▪ Continues until there are no more new URLs or the disk is full
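The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `crawl` function, the `LinkExtractor` class, the `max_pages` limit (standing in for "disk full"), and the in-memory `FAKE_WEB` dictionary are all assumptions made for the example, and no real HTTP requests are issued.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Crawl from the seeds; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)  # the URL request queue (frontier)
    seen = set(seeds)        # URLs already queued, so none is fetched twice
    pages = {}               # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:     # page could not be downloaded
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Hypothetical three-page "web" used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.test/index.html":
        '<a href="/news.html">News</a> <a href="about.html">About</a>',
    "http://example.test/news.html": '<a href="/index.html">Home</a>',
    "http://example.test/about.html": "<p>No links here.</p>",
}

crawled = crawl(["http://example.test/index.html"], FAKE_WEB.get)
print(sorted(crawled))
```

Starting from the single seed, the loop reaches all three pages and then stops because the frontier is empty. A first-in, first-out queue gives a breadth-first crawl; other orderings of the frontier are possible.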
Crawling Picture
[Figure: the Web divided into seed pages, URLs crawled and parsed, the URL frontier, and the unseen Web]
Crawling the Web
[Figure: a crawler fetching pages such as /index.html, /news.html, and /about.html from several web servers, e.g. www.cs.umass.edu, www.bbc.co.uk, www.whitehouse.gov]
Text Transformation
▪ Parser
  ▪ Processes the sequence of text tokens in the document to recognize structural elements
    ▪ e.g., titles, links, headings, etc.
  ▪ Tokenizer recognizes “words” in the text
    ▪ Must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  ▪ Markup languages such as HTML, XML often used to specify structure
    ▪ Tags used to specify document elements
      ▪ E.g., <h2> Overview </h2>
    ▪ Document parser uses syntax of markup language (or other formatting) to identify structure
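As a rough sketch of how a tokenizer and a markup-aware parser might fit together, the example below lowercases text, keeps hyphens and apostrophes that sit inside a word, and files each token under the structural element that encloses it. The names `tokenize`, `StructureParser`, and `TRACKED`, the choice of tracked tags, and the sample HTML are assumptions for illustration; real tokenizers make many more decisions than this one.

```python
import re
from html.parser import HTMLParser

# A word is a run of letters/digits, optionally joined by internal
# hyphens or apostrophes (e.g. "state-of-the-art", "don't").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


class StructureParser(HTMLParser):
    """Groups tokens by the innermost tracked tag that encloses them."""

    TRACKED = {"title", "h1", "h2", "h3", "a"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tracked tags currently open
        self.fields = {}     # element name -> list of tokens

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text outside any tracked element is filed under "body".
        field = self.open_tags[-1] if self.open_tags else "body"
        self.fields.setdefault(field, []).extend(tokenize(data))


# Hypothetical sample document.
SAMPLE = (
    "<title>Search Engines</title>"
    "<h2> Overview </h2>"
    "<p>State-of-the-art crawlers don't stop. "
    'See <a href="/more.html">more details</a>.</p>'
)

parser = StructureParser()
parser.feed(SAMPLE)
parser.close()
for field, tokens in parser.fields.items():
    print(field, tokens)
```

Keeping tokens separated by element lets later stages treat, say, title or heading words differently from body text.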
Text Transformation
▪ Stopping
  ▪ Remove common words (stop words)
    ▪ e.g., “and”, “or”, “the”, “in”
  ▪ Some impact on efficiency and effectiveness
  ▪ Can be a problem for some queries
▪ Stemming
  ▪ Group words derived from a common stem
    ▪ e.g., “computer”, “computers”, “computing”, “compute”
  ▪ Usually effective, but not for all queries
  ▪ Benefits vary for different languages
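Stopping and stemming can be illustrated with a toy example. The stop list below contains only the four words from the slide, and `crude_stem` is a deliberately simple suffix stripper written for this sketch; it is not the Porter stemmer or any other established algorithm, and the suffix list and minimum stem length are assumptions.

```python
# The slide's four example stop words; practical lists are much longer.
STOP_WORDS = {"and", "or", "the", "in"}

# Toy suffix rules, tried in order (NOT a real stemming algorithm).
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(tokens):
    """Apply stopping, then stemming, to a list of lowercase tokens."""
    return [crude_stem(t) for t in remove_stop_words(tokens)]


query = "computers and computing in the lab".split()
print(transform(query))  # ['comput', 'comput', 'lab']

# All four of the slide's example words share one stem under these rules.
print({crude_stem(w) for w in ["computer", "computers", "computing", "compute"]})
```

With this stop list, a query made up entirely of stop words comes back empty, which is one way stopping can be a problem for some queries.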