Search Engines
Architecture: Text Transformation
▪ Link Analysis
  ▪ Makes use of links and anchor text in web pages
  ▪ Link analysis identifies popularity and community information
    ▪ e.g., PageRank
  ▪ Anchor text can significantly enhance the representation of pages pointed to by links
  ▪ Significant impact on web search
  ▪ Less importance in other applications
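
To make the popularity idea concrete, here is a minimal sketch of PageRank computed by power iteration. The graph `toy_links`, the damping value 0.85, and the fixed iteration count are illustrative assumptions, not taken from the slides, and a production system would work on a far larger, sparse link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank along every out-link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph: page "c" receives the most in-links.
toy_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_links)
```

In this toy graph, page "c" ends up with the highest score because three pages link to it, while "d", which nothing links to, keeps only the baseline share.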
Architecture: Text Transformation
▪ Information Extraction
  ▪ Identify classes of index terms that are important for some applications
  ▪ e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
▪ Classifier
  ▪ Identifies class-related metadata for documents
  ▪ i.e., assigns labels to documents
  ▪ e.g., topics, reading levels, sentiment, genre
  ▪ Use depends on application
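
As a rough illustration of what a named entity recognizer produces, the sketch below tags dates with a regular expression and locations and companies with a lookup table. The `GAZETTEER` entries, the ISO date pattern, and the function name `tag_entities` are all hypothetical; real recognizers are typically trained statistical models rather than hand-written rules.

```python
import re

# Hypothetical lookup table mapping known phrases to entity classes.
GAZETTEER = {
    "london": "LOCATION",
    "paris": "LOCATION",
    "acme corp": "COMPANY",
}
# Assumes dates are written as YYYY-MM-DD; other formats are not recognized.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def tag_entities(text):
    """Return (start_offset, surface_text, class) tuples sorted by offset."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    for phrase, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.group(), label))
    return sorted(found)


entities = tag_entities("Acme Corp opened a London office on 2021-03-15.")
```

The resulting class labels could then be stored as extra index terms alongside the ordinary words.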
Architecture: Index Creation
▪ Document Statistics
  ▪ Gathers counts and positions of words and other features
  ▪ Used in ranking algorithm
▪ Weighting
  ▪ Computes weights for index terms
  ▪ Used in ranking algorithm
  ▪ e.g., tf.idf weight
    ▪ Combination of term frequency in document and inverse document frequency in the collection
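
The tf.idf combination can be sketched as follows. This uses one common variant, raw term frequency multiplied by log(N / df), which is an assumption since the slides do not fix a formula. The three-document collection `docs` and whitespace tokenization are also illustrative.

```python
import math
from collections import Counter

# Hypothetical three-document collection keyed by document id.
docs = {
    1: "tropical fish tank",
    2: "tropical fish food",
    3: "fish and chips",
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using tf * log(N / df)."""
    n_docs = len(docs)
    # Document statistics: per-document term counts.
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        d: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }


weights = tf_idf(docs)
```

Under this variant a term that occurs in every document, such as "fish" here, gets weight zero, while a term confined to one document, such as "tank", gets the largest weight.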
Architecture: Index Creation
▪ Inversion
  ▪ Core of indexing process
  ▪ Converts document-term information to term-document for indexing
  ▪ Difficult for very large numbers of documents
  ▪ Format of inverted file is designed for fast query processing
  ▪ Must also handle updates
  ▪ Compression used for efficiency
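
A minimal in-memory sketch of inversion, assuming whitespace tokenization and integer document ids. The names `invert` and `to_gaps` are hypothetical, and real indexers build the inverted file on disk in batches rather than in a single dictionary. The gap encoding shown is only the first step of typical postings compression.

```python
from collections import defaultdict


def invert(docs):
    """Turn {doc_id: text} into {term: [(doc_id, [positions])]}, sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)


def to_gaps(doc_ids):
    """Delta-encode a sorted list of doc ids; small gaps need fewer bits to store."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


# Hypothetical collection: document-term view in, term-document view out.
collection = {1: "tropical fish tank", 2: "tropical fish food", 3: "fish and chips"}
index = invert(collection)
fish_doc_ids = [doc_id for doc_id, _ in index["fish"]]
```

Because each postings list is sorted by document id, a query only has to read the lists for its own terms, which is what makes the term-document layout fast to process.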
Architecture: Index Creation
▪ Index Distribution
  ▪ Distributes indexes across multiple computers and/or multiple sites
  ▪ Essential for fast query processing with large numbers of documents
  ▪ Many variations
    ▪ Document distribution, term distribution, replication
  ▪ P2P and distributed IR involve search across multiple sites
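
Document distribution can be sketched by giving each shard its own small index, sending the query to every shard, and merging the partial results. Everything here is an illustrative assumption: the shards are plain in-process dictionaries rather than separate machines, documents are assigned by `doc_id % n_shards`, and the score is a simple count of query-term occurrences standing in for a real ranking function.

```python
import heapq
from collections import Counter


def build_shards(docs, n_shards):
    """Assign each document (integer id) to one shard and store its term counts."""
    shards = [dict() for _ in range(n_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % n_shards][doc_id] = Counter(text.lower().split())
    return shards


def search_shard(shard, query_terms, k):
    """Score only this shard's documents and return its local top k."""
    scored = ((sum(counts[t] for t in query_terms), doc_id)
              for doc_id, counts in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def search(shards, query, k=3):
    """Broadcast the query to all shards, then merge the local top-k lists."""
    terms = query.lower().split()
    partial = [hit for shard in shards for hit in search_shard(shard, terms, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


# Hypothetical four-document collection split over two shards.
docs = {0: "fish fish tank", 1: "tropical fish", 2: "bird food", 3: "fish food"}
shards = build_shards(docs, n_shards=2)
results = search(shards, "fish")
```

Term distribution would instead place whole postings lists for different terms on different machines, and replication would keep copies of the same shard on several machines to share query load.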