Web Search Overview & Crawling

Documents vs. Database Records
▪ Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
  ▪ e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
▪ Easy to compare fields with well-defined semantics to queries in order to find matches
▪ Text is more difficult

Documents vs. Records
▪ Example bank database query
  ▪ Find records with balance > $50,000 in branches located in Amherst, MA.
  ▪ Matches easily found by comparison with field values of records
▪ Example search engine query
  ▪ bank scandals in western mass
  ▪ This text must be compared to the text of entire news stories
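
The bank database query above can be sketched as a plain comparison against field values. This is only an illustration: the `BranchAccount` schema, its field names, and the sample records are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class BranchAccount:
    # Hypothetical schema, invented for illustration only.
    account_number: str
    balance: float
    branch_city: str
    branch_state: str


def find_matches(records, min_balance, city, state):
    """Return the records whose fields satisfy every query condition."""
    return [
        r for r in records
        if r.balance > min_balance
        and r.branch_city == city
        and r.branch_state == state
    ]


# Made-up sample data.
records = [
    BranchAccount("A-001", 72_500.00, "Amherst", "MA"),
    BranchAccount("A-002", 12_000.00, "Amherst", "MA"),
    BranchAccount("A-003", 98_000.00, "Boston", "MA"),
]

# "Find records with balance > $50,000 in branches located in Amherst, MA."
matches = find_matches(records, 50_000, "Amherst", "MA")
print([r.account_number for r in matches])
```

Each condition maps to exactly one field with known semantics, so a record either matches or it does not. The search engine query has no such fields to compare against, only free text.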

Comparing Text
▪ Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
▪ Exact matching of words is not enough
  ▪ Many different ways to write the same thing in a “natural language” like English
  ▪ e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
  ▪ Some stories will be better matches than others
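
A minimal sketch of why exact word matching falls short, using the query and story text from the slides. The `tokenize` and `exact_overlap` helpers are hypothetical names introduced here for illustration; real systems use far richer text processing and ranking.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def exact_overlap(query, document):
    """Return the set of query words that appear verbatim in the document."""
    return set(tokenize(query)) & set(tokenize(document))


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

shared = exact_overlap(query, story)
print(sorted(shared))  # → ['bank', 'in']
```

Only "bank" and the function word "in" overlap. The story never uses the words "scandals", "western", or "mass", even though a reader would likely judge it relevant to the query, which is why a notion of better and worse matches is needed rather than a yes/no word comparison.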

Dimensions of IR
▪ IR is more than just text, and more than just web search
  ▪ although these are central
▪ People doing IR work with different media, different types of search applications, and different tasks

Other Media
▪ New applications increasingly involve new media
  ▪ e.g., video, photos, music, speech
▪ Like text, content is difficult to describe and compare
  ▪ Text may be used to represent them (e.g., tags)
▪ IR approaches to search and evaluation are appropriate