Term Vocabulary and Postings Lists
Document Delineation
Complications: Format/language
▪ Documents being indexed can include docs from many different languages
  ▪ A single index may have to contain terms of several languages.
▪ Sometimes a document or its components can contain multiple languages/formats
  ▪ French email with a German PDF attachment.
▪ What is a unit document?
  ▪ A file?
  ▪ An email? (Perhaps one of many in an mbox.)
  ▪ An email with 5 attachments?
  ▪ A group of files (PPT or LaTeX as HTML pages)
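One way to make the unit-document question concrete is to treat each leaf part of an email (the body and each attachment) as its own indexable unit. The sketch below uses Python's standard `email` package. The helper names `build_example_message` and `unit_documents`, and the choice of one unit per MIME part, are assumptions for illustration and not something the slides prescribe.

```python
from email.message import EmailMessage


def build_example_message():
    """Build a toy message: a French body plus one (fake) PDF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Rapport"
    msg.set_content("Bonjour, voir le document joint.")
    # Placeholder bytes stand in for a real German-language PDF.
    msg.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="bericht.pdf",
    )
    return msg


def unit_documents(msg):
    """Yield (content_type, content) for every non-container MIME part."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart wrapper itself
        yield part.get_content_type(), part.get_content()


if __name__ == "__main__":
    for content_type, content in unit_documents(build_example_message()):
        print(content_type, type(content).__name__)
```

Indexing the whole message as a single document would instead combine these parts into one unit; which granularity is preferable depends on what a user expects a search hit to be.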
Vocabulary of Terms
TOKENS AND TERMS
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
  ▪ Friends
  ▪ Romans
  ▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after further processing
  ▪ Described below
▪ But what are valid tokens to emit?
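The input/output pair above can be reproduced with a very small tokenizer. This is a minimal sketch, assuming that a token is a maximal run of ASCII letters and that punctuation is simply discarded; the function name `tokenize` is illustrative. Unlike the slide's output list, it also emits "and", because it does no further filtering.

```python
import re


def tokenize(text):
    """Return maximal runs of ASCII letters; everything else is a separator."""
    return re.findall(r"[A-Za-z]+", text)


if __name__ == "__main__":
    print(tokenize("Friends, Romans and Countrymen"))
```

Each returned string is only a candidate index entry; later processing decides which candidates become terms.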
Tokenization
▪ Issues in tokenization:
  ▪ Finland’s capital → Finland? Finlands? Finland’s?
  ▪ Hewlett-Packard → Hewlett and Packard as two tokens?
    ▪ state-of-the-art: break up hyphenated sequence.
    ▪ co-education
    ▪ lowercase, lower-case, lower case?
    ▪ It can be effective to get the user to put in possible hyphens
  ▪ San Francisco: one token or two?
    ▪ How do you decide it is one token?
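The apostrophe and hyphen questions above correspond to different splitting rules. The sketch below contrasts two such rules and adds a helper that enumerates the hyphen variants of a term. All three function names are hypothetical, the patterns cover ASCII only, and neither rule is presented as the correct answer.

```python
import re


def split_on_punctuation(text):
    """Every run of letters or digits is a token; punctuation separates."""
    return re.findall(r"[A-Za-z0-9]+", text)


def keep_internal_marks(text):
    """Keep apostrophes and hyphens that sit between letters or digits."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


def hyphen_variants(term):
    """Return the joined, hyphenated, and spaced forms of a hyphenated term."""
    parts = term.split("-")
    return {"".join(parts), "-".join(parts), " ".join(parts)}


if __name__ == "__main__":
    for sample in ["Finland's capital", "Hewlett-Packard", "state-of-the-art"]:
        print(sample, split_on_punctuation(sample), keep_internal_marks(sample))
    print(hyphen_variants("lower-case"))
```

Both rules still split "San Francisco" into two tokens; treating it as one would need extra knowledge, such as a list of known multiword names, which this sketch does not attempt.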
Numbers
▪ 3/20/91    Mar. 12, 1991    20/3/91
▪ 55 B.C.
▪ B-52
▪ My PGP key is 324a3df234cb23e
▪ (800) 234-2333
▪ Often have embedded spaces
▪ Older IR systems may not index numbers
  ▪ But often very useful: think about things like looking up error codes/stacktraces on the web
  ▪ (One answer is using n-grams: Lecture 2.2)
▪ Will often index “meta-data” separately
  ▪ Creation date, format, etc.
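To see why numbers are awkward, compare a naive splitter with one that tries a few number-like shapes first. This is a sketch under stated assumptions: the three patterns (US-style phone number, slash date, letter-hyphen-digits designation) are hand-picked from the examples above, the names `naive_tokens`, `NUMBER_LIKE`, and `number_aware_tokens` are hypothetical, and forms such as "Mar. 12, 1991" or "55 B.C." are still fragmented.

```python
import re


def naive_tokens(text):
    """Split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


# Alternatives are tried left to right at each position, so the specific
# number-like shapes take priority over the plain alphanumeric fallback.
NUMBER_LIKE = re.compile(
    r"""
      \(\d{3}\)\s*\d{3}-\d{4}    # phone number such as (800) 234-2333
    | \d{1,2}/\d{1,2}/\d{2,4}    # slash date such as 3/20/91 or 20/3/91
    | [A-Za-z]+-\d+              # designation such as B-52
    | [A-Za-z0-9]+               # fallback: plain alphanumeric run
    """,
    re.VERBOSE,
)


def number_aware_tokens(text):
    """Tokenize while keeping the number-like shapes above in one piece."""
    return NUMBER_LIKE.findall(text)


if __name__ == "__main__":
    for sample in ["(800) 234-2333", "3/20/91", "B-52", "Mar. 12, 1991"]:
        print(sample, naive_tokens(sample), number_aware_tokens(sample))
```

Keeping a phone number or date whole makes exact lookups possible, at the cost of maintaining a pattern list that will never cover every format.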