Term Vocabulary and Postings Lists: Vocabulary of Terms

Normalization: other languages
▪ Accents: e.g., French résumé vs. resume.
▪ Umlauts: e.g., German Tuebingen vs. Tübingen
  ▪ Should be equivalent
▪ Most important criterion:
  ▪ How are your users likely to write their queries for these words?
▪ Even in languages that standardly have accents, users often may not type them
  ▪ Often best to normalize to a de-accented term
  ▪ Tuebingen, Tübingen, Tubingen → Tubingen
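The de-accenting rule above can be sketched in Python. This is a minimal illustration, not a production normalizer: the function names `strip_accents` and `normalize_term` and the `GERMAN_DIGRAPHS` table are hypothetical, and the digraph rule is a deliberately crude assumption that will also rewrite unrelated words containing "ue", "ae", or "oe".

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFD splits "é" into "e" + U+0301 and "ü" into "u" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical table for the German transcription convention (ü written as "ue").
# Assumption: applied blindly to lowercase digraphs, so it over-normalizes
# words such as "blue"; a real system would restrict it by language or lexicon.
GERMAN_DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def normalize_term(token: str, german_digraphs: bool = False) -> str:
    """Map accented and transcribed variants of a token to one de-accented form."""
    term = strip_accents(token)
    if german_digraphs:
        for digraph, plain in GERMAN_DIGRAPHS.items():
            term = term.replace(digraph, plain)
    return term


if __name__ == "__main__":
    for word in ["résumé", "Tuebingen", "Tübingen", "Tubingen"]:
        print(word, "->", normalize_term(word, german_digraphs=True))
```

With the digraph rule enabled, all three spellings of the city name collapse to the same term, which is the equivalence the slide asks for. Whether to enable such a rule depends on how users are likely to type their queries.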
Normalization: other languages
▪ Normalization of things like date forms
  ▪ 7月30日 vs. 7/30
  ▪ Japanese use of kana vs. Chinese characters
▪ Tokenization and normalization may depend on the language and so is intertwined with language detection
  ▪ Example: "Morgen will ich in MIT …" Is this German “mit”?
▪ Crucial: Need to “normalize” indexed text as well as query terms into the same form
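The last point, applying one normalization to both indexed text and queries, can be sketched as follows. This is a toy example under stated assumptions: `normalize_dates`, `build_index`, and `search` are hypothetical names, the canonical form `MM-DD` is an arbitrary choice, and "7/30" is assumed to mean month/day, which is not true in every locale.

```python
import re

# Assumed date patterns: "7月30日" (month 月 day 日) and "7/30" (month/day).
_KANJI_DATE = re.compile(r"(\d{1,2})月(\d{1,2})日")
_SLASH_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})\b")


def _canonical(match: re.Match) -> str:
    # Arbitrary canonical form chosen for this sketch: zero-padded MM-DD.
    return f"{int(match.group(1)):02d}-{int(match.group(2)):02d}"


def normalize_dates(text: str) -> str:
    """Rewrite both date forms into the same canonical token."""
    text = _KANJI_DATE.sub(_canonical, text)
    return _SLASH_DATE.sub(_canonical, text)


def build_index(docs: dict, normalize) -> dict:
    """Build a term -> set of doc ids map from normalized, whitespace-split text."""
    index = {}
    for doc_id, text in docs.items():
        for token in normalize(text).split():
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index: dict, query: str, normalize) -> set:
    """Normalize the query with the SAME function, then intersect postings."""
    terms = normalize(query).split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {1: "meeting 7月30日", 2: "meeting 7/30", 3: "meeting 8/01"}
    index = build_index(docs, normalize_dates)
    print(search(index, "7/30", normalize_dates))
    print(search(index, "7月30日", normalize_dates))
```

Because the same `normalize_dates` function runs at index time and at query time, either date form retrieves documents written in the other. If only one side were normalized, the query term and the indexed term would never match.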