Stemming
◼ The next task is stemming: transforming words to their root form
◼ Computing, Computer, Computation → comput
◼ Suffix-based methods
◼ Remove “ability” from “computability”
◼ “…”+ness, “…”+ive → remove
◼ Suffix list + context rules (see the sketch below)
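A minimal sketch of the suffix-list idea, assuming a hand-picked suffix list and one simple context rule (only strip when enough of the word remains). This is only an illustration, not a full stemmer such as Porter's.

```python
# Sketch of suffix-list stemming with one context rule.
# The suffix list and the minimum-stem-length rule are illustrative assumptions.

SUFFIXES = ["ability", "ation", "ness", "ive", "ing", "er", "es", "s"]

def stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix, but only if enough of the word remains."""
    w = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)]
    return w

if __name__ == "__main__":
    for w in ["Computing", "Computer", "Computation", "computability"]:
        print(w, "->", stem(w))   # all map to "comput"
```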
Thesaurus Rules
◼ A thesaurus aims at
  ◼ classification of words in a language
  ◼ for a word, it gives related terms that are broader than, narrower than, same as (synonyms) and opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed of)
◼ Static Thesaurus Tables (see the sketch below)
  ◼ [anneal, strain], [antenna, receiver], …
  ◼ Roget’s thesaurus
  ◼ WordNet at Princeton
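A static thesaurus table can be sketched as a small lookup structure keyed by the relationship types listed above. The entry words below are illustrative placeholders, not taken from Roget's or WordNet.

```python
# Sketch of a static thesaurus table; entries are illustrative placeholders only.
from collections import defaultdict

class Thesaurus:
    """Maps a word to related terms, grouped by relationship type."""

    RELATIONS = ("broader", "narrower", "synonym", "antonym")

    def __init__(self):
        self._table = defaultdict(lambda: {r: set() for r in self.RELATIONS})

    def add(self, word: str, relation: str, term: str) -> None:
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._table[word][relation].add(term)

    def related(self, word: str, relation: str) -> set:
        return self._table[word][relation]

t = Thesaurus()
t.add("antenna", "broader", "receiver")   # in the spirit of [antenna, receiver]
t.add("car", "synonym", "automobile")
print(t.related("antenna", "broader"))    # {'receiver'}
```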
Thesaurus Rules Can Also Be Learned
◼ From a search engine query log
◼ After typing queries, browse…
◼ If query1 and query2 lead to the same document
  ◼ Then, Similar(query1, query2)
◼ If query1 leads to a document with title keyword K
  ◼ Then, Similar(query1, K)
◼ Then, transitivity… (see the sketch below)
◼ Microsoft Research China’s work in WWW10 (Wen et al.) on Encarta online
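The two log-based rules above can be sketched as follows. The log format (query, clicked document, title keywords) and the sample entries are simplifying assumptions for illustration, not the actual method or data of Wen et al.

```python
# Sketch of learning query similarity from a click log.
# Log rows (query, clicked_doc, title_keywords) are an assumed, simplified format.
from collections import defaultdict
from itertools import combinations

click_log = [
    ("laser printer", "doc7", {"printer", "hardware"}),
    ("inkjet",        "doc7", {"printer", "hardware"}),
    ("dbms",          "doc3", {"database", "systems"}),
]

similar_pairs = set()

# Rule 1: queries that lead to the same document are similar.
queries_by_doc = defaultdict(set)
for query, doc, _ in click_log:
    queries_by_doc[doc].add(query)
for queries in queries_by_doc.values():
    for q1, q2 in combinations(sorted(queries), 2):
        similar_pairs.add((q1, q2))

# Rule 2: a query is similar to title keywords of documents it leads to.
for query, _, keywords in click_log:
    for k in keywords:
        similar_pairs.add((query, k))

# Transitivity over these pairs could then expand the set further.
print(similar_pairs)
```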
The Vector-Space Model
◼ T distinct terms are available; call them index terms or the vocabulary
◼ The index terms represent important terms for an application → a vector to represent the document (see the sketch below)
◼ <T1,T2,T3,T4,T5> or <W(T1),W(T2),W(T3),W(T4),W(T5)>
◼ Example: for a computer science collection, the index terms (the vocabulary of the collection) might be T1=architecture, T2=bus, T3=computer, T4=database, T5=xml
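A minimal sketch of representing a document over the five-term vocabulary above, assuming simple term-frequency weights W(Ti); the sample text is invented.

```python
# Sketch: a document as a vector of term weights over the example vocabulary.
# Weight W(Ti) here is just the raw count of term Ti; the sample text is invented.

vocabulary = ["architecture", "bus", "computer", "database", "xml"]

def to_vector(text: str) -> list[int]:
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

doc = "computer architecture and bus design for database servers"
print(to_vector(doc))   # [1, 1, 1, 1, 0]
```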
The Vector-Space Model
◼ Assumptions: words are uncorrelated

         T1   T2   …   Tt
    D1  d11  d12   …  d1t
    D2  d21  d22   …  d2t
     :    :    :        :
    Dn  dn1  dn2   …  dnt

Given:
1. n documents and a query
2. The query is considered a document too
3. Each is represented by t terms
4. Each term j in document i has weight dij; the query has weights Q = (q1, q2, …, qt)
5. We will deal with how to compute the weights later (a sketch with placeholder weights follows)
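The matrix above can be sketched directly: n rows (documents), t columns (terms), entry dij the weight of term j in document i, and the query represented the same way. How real weights are computed is deferred, so raw counts stand in here as placeholders; the documents and terms are invented for illustration.

```python
# Sketch of the n x t term-document matrix d[i][j] and a query vector q,
# both over the same t index terms. Weights are placeholder raw counts.

terms = ["architecture", "bus", "computer", "database", "xml"]

def weights(text: str) -> list[float]:
    tokens = text.lower().split()
    return [float(tokens.count(t)) for t in terms]

documents = [
    "computer architecture and the system bus",
    "xml database systems",
    "database query processing on a database computer",
]

d = [weights(doc) for doc in documents]   # d[i][j]: weight of term j in document i
q = weights("computer database")          # the query is treated as a document too

for i, row in enumerate(d, start=1):
    print(f"D{i}:", row)
print("Q :", q)
```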