Language Models
Using Language Models in IR
▪ Treat each document as the basis for a model (e.g., unigram sufficient statistics)
▪ Rank document d based on P(d | q)
▪ P(d | q) = P(q | d) × P(d) / P(q)
  ▪ P(q) is the same for all documents, so ignore it
  ▪ P(d) [the prior] is often treated as the same for all d
    ▪ But we could use criteria like authority, length, genre
  ▪ P(q | d) is the probability of q given d’s model
▪ Very general formal approach
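
The ranking argument on this slide can be written out as a short derivation. This is a sketch that assumes a uniform prior P(d); the proportionality signs mean "equal up to a factor that does not change the document ordering", which is notation added here rather than taken from the slide.

```latex
\begin{align*}
P(d \mid q) &= \frac{P(q \mid d)\, P(d)}{P(q)} \\
            &\propto P(q \mid d)\, P(d) && \text{($P(q)$ is the same for every document)} \\
            &\propto P(q \mid d)        && \text{(assuming a uniform prior $P(d)$)}
\end{align*}
```

With a non-uniform prior (authority, length, genre), the second line is the ranking score and the last step no longer applies.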

The fundamental problem of LMs
▪ Usually we don’t know the model M
▪ But have a sample of text representative of that model
▪ Estimate a language model from a sample
▪ Then compute the observation probability
[Figure: P(observation | M(sample)), the probability of an observation under the model M estimated from the sample]
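
The two steps above (estimate a model from a sample, then compute the observation probability) might look like the following. This is a minimal sketch assuming a unigram model with maximum-likelihood estimates and whitespace tokenization; the function names and the toy sample are hypothetical, not from the slides.

```python
from collections import Counter


def estimate_unigram_model(sample_tokens):
    """Maximum-likelihood unigram model: P(w | M) = count(w) / total tokens."""
    counts = Counter(sample_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def observation_probability(observation_tokens, model):
    """P(observation | M) as a product of per-token probabilities.

    Tokens that never appeared in the sample get probability 0,
    because this sketch applies no smoothing.
    """
    prob = 1.0
    for token in observation_tokens:
        prob *= model.get(token, 0.0)
    return prob


# Toy sample (an assumption for illustration only).
sample = "the cat sat on the mat".split()
model = estimate_unigram_model(sample)
p = observation_probability(["the", "cat"], model)
```

Here P(the | M) = 2/6 and P(cat | M) = 1/6, so p is their product, 1/18.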

Query Likelihood Model
Language Models for IR
▪ Language Modeling Approaches
  ▪ Attempt to model query generation process
  ▪ Documents are ranked by the probability that a query would be observed as a random sample from the respective document model
  ▪ Multinomial approach: $P(Q \mid M_D) = \prod_{w} P(w \mid M_D)^{q_w}$, where $q_w$ is the number of times $w$ occurs in $Q$
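
The multinomial query likelihood can be computed in log space to avoid underflow on long queries. This is a sketch under assumptions: the document model is given as a plain dictionary of term probabilities, the exponent is the query term count, and the name `query_log_likelihood` and the toy model are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_tokens, doc_model):
    """log P(Q | M_D) = sum over distinct query terms w of q_w * log P(w | M_D)."""
    query_counts = Counter(query_tokens)  # q_w for each distinct query term
    log_prob = 0.0
    for word, q_w in query_counts.items():
        p_w = doc_model.get(word, 0.0)
        if p_w == 0.0:
            # One unseen term makes the whole product zero.
            return float("-inf")
        log_prob += q_w * math.log(p_w)
    return log_prob


# Hypothetical document model M_D (probabilities sum to 1).
doc_model = {"language": 0.5, "model": 0.25, "retrieval": 0.25}
score = query_log_likelihood(["language", "model", "language"], doc_model)
```

The sketch leaves out the multinomial coefficient. That factor depends only on the query, so it does not change the ranking of documents.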

Retrieval based on probabilistic LM
▪ Treat the generation of queries as a random process.
▪ Approach
  ▪ Infer a language model for each document.
  ▪ Estimate the probability of generating the query according to each of these models.
  ▪ Rank the documents according to these probabilities.
  ▪ Usually a unigram estimate of words is used
    ▪ Some work on bigrams, paralleling van Rijsbergen
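
The three-step approach (infer, estimate, rank) can be sketched end to end. This assumes unsmoothed unigram maximum-likelihood estimates and pre-tokenized documents; all function names, document ids, and texts below are hypothetical.

```python
from collections import Counter


def infer_model(doc_tokens):
    """Step 1: infer a unigram language model for one document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {word: count / total for word, count in counts.items()}


def query_probability(query_tokens, model):
    """Step 2: probability of generating the query from the model."""
    prob = 1.0
    for token in query_tokens:
        prob *= model.get(token, 0.0)  # unseen terms give 0 (no smoothing)
    return prob


def rank_documents(query_tokens, docs):
    """Step 3: rank documents by query probability, highest first.

    docs maps a document id to its list of tokens.
    """
    scores = {
        doc_id: query_probability(query_tokens, infer_model(tokens))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection (an assumption for illustration only).
docs = {
    "d1": "language models rank documents".split(),
    "d2": "documents about cooking and cooking tips".split(),
    "d3": "language language models".split(),
}
ranking = rank_documents(["language", "models"], docs)
```

In this toy collection d3 ranks first (2/3 × 1/3), d1 second (1/4 × 1/4), and d2 scores zero because it contains neither query term.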

Retrieval based on probabilistic LM
▪ Intuition
  ▪ Users …
    ▪ Have a reasonable idea of terms that are likely to occur in documents of interest.
    ▪ They will choose query terms that distinguish these documents from others in the collection.
  ▪ Collection statistics …
    ▪ Are integral parts of the language model.
    ▪ Are not used heuristically as in many other approaches.
      ▪ In theory. In practice, there’s usually some wiggle room for empirically set parameters
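
One common way collection statistics enter the model, with an empirically set parameter, is linear interpolation between the document estimate and the collection estimate (often called Jelinek-Mercer smoothing). The slide does not name a specific method, so treat this as an assumed illustration; the function name, the parameter `lam`, and the toy collection are hypothetical.

```python
from collections import Counter


def smoothed_term_probability(word, doc_counts, doc_length,
                              coll_counts, coll_length, lam=0.5):
    """(1 - lam) * P_ml(w | d) + lam * P(w | collection).

    lam is the empirically set mixing weight (an assumed default here).
    """
    p_doc = doc_counts.get(word, 0) / doc_length if doc_length else 0.0
    p_coll = coll_counts.get(word, 0) / coll_length if coll_length else 0.0
    return (1.0 - lam) * p_doc + lam * p_coll


# Toy collection of two tokenized documents (illustration only).
docs = [["language", "models"], ["retrieval", "models", "models"]]
coll_counts = Counter(token for doc in docs for token in doc)
coll_length = sum(len(doc) for doc in docs)

doc_counts = Counter(docs[0])
p = smoothed_term_probability(
    "retrieval", doc_counts, len(docs[0]), coll_counts, coll_length, lam=0.5
)
```

The term "retrieval" does not occur in the first document, yet it receives a nonzero probability from the collection estimate, so a single missing query term no longer zeroes the document's score.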