Probabilistic Information Retrieval 11 Probability Ranking Principle ▪ More complex case: retrieval costs. ▪ Let d be a document ▪ C – cost of retrieval of a relevant document ▪ C′ – cost of retrieval of a non-relevant document ▪ Probability Ranking Principle: if C·p(R|d) + C′·(1 − p(R|d)) ≤ C·p(R|d′) + C′·(1 − p(R|d′)) for all d′ not yet retrieved, then d is the next document to be retrieved ▪ We won't further consider loss/utility from now on
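The cost-based ranking rule above can be sketched in a few lines: rank each unretrieved document by its expected retrieval cost C·p(R|d) + C′·(1 − p(R|d)) and retrieve the minimum. This is a minimal sketch; the probabilities, cost values, and document ids are invented for illustration.

```python
# Sketch of the cost-based Probability Ranking Principle.
# The p(R|d) estimates and the cost values C, C_prime are hypothetical.

def next_document(candidates, C=-1.0, C_prime=1.0):
    """Return the not-yet-retrieved document with lowest expected cost.

    candidates: dict mapping document id -> estimated p(R|d).
    C: cost of retrieving a relevant document (negative = a benefit).
    C_prime: cost of retrieving a non-relevant document.
    """
    def expected_cost(p):
        # C * p(R|d) + C' * (1 - p(R|d)), as in the rule above
        return C * p + C_prime * (1 - p)

    return min(candidates, key=lambda d: expected_cost(candidates[d]))

docs = {"d1": 0.9, "d2": 0.3, "d3": 0.6}
print(next_document(docs))  # d1: highest p(R|d) minimizes expected cost
```

With C negative (relevant retrieval is a benefit) and C′ positive, minimizing expected cost reduces to ranking by decreasing p(R|d), which is the simple form of the PRP.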
Probabilistic Information Retrieval 12 Probability Ranking Principle ▪ How do we compute all those probabilities? ▪ Do not know exact probabilities, have to use estimates ▪ Binary Independence Retrieval (BIR) – which we discuss later today – is the simplest model ▪ Questionable assumptions ▪ “Relevance” of each document is independent of relevance of other documents. ▪ Really, it’s bad to keep on returning duplicates ▪ Boolean model of relevance
Probabilistic Information Retrieval 13 Probabilistic Retrieval Strategy ▪ Estimate how terms contribute to relevance ▪ How do things like tf, df, and length influence your judgments about document relevance? ▪ One answer is the Okapi formulae (S. Robertson) ▪ Combine to find document relevance probability ▪ Order documents by decreasing probability
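The Okapi approach mentioned above combines tf, df, and document length into a single term weight. Below is a sketch of one common form of the Okapi BM25 scoring function; the parameter defaults (k1 = 1.2, b = 0.75), the smoothed idf variant, and the toy corpus are assumptions for illustration, not taken from the source.

```python
import math

# Sketch of an Okapi BM25 score: combines tf, df, and document length.
# k1, b, the idf smoothing, and the toy corpus are illustrative choices.

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency
        if df == 0:
            continue
        # Smoothed idf (the +1 keeps the weight non-negative)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(t)
        # Length-normalized tf saturation
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl)
        )
    return score

corpus = [["probabilistic", "retrieval"], ["boolean", "retrieval"], ["ranking"]]
print(bm25_score(["probabilistic"], corpus[0], corpus))
```

Documents are then ordered by decreasing score, matching the "order documents by decreasing probability" step above.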
Probabilistic Information Retrieval 14 Probabilistic Ranking Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen
Probabilistic Information Retrieval 15 Binary Independence Model ▪ Traditionally used in conjunction with PRP ▪ “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1): x = (x₁, …, xₙ) ▪ xᵢ = 1 iff term i is present in document x. ▪ “Independence”: terms occur in documents independently ▪ Different documents can be modeled as the same vector ▪ Bernoulli Naive Bayes model (cf. text categorization!)
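The binary incidence representation above is easy to make concrete: each document becomes a 0/1 vector over the vocabulary, with 1 iff the term occurs. The vocabulary and documents below are toy examples, not from the source.

```python
# Toy illustration of binary incidence vectors.
# Vocabulary and documents are invented for the example.

vocab = ["antony", "brutus", "caesar"]

def incidence_vector(doc_terms, vocab):
    """Map a document (set of terms) to its 0/1 incidence vector:
    position i is 1 iff vocab[i] occurs in the document."""
    return [1 if t in doc_terms else 0 for t in vocab]

print(incidence_vector({"caesar", "brutus"}, vocab))  # [0, 1, 1]

# Two different documents can collapse to the same vector: term
# frequency and order are discarded, only presence/absence survives.
print(incidence_vector({"brutus", "caesar", "caesar"}, vocab))  # [0, 1, 1]
```

This also shows the "different documents can be modeled as the same vector" point: any two documents with the same term set are indistinguishable under this model.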