Unranked retrieval evaluation: Precision and Recall
▪ Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
▪ Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
▪ Precision P = tp/(tp + fp)
▪ Recall R = tp/(tp + fn)

                Relevant   Nonrelevant
Retrieved       tp         fp
Not Retrieved   fn         tn
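A minimal sketch of these two formulas, assuming the retrieved and relevant documents are given as sets of document IDs (the function name and the toy sets below are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Compute unranked precision and recall from two sets of doc IDs."""
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved docs that are not relevant
    fn = len(relevant - retrieved)   # relevant docs that were missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 retrieved docs are relevant; 2 relevant docs were missed.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
print(precision_recall(retrieved, relevant))  # (0.75, 0.6)
```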
Should we instead use the accuracy measure for evaluation?
▪ Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant"
▪ The accuracy of an engine: the fraction of these classifications that are correct
▪ Accuracy = (tp + tn) / (tp + fp + fn + tn)
▪ Accuracy is a commonly used evaluation measure in machine learning classification work
▪ Why is this not a very useful evaluation measure in IR?
Why not just use accuracy?
▪ How to build a 99.9999% accurate search engine on a low budget: answer every query with nothing.
  (Mock "Noodle.com" search box: Search for: [        ] → 0 matching results found.)
▪ People doing information retrieval want to find something and have a certain tolerance for junk.
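A worked example of why this happens, using hypothetical numbers for a heavily skewed collection (the counts are made up for illustration):

```python
# Hypothetical: a collection of 10,000,000 docs where only 10 are relevant
# to the query. An engine that retrieves nothing at all gets:
tp, fp = 0, 0          # nothing retrieved, so no true or false positives
fn = 10                # all 10 relevant docs are missed
tn = 10_000_000 - 10   # everything else is correctly "not retrieved"

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)        # 0.999999 -- looks superb, yet the engine is useless

# Recall is 0 and precision is undefined (0/0), which exposes the failure
# that accuracy hides on skewed data.
```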
Precision/Recall
▪ You can get high recall (but low precision) by retrieving all docs for all queries!
▪ Recall is a non-decreasing function of the number of docs retrieved (see the sketch below)
▪ In a good system, precision decreases as either the number of docs retrieved or recall increases
▪ This is not a theorem, but a result with strong empirical confirmation
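A small sketch of the trade-off, computing precision and recall as more and more of a toy ranked list is retrieved (the relevance judgments here are invented for illustration):

```python
# Hypothetical ranking: 1 marks a relevant doc, 0 a nonrelevant one.
ranking = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
relevant_total = sum(ranking)  # 4 relevant docs in this toy collection

for k in range(1, len(ranking) + 1):
    retrieved = ranking[:k]
    tp = sum(retrieved)
    precision = tp / k
    recall = tp / relevant_total
    print(f"top-{k:2d}: P={precision:.2f}  R={recall:.2f}")

# Recall never drops as k grows (retrieving everything gives recall 1.0),
# while precision tends downward once nonrelevant docs dominate the
# retrieved set -- matching the empirical observation above.
```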
Difficulties in using precision/recall
▪ Should average over large document collection/query ensembles
▪ Need human relevance assessments
  ▪ People aren't reliable assessors
▪ Assessments have to be binary
  ▪ Nuanced assessments?
▪ Heavily skewed by collection/authorship
  ▪ Results may not translate from one domain to another