Cosine Similarity: an Example
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / (√38 · √4) = 5 / √38 = 0.81
D2 = 3T1 + 7T2 + T3     CosSim(D2, Q) = 2 / (√59 · √4) = 1 / √59 = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is about 6 times better than D2 using cosine similarity, but only 5 times better using inner product (10 vs. 2).
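The arithmetic in this example can be checked with a short script. This is a minimal sketch in plain Python; the function names `inner_product` and `cos_sim` are illustrative and do not come from the slides.

```python
import math

def inner_product(a, b):
    """Sum of pairwise products of two equal-length weight vectors."""
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    """Inner product divided by the product of the two vector lengths."""
    length_a = math.sqrt(inner_product(a, a))
    length_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (length_a * length_b)

# Weight vectors over the terms (T1, T2, T3), taken from the example.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(inner_product(D1, Q), inner_product(D2, Q))          # 10 2
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
```

The inner product rewards D1 only for its larger T3 weight, while cosine similarity also penalizes D2 for the length it spends on T1 and T2, which the query does not contain.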
Document and Term Weights
◼ Document term weights are calculated using frequencies in documents (tf) and in the collection (idf):
  tf_ij = frequency of term j in document i
  df_j = document frequency of term j = number of documents containing term j
  idf_j = inverse document frequency of term j = log2(N / df_j) (N: number of documents in the collection)
◼ Inverse document frequency: an indication of a term's value as a document discriminator
Term Weight Calculations
◼ Weight of the jth term in the ith document: d_ij = tf_ij · idf_j = tf_ij · log2(N / df_j)
◼ TF → Term Frequency
◼ A term that occurs frequently in the document but rarely in the rest of the collection has a high weight
◼ Let max_l{tf_il} be the term frequency of the most frequent term in document i
◼ Normalization: term frequency = tf_ij / max_l{tf_il}
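The weight formula and the normalization step can be combined as in the sketch below. Using the normalized term frequency inside the tf · idf product is an assumption made here for illustration, as are the function name `tf_idf_weights` and the sample numbers; the sketch also assumes every df_j is at least 1.

```python
import math

def tf_idf_weights(tf_row, df, n_docs):
    """Return max-normalized tf-idf weights for one document.

    tf_row: raw term frequencies tf_ij for document i (one entry per term j)
    df:     document frequencies df_j, in the same term order (each >= 1)
    n_docs: N, the number of documents in the collection
    """
    max_tf = max(tf_row)
    weights = []
    for tf, df_j in zip(tf_row, df):
        norm_tf = tf / max_tf if max_tf else 0.0  # tf_ij / max_l{tf_il}
        idf = math.log2(n_docs / df_j)            # log2(N / df_j)
        weights.append(norm_tf * idf)
    return weights

# Hypothetical numbers, chosen only so the logarithms come out even.
weights = tf_idf_weights(tf_row=[4, 2, 1], df=[1, 2, 8], n_docs=8)
print(weights)  # [3.0, 1.0, 0.0]
```

The third term appears in all 8 documents, so its idf is log2(1) = 0 and it contributes no weight, however often it occurs in the document.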
An example of TF
◼ Document = (A Computer Science Student Uses Computers)
◼ Vector Model based on keywords (Computer, Engineering, Student)
  Tf(Computer) = 2
  Tf(Engineering) = 0
  Tf(Student) = 1
  Max(Tf) = 2
  TF weight for:
  Computer = 2/2 = 1
  Engineering = 0/2 = 0
  Student = 1/2 = 0.5
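The same counts can be reproduced with a few lines of Python. Getting Tf(Computer) = 2 requires treating "Computers" as an occurrence of "Computer"; the `normalize` helper below is a crude stand-in for stemming and is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter

document = "A Computer Science Student Uses Computers"
keywords = ["computer", "engineering", "student"]

def normalize(word):
    """Lowercase and strip one trailing 's' (a crude stand-in for stemming)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

# Raw term frequencies for the keyword vocabulary.
counts = Counter(normalize(w) for w in document.split())
tf = {k: counts[k] for k in keywords}

# Normalize by the most frequent keyword in the document.
max_tf = max(tf.values())
tf_weight = {k: tf[k] / max_tf for k in keywords}

print(tf)         # {'computer': 2, 'engineering': 0, 'student': 1}
print(tf_weight)  # {'computer': 1.0, 'engineering': 0.0, 'student': 0.5}
```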
Inverse Document Frequency
◼ df_j gives the number of documents, among the N documents, in which term j appears
◼ Conceptually, IDF = 1/DF
◼ Typically use log2(N / df_j) for IDF
◼ Example: given 1000 documents, "computer" appears in 200 of them
◼ IDF = log2(1000 / 200) = log2(5) ≈ 2.32
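A minimal sketch of the two quantities follows. The helper names `document_frequency` and `idf` and the toy collection are hypothetical; only the 1000/200 case comes from the example. The `idf` helper assumes df is at least 1.

```python
import math

def document_frequency(term, docs):
    """Number of documents in which the term appears at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(n_docs, df):
    """Inverse document frequency log2(N / df); assumes df >= 1."""
    return math.log2(n_docs / df)

# Example from the text: N = 1000, "computer" appears in 200 documents.
print(round(idf(1000, 200), 2))  # 2.32

# Toy collection (hypothetical), each document represented as a set of terms.
docs = [
    {"computer", "science"},
    {"computer", "student"},
    {"engineering"},
    {"student"},
]
df_computer = document_frequency("computer", docs)
print(df_computer, idf(len(docs), df_computer))  # 2 1.0
```

Because df counts documents rather than occurrences, a term repeated many times within a single document still contributes only 1 to its document frequency.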