Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the three-dimensional term space with axes T1, T2, T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Similarity Measure - Inner Product
◼ Similarity between document Di and query Q can be computed as the inner vector product:
$sim(D_i, Q) = D_i \cdot Q = \sum_{k=1}^{t} d_{ik} \, q_k$
◼ Binary: weight = 1 if the word is present, 0 otherwise
◼ Non-binary: weight represents degree of similarity
◼ Example: TF/IDF (explained later)
Inner Product -- Examples
Binary:
◼ D = 1, 1, 1, 0, 1, 1, 0
◼ Q = 1, 0, 1, 0, 0, 1, 1
→ sim(D, Q) = 3
◼ Size of vector = size of vocabulary = 7
Weighted:
D1 = 2T1 + 3T2 + 5T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
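The two examples above can be checked with a few lines of Python. This is a minimal sketch, assuming each document and query is stored as a plain list of term weights in vocabulary order; the function name `inner_product` is chosen here for illustration and does not come from the slides.

```python
def inner_product(doc, query):
    """Sum of element-wise products of two equal-length term-weight vectors."""
    if len(doc) != len(query):
        raise ValueError("vectors must cover the same vocabulary")
    return sum(d * q for d, q in zip(doc, query))


# Binary example: vocabulary of 7 terms, weight 1 if the term is present.
binary_doc = [1, 1, 1, 0, 1, 1, 0]
binary_query = [1, 0, 1, 0, 0, 1, 1]
binary_score = inner_product(binary_doc, binary_query)

# Weighted example: D1 = 2T1 + 3T2 + 5T3, Q = 0T1 + 0T2 + 2T3.
d1 = [2, 3, 5]
q = [0, 0, 2]
weighted_score = inner_product(d1, q)

print(binary_score, weighted_score)  # 3 10
```

With binary weights the score is simply the number of terms that document and query share.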
Properties of Inner Product
◼ The inner product similarity is unbounded
◼ Favors long documents
◼ A long document contains a large number of unique terms, each of which may occur many times
◼ Measures how many terms matched, but not how many terms did not match
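Both properties are easy to see numerically. The sketch below is illustrative only: `d1_doubled` is a hypothetical document made by repeating D1 twice, and `focused_doc` is a hypothetical document containing only the three terms that match the binary query from the earlier example.

```python
def inner_product(doc, query):
    """Sum of element-wise products of two equal-length term-weight vectors."""
    return sum(d * q for d, q in zip(doc, query))


# Length bias: repeating a document doubles every weight and doubles the score,
# so the score can grow without bound as documents get longer.
q = [0, 0, 2]
d1 = [2, 3, 5]
d1_doubled = [2 * w for w in d1]  # hypothetical: D1 concatenated with itself
score_d1 = inner_product(d1, q)
score_doubled = inner_product(d1_doubled, q)

# Unmatched terms are ignored: a document with extra terms the query lacks
# scores the same as one that contains only the matching terms.
binary_query = [1, 0, 1, 0, 0, 1, 1]
broad_doc = [1, 1, 1, 0, 1, 1, 0]    # 5 terms, 3 of them match
focused_doc = [1, 0, 1, 0, 0, 1, 0]  # hypothetical: only the 3 matching terms
score_broad = inner_product(broad_doc, binary_query)
score_focused = inner_product(focused_doc, binary_query)

print(score_d1, score_doubled)     # 10 20
print(score_broad, score_focused)  # 3 3
```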
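Cosine Similarity Measures
◼ Cosine similarity measures the cosine of the angle between two vectors
◼ Inner product normalized by the vector lengths
$CosSim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} \, q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \, \sqrt{\sum_{k=1}^{t} q_k^2}}$
[Figure: vectors D1, D2, and Q in the term space with axes t1, t2, t3, with angles θ1 and θ2 marked]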
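Applying the cosine formula to the running example answers the question of whether D1 or D2 is closer to Q. This is a minimal sketch using only the standard library; the name `cos_sim` and the choice to return 0.0 for an all-zero vector are assumptions made here, not something the slides specify.

```python
import math


def cos_sim(doc, query):
    """Inner product of two term-weight vectors divided by their lengths."""
    dot = sum(d * q for d, q in zip(doc, query))
    doc_len = math.sqrt(sum(d * d for d in doc))
    query_len = math.sqrt(sum(q * q for q in query))
    if doc_len == 0 or query_len == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (doc_len * query_len)


d1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
d2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + T3
q = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3

sim_d1 = cos_sim(d1, q)  # 10 / (sqrt(38) * 2), about 0.81
sim_d2 = cos_sim(d2, q)  # 2 / (sqrt(59) * 2), about 0.13

# Scaling a document leaves its cosine unchanged, unlike the raw inner product.
sim_d1_doubled = cos_sim([2 * w for w in d1], q)

print(round(sim_d1, 2), round(sim_d2, 2))
```

By this measure D1 is considerably closer to Q than D2, and document length no longer inflates the score.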