Outline Motivation Brief review on Existing Similarity Measures Challenges and Our Solutions Conclusion
Outline ◼ Motivation ◼ Brief Review on Existing Similarity Measures ◼ Challenges and Our Solutions ◼ Conclusion
Similarity methods Similarity methods String Similarity Numeric Data Similarity Character-based Token-based Treated as strings Edit distance Atomic strings Affine gap distance WHIRL Smith -Waterman distance Q-grams with tf idf Jaro distance metric Q-gram distance
Similarity methods Similarity methods String Similarity Numeric Data Similarity Character-based Token-based Edit distance Affine gap distance Smith-Waterman distance Jaro distance metric Q-gram distance Atomic strings WHIRL Q-grams with tf.idf Treated as strings
edit distance edit distance, a.k.a. Levenshtein distance The minimum number of edit operations(insertions, deletions and substitutions) of single characters needed to transform the string SI into $2 E xample SI: unne cessarily Edit distance(s1, s2)=3 S2: un escessaraly O(SI, S2D Problem last names first names and street names that did not agree on a character-by- character basis For example: Similarity (John R Smith, Johathan Richard Smith)=11
Edit distance ◼ Edit distance, a.k.a. Levenshtein distance The minimum number of edit operations (insertions, deletions, and substitutions) of single characters needed to transform the string S1 into S2. ◼ Problem: last names, first names, and street names that did not agree on a character-bycharacter basis For example: Similarity(John R.Smith,Johathan Richard Smith)=11 ◼ Example1: S1:unne cessarily Edit distance(S1,S2)=3 S2:un escessaraly O(|S1| ,|S2|)
Affine gap distance Two extra edit operations: open gap and extend gap cost(g)=S+e×l(e<s), where s is the cost of opening a gap e is the cost of extending a gap, and l is the length of a gap in the alignment of two strings Example2(Affine gap distance) F.R.Smith o hn richard smi ith This method is better when matching strings have been truncated or shortened
Affine gap distance ◼ Two extra edit operations: open gap and extend gap cost(g) =s + e × l ( e<s ), where s is the cost of opening a gap, e is the cost of extending a gap, and l is the length of a gap in the alignment of two strings ◼ Example2 (Affine gap distance): ◼ This method is better when matching strings have been truncated or shortened "J. R. S m i t h“ " J o h n R i c h a r d S m i t h