Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution to transform r to s Forexample: ED(hilton, huston)=2 hilton substitute i with u hulton substitute l with u huston Property:ED(r, s)2IIrI-Isl 20/2021 PassJoin a VLDB2012
ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(hilton, huston) = 2 Property: ED(r, s) ≥ ||r|-|s|| Edit Distance 1/29/2021 PassJoin @ VLDB2012 6 hilton hulton huston substitute i with u substitute l with u
Problem formulation Give threshold t=3 [ID Strings $1 vankatesh S2 avataresha s3 kaushic chaduri S4 kaushik chakra S5 kaushuk chadhui s6<caushik chakrabar ED<s1,2>=5ED<S1,S3>=13ED<S1,S4=12ED<S1,S5>=12 ED<S1,S6>=14ED<S2,S3>=12ED<2,4>=12ED<S2,>=12 ED<s2,S6>=14ED<s3,S4=5ED<S3,5>=4ED<s3,S6>=8 ED<S4,S5>=4ED<4,S6=3ED<S,S6>=8 1/29/2021 PassJoin a VLDB2012
Problem Formulation 1/29/2021 PassJoin @ VLDB2012 7 Give threshold τ=3 ED <s1 ,s2>=5 ED <s1 ,s3>=13 ED <s1 ,s4>=12 ED <s1 ,s5>=12 ED <s1 ,s6>=14 ED <s2 ,s3>=12 ED <s2 ,s4>=12 ED <s2 ,s5>=12 ED <s2 ,s6>=14 ED <s3 ,s4>=5 ED <s3 ,s5>=4 ED <s3 ,s6>=8 ED <s4 ,s5>=4 ED <s4 ,s6>=3 ED <s5 ,s6>=8
Filter-and-refine methods Basic idea e Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs 20/2021 PassJoin a VLDB2012
Filter-and-refine Methods Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs 1/29/2021 PassJoin @ VLDB2012 8
Filter-and-refine methods Give threshold t=3 DI Strings Length s1 vankatesh 9 Pruning Condition. s2/ avataresha s3 kaushic chaduri 15 S 3 s4 kaushik chakrab15 ss kaushuk chadhui|15 s6 caushik chakrabar|17 ED≤s1,2>=5113B1412E“112 E1当-44E23212E<24-12ED=212 E当23214ED<S3,S4>=5ED<S3,S5>=4ED<S3,S6>=8 ED<S4,S5>=4ED<4,S6>=3ED<S5,S6>=8 1/29/2021 PassJoin a VLDB2012
Filter-and-refine Methods 1/29/2021 PassJoin @ VLDB2012 9 Give threshold τ=3 ED <s1 ,s2>=5 ED <s1 ,s3>=13 ED <s1 ,s4>=12 ED <s1 ,s5>=12 ED <s1 ,s6>=14 ED <s2 ,s3>=12 ED <s2 ,s4>=12 ED <s2 ,s5>=12 ED <s2 ,s6>=14 ED <s3 ,s4>=5 ED <s3 ,s5>=4 ED <s3 ,s6>=8 ED <s4 ,s5>=4 ED <s4 ,s6>=3 ED <s5 ,s6>=8 Pruning Condition: ||si | - |sj || > 3
Filter-and-refine methods ● Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Drawbacks o Need to tune parameters Bad for short strings 20/2021 PassJoin a VLDB2012
Filter-and-refine Methods Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Drawbacks Need to tune parameters Bad for short strings 1/29/2021 PassJoin @ VLDB2012 10