Edit distance ED( s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s For example: ED(sigcom, sigmod )=2 sitcom substitute c with m sIemon substitute m with d sigmod
• ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. • For example: ED(sigcom, sigmod) = 2 Edit Distance sigcom sigmom sigmod substitute c with m substitute m with d
Problem definition DEFINITION 1 (STRING SIMILARITY SEARCH). Given a string dataset R, a query string s, and an edit-distance thresh old T, the string similarity search with edit-distance con- straints finds all strings r ER such that ed(r,S)<T id string Query string lmyouteca s= yotubecom"andτ=2 r2 ubuntucom r3 utubbecou r4 youtbecom r5 yoytubeca ed(s, r4<=2 output ra as a result string dataset R
Problem Definition Query string s = “yotubecom” and τ = 2 string dataset R ed(s, r4 ) <= 2 output r4 as a result
Application Spell checking Copy detection Entity Linking ·Bⅰ informatic
Application • Spell Checking • Copy Detection • Entity Linking • Bioinformatic …
Challenge Naive method Time complexity: for each query O(RIs t
Challenge Naïve Method Time complexity: for each query 𝑂 |𝑅| |𝑠| τ
Filter-and-Verification framework Query string s Filter: Index No es Dataset r Signature()∩ Verify: Results Signature (r)=o? ED(rs)≤τ? Threshold T Pruning power Filtering Cost depends on depends on Sig(r)Ix(sig(s)) and Index: (sig(r)I(Probe set size) min((sig(r)(Sig(s) No Index: (sig(r)+(sig(s)I
No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ? Verify: ED(r,s) ≤ τ? Yes Index