Approximate Entity Extraction Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities ● For example: Dictionary of entities D ocuments Isaac newton Sigmund freund was an austrian psychiatrist who <Sigmund Freud founded the psychoanalytic school of psychology physicist phys Freud is best known for his theories of the astronomer unconscious mind and the defense mechanism of alchemist theologian repression and for creating the clinical practice of economist psychoanalysis for curing psychopathology sociologist Irou Th dialogue between a patient and a g sychoanalayst 1/28/2021 Taste@ ICDE2012 6/42
Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities. For example: Approximate Entity Extraction 1/28/2021 Taste @ ICDE2012 Isaac Newton Sigmund Freud physicist astronomer alchemist theologian economist sociologist ... Sigmund Freund was an Austrian psychiatrest who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalayst. Dictionary of Entities Documents 6/42
Outline ● Motivation ● Problem formulation Trie-based framework Trie-based algorithms o Optimizing partition Scheme ° Experiment ● Conclusion 1/28/2021 aste@ ICDE2012
Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion 1/28/2021 Taste @ ICDE2012 7/42
Problem formulation o Approximate Entity Extraction: Given a dictionary of entities E=ley e,,.,en, a document D, and a predefined edit distance threshold t, approximate entity extraction finds all"similar"pairs <S, e> such that ED(S, ej s t, where s is a substring of d and eie e (a) dictionary E (b) Document D ID Entities Len Document vancouver an efficient filter for approximates 2 vanateshe membership ChecKing kaushit 3 Ksurajit chaudrD tetretratttrti, >surajit chaudhuri, 4 caushit chaudui 15 vankatesh ganti dong rIn. 5 caushit chakra 15 vancouver, canada. sigmod 2008 1/28/2021 Taste@ ICDE2012
Problem Formulation 1/28/2021 Taste @ ICDE2012 Approximate Entity Extraction: Given a dictionary of entities E = {e1 , e2 , . . . , en }, a document D, and a predefined edit distance threshold τ, approximate entity extraction finds all “similar” pairs <s, ei> such that ED(s, ei) ≤ τ, where s is a substring of D and ei∈ E. 8/42
Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution )to transform r to s o For example: ED(marios, maras)=2 8012345 m101 34 Dij=min (Di l, j +1,Di,jI+1,a 2 1 substitutea-p 321 Ⅰ sert o 0 if a,=b i4321 where t lifa.≠b 5432 654332 1/28/2021 Taste@ ICDE2012 9/42
ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(marios, maras) = 2 Di,0 = i, D0,j = j, Di,j = min{Di-1, j + 1, Di, j-1 + 1, Di-1, j-1 + t i, j}, 0 if ai = bj 1 if ai bj . Edit Distance 1/28/2021 Taste @ ICDE2012 9/42 where t i, j = Substitute a->i Insert o
State-of-the-art methods NGPP Faerie Inverted Index Prefix of 1-variant family g-grams Filtering a substring matches with Overlap must be larger Condition the prefix of 1-variant of than a threshold partition Shortage ● Large Index size Need to tune parameters Not efficient for large threshold 1/28/2021 Taste@ ICDE2012
State-of-the-art Methods Shortage: Large Index Size Need to Tune Parameters Not efficient for large threshold 1/28/2021 Taste @ ICDE2012 10/42 NGPP Faerie Inverted Index Prefix of 1-variant family q-grams Filtering Condition a substring matches with the prefix of 1-variant of partition Overlap must be larger than a threshold