Similarity Measures in Deep Web Data Integration Fangjiao Jiang
Outline
◼ Motivation
◼ Brief Review on Existing Similarity Measures
◼ Challenges and Our Solutions
◼ Conclusion
Similarity measure: an essential point in data integration
Variations from:
◼ Representation: typographical errors, misspellings, abbreviations, etc.
◼ Extraction: from unstructured or semi-structured documents or web pages
Examples:
Smith / Smoth
Abroms / Abrams
44 W. 4th St. / 44 West Fourth Street
"R. Smith" / "Richard Smith"
"KFC" / "Kentucky Fried Chicken"
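To make the motivation concrete, the variant pairs above can be run through a simple string comparison. This is only a minimal sketch, not a method taken from the slides: it assumes a normalized Levenshtein (edit) similarity as a stand-in measure, and the helper names `edit_distance` and `edit_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1 - distance / longer length, case-insensitive."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Pairs taken from the slide; the choice of measure is an assumption.
pairs = [
    ("Smith", "Smoth"),
    ("Abroms", "Abrams"),
    ("KFC", "Kentucky Fried Chicken"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: exact={left == right}, "
          f"edit_sim={edit_similarity(left, right):.2f}")
```

Exact equality rejects all three pairs. The character-level measure scores the two typographical variants highly but gives the abbreviation pair a low score, which suggests why a single measure may not cover every kind of variation listed above.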
Similarity measure: an essential point in data integration
Similarity measure will be applied to:
◼ Keyword search: from keyword query interface to structured query interface
◼ Schema matching: from integrated query interface to local query interface
◼ Result merge: duplicate records detection (field level)
Q = {key1, key2, …}
Integrated Interface = {<C1,V1>, <C2,V2>, …}
Local Interface = {<L1,v1>, <L2,v2>, …}
Record1, Record2, Record3, Record4 → Record1, Record2, Record3, Record4