Domain Specific Data: Spread
[Figure: Restaurants, homepage attribute. Recall (y-axis) vs. # of sources (x-axis, log scale up to 1e+06).]
– 5-coverage: top-100: 35%, top-10K: 65%
Domain Specific Data: Spread
[Figure: Aggregate Reviews. Recall (y-axis) vs. # of sources (x-axis, log scale).]
– All reviews are distinct
– top-100: 65%, top-1000: 85%
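The recall figures on the two Spread slides can be read as a k-coverage measurement. The sketch below is one possible way to compute it, not the method used in the study: it assumes that sources are ranked by the number of entities they mention and that k-coverage is the fraction of entities appearing in at least k of the top-n sources. The function name `k_coverage` and the toy data are hypothetical.

```python
from collections import Counter


def k_coverage(source_entities, all_entities, top_n, k=1):
    """Fraction of entities found in at least k of the top_n sources.

    Assumption: sources are ranked by how many entities they mention.
    """
    ranked = sorted(
        source_entities,
        key=lambda s: len(source_entities[s]),
        reverse=True,
    )
    counts = Counter()
    for source in ranked[:top_n]:
        # Count each entity at most once per source.
        counts.update(set(source_entities[source]) & all_entities)
    covered = sum(1 for entity in all_entities if counts[entity] >= k)
    return covered / len(all_entities)


# Hypothetical toy data: three sources mentioning four restaurants.
sources = {
    "big.example": {"r1", "r2", "r3"},
    "mid.example": {"r1", "r2"},
    "small.example": {"r4"},
}
entities = {"r1", "r2", "r3", "r4"}

recall_top2_k1 = k_coverage(sources, entities, top_n=2, k=1)
recall_top2_k2 = k_coverage(sources, entities, top_n=2, k=2)
recall_top3_k1 = k_coverage(sources, entities, top_n=3, k=1)
```

Raising k lowers recall for the same set of sources, which is why a 5-coverage curve sits below a 1-coverage curve.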
Domain Specific Data: Connectivity
How well are the sources "connected" in a given domain?
– Do you have to be a search engine to find domain-specific sources?
[DMP12] considered the entity-source graph for various domains
– Bipartite graph with entities and sources (websites) as nodes
– Edge between entity e and source s if some webpage in s contains e
Methodology of case study:
– Study graph properties, e.g., diameter and connected components
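The entity-source graph described above can be sketched in a few lines. This is an illustrative sketch, not the code used in [DMP12]: the input format (a list of `(source, entity)` pairs), all function names, and the toy data are assumptions. It builds the bipartite graph, finds connected components with breadth-first search, and reports the share of entities in the largest component and that component's diameter.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """Build an undirected bipartite graph from (source, entity) pairs."""
    adj = defaultdict(set)
    for source, entity in pairs:
        s, e = ("source", source), ("entity", entity)
        adj[s].add(e)
        adj[e].add(s)
    return dict(adj)


def bfs_distances(adj, start):
    """Hop distance from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def connected_components(adj):
    seen, components = set(), []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            components.append(component)
    return components


def count_entities(nodes):
    return sum(1 for kind, _ in nodes if kind == "entity")


def diameter(adj, component):
    """Longest shortest path in a component (all-pairs BFS, small graphs only)."""
    return max(max(bfs_distances(adj, n).values()) for n in component)


# Hypothetical toy data: three sites, four entities.
pairs = [
    ("a.example", "e1"), ("a.example", "e2"),
    ("b.example", "e2"), ("b.example", "e3"),
    ("c.example", "e4"),
]
graph = build_graph(pairs)
components = connected_components(graph)
largest = max(components, key=count_entities)
entity_share = count_entities(largest) / count_entities(graph)
largest_diameter = diameter(graph, largest)
```

On the toy data the largest component holds three of the four entities, and its diameter is the path e1, a.example, e2, b.example, e3. All-pairs BFS is quadratic, so a graph at web scale would need sampling or an approximate diameter estimate.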
Domain Specific Data: Connectivity
Almost all entities are connected to each other
– Largest connected component has more than 99% of entities
[Table: one row per domain and attribute (Books/ISBN; phone and homepage for Automotive, Banks, Home, Hotels, Libraries, Restaurants, Retail, Schools). Columns: avg # sites per entity, graph diameter, # connected components, % of entities in the largest component. Most cell values are not recoverable from the extraction; the legible largest-component percentages range from 99.20 to 99.99.]
Domain Specific Data: Connectivity
High redundancy and overlap enable use of bootstrapping
– Low diameter ensures that most sources can be found quickly
[Table repeated from the previous slide.]
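The bootstrapping idea can be illustrated as an alternating expansion over the entity-source graph: look up the sources that mention known entities, then harvest the entities on those sources, and repeat. The sketch below is an assumption-laden illustration, not a procedure given on the slides; the function `bootstrap`, the `source_to_entities` mapping, and the toy data are hypothetical, and a real system would replace the dictionary lookups with search queries and page extraction.

```python
from collections import defaultdict


def invert(source_to_entities):
    """Map each entity to the set of sources that mention it."""
    entity_to_sources = defaultdict(set)
    for source, ents in source_to_entities.items():
        for entity in ents:
            entity_to_sources[entity].add(source)
    return dict(entity_to_sources)


def bootstrap(source_to_entities, seeds, max_rounds=10):
    """Expand from seed entities to sources and back until nothing new appears."""
    entity_to_sources = invert(source_to_entities)
    known_entities, known_sources = set(seeds), set()
    frontier = set(seeds)
    rounds = 0
    while frontier and rounds < max_rounds:
        # Step 1: sources mentioning any frontier entity.
        new_sources = set()
        for entity in frontier:
            new_sources |= entity_to_sources.get(entity, set())
        new_sources -= known_sources
        if not new_sources:
            break
        known_sources |= new_sources
        # Step 2: entities listed on the newly found sources.
        new_entities = set()
        for source in new_sources:
            new_entities |= source_to_entities[source]
        frontier = new_entities - known_entities
        known_entities |= frontier
        rounds += 1
    return known_entities, known_sources, rounds


# Hypothetical toy data: d.example shares no entity with the other sites.
source_to_entities = {
    "a.example": {"e1", "e2"},
    "b.example": {"e2", "e3"},
    "c.example": {"e3", "e4"},
    "d.example": {"e5"},
}
found_entities, found_sources, rounds_used = bootstrap(source_to_entities, {"e1"})
```

The number of productive rounds is bounded by the diameter of the seed's component, so a low diameter means few rounds. A source outside that component, like d.example here, is never reached, which is why the size of the largest connected component matters.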