Case Study I: Domain Specific Data [DMP12]
Goal: analysis of domain-specific structured data across the Web
Questions addressed:
– How is the data about a given domain spread across the Web?
– How easy is it to discover entities, sources in a given domain?
– How much value do the tail entities in a given domain have?
Domain Specific Data: Spread
How many sources needed to build a complete DB for a domain?
[DMP12] looked at 9 domains with the following properties:
– Access to large comprehensive databases of entities in the domain
– Entities have attributes that are (nearly) unique identifiers, e.g., ISBN for Books, phone number or homepage for Restaurants
Methodology of case study:
– Used the entire web cache of Yahoo! search engine
– Webpage has an entity if it contains an identifying attribute
– Aggregate the set of all entities found on each website (source)
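The last two methodology bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in [DMP12]: the phone-number pattern, the use of the URL host as the "website", and the names `PHONE_RE` and `entities_by_site` are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed pattern for 10-digit US-style phone numbers (hypothetical;
# other domains would use other identifiers, e.g. ISBNs for books).
PHONE_RE = re.compile(r"(?<!\d)\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)")


def entities_by_site(pages, known_entities):
    """Map each website (URL host) to the set of known entities it mentions.

    pages: iterable of (url, page_text) pairs.
    known_entities: set of normalized identifiers from the reference database.
    """
    site_entities = defaultdict(set)
    for url, text in pages:
        host = urlparse(url).netloc.lower()
        for match in PHONE_RE.finditer(text):
            ident = "".join(match.groups())  # normalize to digits only
            # A page "has" an entity if it contains its identifying attribute.
            if ident in known_entities:
                site_entities[host].add(ident)
    return dict(site_entities)


pages = [
    ("http://a.example/p1", "Call (415) 555-0100"),
    ("http://a.example/p2", "415.555.0101 and 415-555-0100"),
    ("http://b.example/", "Phone: 415 555 0100, fax 999-999-9999"),
]
known = {"4155550100", "4155550101"}
sites = entities_by_site(pages, known)
```

Here `sites` groups the two pages of `a.example` into one source, and the fax number on `b.example` is ignored because it is not in the reference database.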
Domain Specific Data: Spread
[Figure: Restaurants, phones. Recall vs. # of sources, 1-coverage. Top-10: 93%; top-100: 100%. Annotation: strong aggregator source]
Domain Specific Data: Spread
[Figure: Restaurants, phones. Recall vs. # of sources, 5-coverage. Top-5000: 90%; top-100K: 95%]
Domain Specific Data: Spread
[Figure: Restaurants, homepage. Recall vs. # of sources, 1-coverage. Top-100: 80%; top-10K: 95%]
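The recall curves on these slides can be approximated with a short sketch. The slides do not spell out the definition, so this assumes that k-coverage of the top-n sources is the fraction of database entities that appear on at least k of those n sources, and that sources are ranked by how many entities they contain. The function name `k_coverage_curve` and the toy data are hypothetical.

```python
def k_coverage_curve(site_entities, database, k=1):
    """Return recall after adding each source, largest source first.

    site_entities: dict mapping source name -> set of entity identifiers.
    database: set of all entities in the reference database.
    k: an entity counts as covered once it is seen on k distinct sources.
    """
    if not database:
        raise ValueError("reference database must not be empty")
    # Assumed ranking: sources with the most entities come first.
    ranked = sorted(site_entities.values(), key=len, reverse=True)
    seen_count = {entity: 0 for entity in database}
    covered = 0
    curve = []
    for entities in ranked:
        for entity in entities:
            if entity in seen_count:
                seen_count[entity] += 1
                if seen_count[entity] == k:
                    covered += 1
        curve.append(covered / len(database))
    return curve


toy_sites = {"big": {1, 2, 3}, "mid": {1, 2}, "small": {4}}
toy_db = {1, 2, 3, 4}
curve_1 = k_coverage_curve(toy_sites, toy_db, k=1)  # [0.75, 0.75, 1.0]
curve_2 = k_coverage_curve(toy_sites, toy_db, k=2)  # [0.0, 0.5, 0.5]
```

In the toy data, 1-coverage reaches 100% only once the small tail source is added, while 2-coverage stalls at 50% because entities 3 and 4 each appear on a single source.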