Outline
• Overview of web search
• Next generation search engines
CCF-ADL at Zhengzhou University, June 25-27, 2010

Characteristics of Web Information
• “Infinite” size (surface vs. deep Web)
  – Surface = static HTML pages
  – Deep = dynamically generated HTML pages (DB)
• Semi-structured
  – Structured = HTML tags, hyperlinks, etc.
  – Unstructured = text
• Different formats (pdf, word, ps, …)
• Multimedia (text, audio, images, …)
• High variance in quality (a lot of junk)
• “Universal” coverage (can be about any content)

General Challenges in Web Information Management
• Handling the size of the Web
  – How to ensure completeness of coverage?
  – Efficiency issues
• Dealing with or tolerating errors and low-quality information
• Addressing the dynamics of the Web
  – Some pages may disappear permanently
  – New pages are constantly created

“Free text” vs. “Structured text”
• So far, we’ve assumed “free text”
  – Document = word sequence
  – Query = word sequence
  – Collection = a set of documents
  – Minimal structure …
• But text may also carry structure (e.g., title, hyperlinks)
  – Can we exploit this structure in retrieval?
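
One way to picture exploiting structure in retrieval is to let a match in the title count more than a match in the body. The sketch below is only an illustration under stated assumptions, not a method taken from these slides: the field names (`title`, `body`), the weights, the whitespace tokenizer, and the function `score` are all hypothetical.

```python
from collections import Counter

# Hypothetical field weights: a title match counts more than a body match.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, kept deliberately simple)."""
    return text.lower().split()

def score(query, doc, weights=FIELD_WEIGHTS):
    """Sum the weighted term frequencies of the query words over the document's fields."""
    total = 0.0
    for field, weight in weights.items():
        counts = Counter(tokenize(doc.get(field, "")))
        total += weight * sum(counts[term] for term in tokenize(query))
    return total

# Toy collection: each document is a dict of fields instead of one word sequence.
docs = [
    {"title": "Web search engines", "body": "Crawling and indexing pages."},
    {"title": "Cooking at home", "body": "A web search for recipes helps."},
]

# Rank documents by descending score for the query "web search".
ranked = sorted(docs, key=lambda d: score("web search", d), reverse=True)
```

Under the "free text" assumption both documents contain the query words exactly once each; with the field weights, the document that has them in its title ranks first.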

Examples of Document Structures
• Intra-doc structures (= relations between components)
  – Natural components: title, author, abstract, sections, references, …
  – Annotations: named entities, subtopics, markup, …
• Inter-doc structures (= relations between documents)
  – Topic hierarchy
  – Hyperlinks/citations (hypertext)
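
As a rough illustration of inter-doc structure, hyperlinks can be stored as a map from each page to the pages it links to, and then inverted to see which pages are pointed at most. This is a minimal sketch under assumptions: the page names, the `links` map, and the function `inlink_counts` are hypothetical, and counting in-links is just one simple signal one could read off such a graph.

```python
from collections import defaultdict

# Hypothetical toy collection: each page lists the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}

def inlink_counts(out_links):
    """Invert an out-link map and count how many distinct pages point at each page."""
    sources = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            sources[dst].add(src)
    # Report a count for every known page, including pages with no in-links.
    return {page: len(sources[page]) for page in out_links}

counts = inlink_counts(links)
```

The same inversion applies to citations between papers, since both are directed relations between documents.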