[Figure: a sample HTML page (title "Here comes the DOM", heading "Document Object Model", an image dom.png, and a link to the W3C DOM page) shown next to its DOM tree.]
Slides © 2007 Filippo Menczer, Indiana University School of Informatics
Bing Liu: Web Data Mining. Springer, 2007
Ch. 8 Web Crawling by Filippo Menczer

More implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately, actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
  – E.g. tidy.sourceforge.net
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
  – Flash, SVG, RSS, AJAX…
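The point about forgiving parsers and HTML entities can be illustrated with a minimal sketch. It assumes Python's standard html.parser module stands in for the crawler's parser; the class name LinkAndTextExtractor, the helper parse_page, and the sample markup are hypothetical and not taken from the slides.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect href targets and visible text without requiring valid HTML."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &eacute;
        # before text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record link targets; tags that are never closed are not an error here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def parse_page(markup):
    """Return (links, text) for a possibly malformed HTML string."""
    parser = LinkAndTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical sloppy markup: unclosed <h2> and <p>, unquoted attribute value.
sloppy = (
    '<h2>Caf&eacute; &amp; DOM<p>See the '
    '<a href="http://example.org/dom">DOM</a> spec<img src=dom.png>'
)
links, text = parse_page(sloppy)
```

This event-based parser does not build a DOM tree or repair the markup the way a tidy-style tool would; it only shows that link and text extraction can proceed when the input is not well-formed.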
More implementation issues
• Stop words
  – Noise words that do not carry meaning should be eliminated (“stopped”) before they are indexed
  – E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
  – Typically syntactic markers
  – Typically the most common terms
  – Typically kept in a negative dictionary
    • 10–1,000 elements
    • E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
  – Parser can detect these right away and disregard them
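A minimal sketch of stop-word removal at parse time, assuming a negative dictionary held in memory as a set. The seven entries are the examples from the slide; the tokenizer regex and the name index_terms are illustrative assumptions.

```python
import re

# Tiny negative dictionary using the slide's examples; a real list would
# typically hold somewhere between 10 and 1,000 entries.
STOP_WORDS = frozenset({"and", "the", "a", "at", "or", "on", "for"})


def index_terms(text):
    """Tokenize text and drop stop words before anything is indexed."""
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


terms = index_terms("The crawler looks for links on a page")
```

A set gives constant-time membership tests on average, which is why the parser can discard these words right away.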
More implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
  1. We want to ignore superficial morphological features, and thus merge semantically similar tokens
    – {student, study, studying, studious} => studi
  2. We can also conflate synonyms into a single form using a thesaurus
    – 30–50% smaller index
    – Doing this in both pages and queries allows retrieving pages about ‘automobile’ when the user asks for ‘car’
    – Thesaurus can be implemented as a hash table
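The hash-table thesaurus can be sketched as follows, assuming a Python dict that maps each synonym to one canonical form. The entries, the page contents, and the function names are hypothetical. The same mapping is applied to page tokens at indexing time and to query tokens at search time.

```python
# Hypothetical thesaurus: each synonym maps to a single canonical form.
THESAURUS = {"automobile": "car", "auto": "car"}


def conflate(tokens):
    """Replace every token that has a thesaurus entry by its canonical form."""
    return [THESAURUS.get(token, token) for token in tokens]


def build_index(pages):
    """Build an inverted index (term -> set of page ids) over conflated tokens."""
    index = {}
    for page_id, tokens in pages.items():
        for term in conflate(tokens):
            index.setdefault(term, set()).add(page_id)
    return index


def search(index, query_tokens):
    """Return the ids of pages containing every conflated query term."""
    result = None
    for term in conflate(query_tokens):
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()


pages = {
    "p1": ["used", "automobile", "prices"],
    "p2": ["car", "rental"],
    "p3": ["train", "tickets"],
}
index = build_index(pages)
hits = search(index, ["car"])
```

Because both sides go through conflate, a query for ‘car’ and a query for ‘automobile’ reach the same index entry.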
More implementation issues
• Stemming
  – Morphological conflation based on rewrite rules
  – Language dependent!
  – Porter stemmer very popular for English
    • http://www.tartarus.org/~martin/PorterStemmer/
    • Context-sensitive grammar rules, e.g.:
      – “IES” except (“EIES” or “AIES”) --> “Y”
    • Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
  – Porter has also developed Snowball, a language to create stemming algorithms in any language
    • http://snowball.tartarus.org/
    • Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
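The single rewrite rule quoted above can be sketched in a few lines. This is only that one rule, written as an illustration of a context-sensitive suffix rule; it is not an implementation of the Porter stemmer, which applies many rules in sequence and may produce different stems (such as "studi"). The function name apply_ies_rule is hypothetical.

```python
def apply_ies_rule(word):
    """Apply the rule: "IES" except ("EIES" or "AIES") --> "Y"."""
    lowered = word.lower()
    # The exceptions are checked first: the suffix is rewritten only when it
    # is not preceded by "e" or "a".
    if lowered.endswith("ies") and not lowered.endswith(("eies", "aies")):
        return lowered[:-3] + "y"
    return lowered


examples = {w: apply_ies_rule(w) for w in ["libraries", "studies", "studying"]}
```

Words that do not end in "ies" pass through unchanged, so a rule like this can be chained with other suffix rules.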
More implementation issues
• Static vs. dynamic pages
  – Is it worth trying to eliminate dynamic pages and only index static pages?
  – Examples:
    • http://www.census.gov/cgi-bin/gazetteer
    • http://informatics.indiana.edu/research/colloquia.asp
    • http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452
    • http://www.imdb.com/Name?Menczer,+Erico
    • http://www.imdb.com/name/nm0578801/
  – Why or why not? How can we tell if a page is dynamic? What about ‘spider traps’?
  – What do Google and other search engines do?
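One possible way to approach "how can we tell if a page is dynamic?" is a URL-based guess. The sketch below assumes three heuristics that the slides do not state: a query string, a path marker such as /cgi-bin/, or a script-like file extension. The marker lists and the name looks_dynamic are illustrative assumptions, and such a check can only guess from the URL, not from how the server produces the page.

```python
from urllib.parse import urlsplit

# Assumed markers, chosen to match the example URLs; not an authoritative list.
DYNAMIC_PATH_MARKERS = ("/cgi-bin/", "/exec/")
DYNAMIC_EXTENSIONS = (".asp", ".php", ".jsp", ".cgi")


def looks_dynamic(url):
    """Guess from the URL alone whether a page is generated dynamically."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.query:  # e.g. /Name?Menczer,+Erico
        return True
    if any(marker in path for marker in DYNAMIC_PATH_MARKERS):
        return True
    return path.endswith(DYNAMIC_EXTENSIONS)


example_urls = [
    "http://www.census.gov/cgi-bin/gazetteer",
    "http://informatics.indiana.edu/research/colloquia.asp",
    "http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452",
    "http://www.imdb.com/Name?Menczer,+Erico",
    "http://www.imdb.com/name/nm0578801/",
]
guesses = [looks_dynamic(url) for url in example_urls]
```

The last example URL is not flagged because nothing in it matches the assumed markers, even though a URL of that shape may well be served from a database. This is one reason filtering by URL alone is unreliable.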