MASSACHVSETTS INSTITVTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.001-Structure and Interpretation of Computer Programs
Spring Semester, 2005

Issued: Tuesday, March 15, 2005
Solutions due on online tutor: Friday, April 1, 2005 by 6:00 PM

Crawling and Indexing the World Wide Web

This project explores some issues that arise in constructing a "spider" or a "web agent" that crawls over documents in the World Wide Web. For purposes of this project, the Web is an extremely large collection of documents. Each document contains some text and also links to other documents, in the form of URLs.

In this project, we'll be working with programs that can start with an initial document and follow the references to other documents to do useful things. For example, we could construct an index of all the words occurring in documents, and make this available to people looking for information on the web (as do many of the search engines on the web, such as Google or Yahoo).

Just in case you aren't fluent with the details of HTTP, URLs, URIs, HTML, XML, XSL, HTTP-NG, DOM, and the rest of the alphabet soup that makes up the technical details of the Web, here's a simplified version of what goes on behind the scenes:

1. The Web consists of a very large number of things called documents, identified by names called URLs (Uniform Resource Locators).
   For example, the OCW home page has the URL http://ocw.mit.edu/. The first portion of a URL (http://) reveals the name of a protocol (in this case Hypertext Transfer Protocol, or HTTP) that can be used to fetch the document, and the rest of the URL contains information needed by the protocol to specify which document is intended. (A protocol is a particular set of rules for how to communicate.)

2. By using the HTTP protocol, a program can retrieve a document whose URL starts with http://. Most commonly the program is a browser, but any program can do this; "web agents" and spiders are examples of such programs that aren't browsers. The document is returned to the program, along with information about how it is encoded, for example, ASCII or Unicode text, HTML, images in GIF or JPG or MPEG or PNG or some other format, an Excel spreadsheet, etc.

3. Documents encoded in HTML (HyperText Markup Language) form can contain a mixture of text, images, formatting information, and links to other documents. Thus, when a browser (or other program) gets an HTML document, it can extract the links from it, yielding URLs for other documents in the Web. If these are in HTML format, then they too can be retrieved and will yield yet more links, and so on.

4. A spider is a program that starts with an initial set of URLs, retrieves the corresponding documents, adds the links from these documents to the set of URLs, and keeps on going. Every time it retrieves a document, it does some (hopefully useful) work in addition to just finding the embedded links.
6.001, Spring Semester, 2005--Project 3

5. One particularly interesting kind of spider constructs an index of the documents it has seen. This index is similar to the index at the end of a book: it has certain key words and phrases, and for each entry it lists all of the URLs that contain that word or phrase. There are many kinds of indexes, and the art/science of deciding what words or phrases to index and how to extract them is at the cutting edge of research (it's part of the discipline called information retrieval). We'll talk mostly about full text indexing, which means that every word in the document (except, perhaps, the very common words like "and," "the," and "an") is indexed.

In this project, we'll be interested in three tasks related to searching the World Wide Web. First, we will develop a way to think about the "web" of links as a directed graph. Second, we will build procedures to help in traversing or searching through graphs such as the Web. Third, we will consider ways to build an index for some set of web pages to support fast retrieval of URLs that contain a given word.

1 Directed graphs

The essence of the Web, for the purpose of understanding the process of searching, is captured by a formal abstraction called a directed graph. A graph (like the one in Figure 1) consists of nodes and edges. In this figure, the nodes are labelled U through Z. Nodes are connected to other nodes via edges. In a directed graph, each edge has a direction, so the existence of an outgoing edge from one node to another node (e.g. node X to node Y) does not imply that there is an edge in the reverse direction (e.g. from node Y to node X). Notice that there can be multiple outgoing edges from a node as well as multiple incoming edges to a node, e.g. there are edges from both Y and Z to W. The set of nodes reachable via a single outgoing edge from a given node is referred to as the node's children. For example, the children of node W are nodes U and X.
Lastly, a graph is said to contain a cycle if you start from some node and manage to return to that same node after traversing one or more edges. So, for example, the nodes W, X and Y form a cycle, as does the node V by itself.

A second example of a directed graph is shown in Figure 2. This particular directed graph happens to be a tree: each node is pointed to by at most one other node, and thus there is no sharing of nodes, and there are no cycles (or loops).

In order to traverse a directed graph, let's assume that we have two selectors for getting information from the graph:

• (find-node-children graph node) returns a list of the nodes in graph that can be reached in one step by outbound edges from node. For example, in Figure 2 the children of node B are C, D, E, and H: the things that can be reached in one hop by an outgoing edge.

• (find-node-contents graph node) returns the contents of the node. For example, when we represent the web as a graph, we will want the node contents to be an alphabetized list of all of the words occurring in the document at node.

Note, we have not said anything yet about the actual representation of a graph, a node, or an edge. We are simply stating an abstract definition of a data structure.
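As a rough illustration of how little a traversal needs to know about the representation, the sketch below collects every node reachable from a starting node using only find-node-children. The association-list representation and the names toy-graph and reachable-nodes are assumptions made for this example (they are not the project's representation), and the edges form a made-up graph containing a cycle and a self-loop, not a transcription of Figure 1.

```scheme
;; Hypothetical toy representation: each entry is (node children contents).
(define toy-graph
  '((a (b c) (words in a))
    (b (c)   (words in b))
    (c (a)   (words in c))     ; c -> a closes the cycle a -> b -> c -> a
    (d (d)   (words in d))))   ; d points to itself

;; The two selectors assumed in the text, written for the toy representation.
(define (find-node-children graph node)
  (let ((entry (assq node graph)))
    (if entry (cadr entry) '())))

(define (find-node-contents graph node)
  (let ((entry (assq node graph)))
    (if entry (caddr entry) '())))

;; Depth-first collection of reachable nodes. The visited list
;; keeps the traversal from looping forever around a cycle.
(define (reachable-nodes graph start)
  (define (visit node visited)
    (if (memq node visited)
        visited
        (visit-all (find-node-children graph node)
                   (cons node visited))))
  (define (visit-all nodes visited)
    (if (null? nodes)
        visited
        (visit-all (cdr nodes) (visit (car nodes) visited))))
  (reverse (visit start '())))
```

With these definitions, (reachable-nodes toy-graph 'a) evaluates to (a b c), and (reachable-nodes toy-graph 'd) evaluates to (d). Swapping in a different graph representation would only require rewriting the two selectors.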
Figure 1: An example of a general graph.

Figure 2: An example of a tree, viewed as a directed graph.
1.1 The Web as a graph

The Web itself can be thought of as a directed graph in which the nodes are HTML documents and the edges are hyperlinks to other HTML documents. For example, in Figure 2 the node labeled B would be a URL, and a directed edge exists between two nodes B and E if the document represented by node B contains a link to the document represented by node E (as it does in this case).

As mentioned earlier, a web spider (or web crawler) is a program that traverses the web. A web spider might support procedures such as:

• (find-URL-links web URL) returns a list of the URLs that are outbound links from URL.

• (find-URL-text web URL) returns an alphabetized list of all of the words occurring in the document at URL.

Note, we have not said anything yet about the actual representation of the web. We are simply stating an abstract definition of a data structure.

In a real web crawler, (find-URL-links web URL) would involve retrieving the document over the network using its URL, parsing the HTML information returned by the web server, and extracting the link information from <A HREF=...>, <IMG SRC=...> and similar tags. Similarly, in a real web crawler, (find-URL-text web URL) would retrieve the document, discard all of the mark-up commands (such as <BODY>, <HTML>, <UL>, etc.), alphabetize (and remove duplicates from) the text, and return the resulting list of words.

For this project our programs will not actually go out and retrieve documents over the web. Instead, we will represent a collection of web documents as a graph, as discussed earlier. When you load the code for this project, you will have available a global variable, the-web, which holds the graph representation for a set of documents for use in this project.

Note: it is important to separate our particular representation of information on the web from the idea of the web as a loose collection of documents.
We are choosing to use a graph to capture a simple version of the web; this is simply to provide us with a concrete representation of the web, so that we can examine issues related to exploring the web. In practice, we would never build an entire graph representation of the web; we would simply take advantage of the abstraction to conceptualize the structure of the web, especially since it is a dynamic thing that constantly changes.

Our implementation of find-URL-links and find-URL-text will simply use the graph procedures to get web links (children) and web page contents:

    (define (find-URL-links web url)
      (find-node-children web url))

    (define (find-URL-text web url)
      (find-node-contents web url))

In other words, we are converting operations that would normally apply to the web itself into operations that work on the internal representation of the web as a graph.
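The "alphabetize and remove duplicates" step that a real find-URL-text would perform can be sketched as follows. This is only an illustration: it assumes the words have already been extracted from the document as a list of symbols, and that the Scheme in use provides a two-argument (sort list less?) procedure, as MIT Scheme does. The names symbol-before?, remove-adjacent-duplicates, and words->index-list are hypothetical and do not come from the project code.

```scheme
;; Order two symbols alphabetically by their printed names.
(define (symbol-before? a b)
  (string<? (symbol->string a) (symbol->string b)))

;; Drop adjacent duplicates from an already sorted list of symbols.
(define (remove-adjacent-duplicates sorted)
  (cond ((null? sorted) '())
        ((null? (cdr sorted)) sorted)
        ((eq? (car sorted) (cadr sorted))
         (remove-adjacent-duplicates (cdr sorted)))
        (else (cons (car sorted)
                    (remove-adjacent-duplicates (cdr sorted))))))

;; Sort first so that equal words end up next to each other,
;; then remove the repeats.
(define (words->index-list words)
  (remove-adjacent-duplicates (sort words symbol-before?)))
```

For example, (words->index-list '(the web is the graph)) evaluates to (graph is the web). Sorting before removing duplicates means each word only has to be compared with its neighbor.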
1.2 Directed Graph Abstraction

We will build a graph abstraction to capture the relationships shown in Figures 1 and 2, as well as enable us to have some contents at each node. You should study the code in search.scm, provided with the project, very closely; parts of it are described in the following discussion.

We will assume that our graph is represented as a collection of graph-elements. Each graph-element will itself consist of a node (represented as a symbol, the name of the node), a list of children nodes, and some contents stored at the node (which in general can be of any type). The constructor, type predicate, and selectors for this abstraction are:

    ;;;------------------------------------------------------------
    ;;; Graph Abstraction
    ;;;
    ;;; Graph               a collection of Graph-Elements
    ;;; Graph-Element       a node, outgoing children from the
    ;;;                     node, and contents of the node
    ;;; Node = symbol       a symbol label or name for the node
    ;;; Contents = anytype  the contents of the node
    ;;---------------
    ;; Graph-Element

    ; make-graph-element: Node,list<Node>,Contents -> Element
    (define (make-graph-element node children contents)
      (list 'graph-element node children contents))

    (define (graph-element? element)            ; anytype -> boolean
      (and (pair? element) (eq? 'graph-element (car element))))

    ; Get the node (the name) from the Graph-Element
    (define (graph-element->node element)       ; Graph-Element -> Node
      (if (not (graph-element? element))
          (error "object not element: " element)
          (first (cdr element))))

    ; Get the children (a list of outgoing node names)
    ; from the Graph-Element
    (define (graph-element->children element)   ; Graph-Element -> list<Node>
      (if (not (graph-element? element))
          (error "object not element: " element)
          (second (cdr element))))

    ; Get the contents from the Graph-Element
    (define (graph-element->contents element)   ; Graph-Element -> Contents
      (if (not (graph-element? element))
          (error "object not element: " element)
          (third (cdr element))))

Given this representation for a graph-element, we can build the graph out of these elements as follows:

    (define (make-graph elements)               ; list<Element> -> Graph
      (cons 'graph elements))
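To see the abstraction in use, the sketch below repeats the constructors in compact form (without the error checks, and using cadr/caddr/cadddr in place of first/second/third so that it runs in most Scheme implementations) and then layers the two graph selectors on top of them. The helper find-element and the three-node small-graph are assumptions made for illustration; search.scm may organize the lookup differently.

```scheme
;; Compact restatement of the graph-element abstraction (no error checks).
(define (make-graph-element node children contents)
  (list 'graph-element node children contents))
(define (graph-element->node element) (cadr element))
(define (graph-element->children element) (caddr element))
(define (graph-element->contents element) (cadddr element))
(define (make-graph elements) (cons 'graph elements))

;; Hypothetical helper: linear search for the element named node.
;; Returns #f when the graph has no such node.
(define (find-element graph node)
  (let loop ((elements (cdr graph)))
    (cond ((null? elements) #f)
          ((eq? (graph-element->node (car elements)) node)
           (car elements))
          (else (loop (cdr elements))))))

(define (find-node-children graph node)
  (let ((element (find-element graph node)))
    (if element (graph-element->children element) '())))

(define (find-node-contents graph node)
  (let ((element (find-element graph node)))
    (if element (graph-element->contents element) '())))

;; A made-up graph: a points to b and c, and c points back to a.
(define small-graph
  (make-graph
   (list (make-graph-element 'a '(b c) '(some words))
         (make-graph-element 'b '()    '(more words))
         (make-graph-element 'c '(a)   '(other words)))))
```

Here (find-node-children small-graph 'a) evaluates to (b c) and (find-node-contents small-graph 'c) evaluates to (other words). Returning the empty list for an unknown node is a design choice of this sketch; signalling an error would be an equally reasonable alternative.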