当前位置：和泉文库 > 数学 > 浏览文档

《线性代数》课程教学资源（文献资料）THE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE

文件格式：PDF，文件大小：283.69KB，售价：2.96元

文档详细内容（约11页）

THE S25.000.000.000*EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE KURT BRYANT AND TANYA LEISE Abstract.Google's success derives in large part from its PageRank algorithm,which ranks the importance of webpages according to an eigenvector of a weighted link matrix.Analysis of the PageRank formula provides a wonderful applied topic for a linear algebra course.Instructors may assign this article as a project to more advanced students,or spend one or two lectures presenting the material with assigned homework from the exercises.This material also complements the discussion of Markov chains in matrix algebra.Maple and Mathematica files supporting this material can be found at www.rose-hulman.edu/~bryan. Key words.linear algebra,PageRank,eigenvector,stochastic matrix AMS subject classifications.15-01,15A18,15A51 1.Introduction.When Google went online in the late 1990's,one thing that set it apart from other search engines was that its search result listings always seemed deliver the "good stuff" up front.With other search engines you often had to wade through screen after screen of links to irrelevant web pages that just happened to match the search text.Part of the magic behind Google is its PageRank algorithm,which quantitatively rates the importance of each page on the web,allowing Google to rank the pages and thereby present to the user the more important (and typically most relevant and helpful)pages first. Understanding how to calculate PageRank is essential for anyone designing a web page that they want people to access frequently,since getting listed first in a Google search leads to many people looking at your page.Indeed,due to Google's prominence as a search engine,its ranking system has had a deep influence on the development and structure of the internet,and on what kinds of information and services get accessed most frequently.Our goal in this paper is to explain one of the core ideas behind how Google calculates web page rankings.This turns out to be a delightful application of standard linear algebra. Search engines such as Google have to do three basic things: 1.Crawl the web and locate all web pages with public access. 2.Index the data from step 1,so that it can be searched efficiently for relevant keywords or phrases. 3.Rate the importance of each page in the database,so that when a user does a search and the subset of pages in the database with the desired information has been found,the more important pages can be presented first. This paper will focus on step 3.In an interconnected web of pages,how can one meaningfully define and quantify the "importance"of any given page? The rated importance of web pages is not the only factor in how links are presented,but it is a significant one.There are also successful ranking algorithms other than PageRank.The interested reader will find a wealth of information about ranking algorithms and search engines,and we list just a few references for getting started (see the extensive bibliography in [9],for example,for a more complete list).For a brief overview of how Google handles the entire process see [6],and for an in-depth treatment of PageRank see [3]and a companion article [9].Another article with good concrete examples is [5].For more background on PageRank and explanations of essential principles of web design to maximize a website's PageRank,go to the websites [4,11,14].To find out more about search engine principles in general and other ranking algorithms,see [2 and [8.Finally,for an account of some newer approaches to searching the web,see [12]and [13] 2.Developing a formula to rank pages. *THE APPROXIMATE MARKET VALUE OF GOOGLE WHEN THE COMPANY WENT PUBLIC IN 2004. Department of Mathematics,Rose-Hulman Institute of Technology,Terre Haute,IN 47803;email: kurt.bryan@rose-hulman.edu;phone:812)877-8485;fax:(812)877-8883. Mathematics and Computer Science Department,Amherst College,Amherst,MA 01002; email tleise@amherst.edu;phone:(413)542-5411;fax:(413)542-2550. 1

THE $25,000,000,000∗ EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE KURT BRYAN† AND TANYA LEISE‡ Abstract. Google’s success derives in large part from its PageRank algorithm, which ranks the importance of webpages according to an eigenvector of a weighted link matrix. Analysis of the PageRank formula provides a wonderful applied topic for a linear algebra course. Instructors may assign this article as a project to more advanced students, or spend one or two lectures presenting the material with assigned homework from the exercises. This material also complements the discussion of Markov chains in matrix algebra. Maple and Mathematica files supporting this material can be found at www.rose-hulman.edu/∼bryan. Key words. linear algebra, PageRank, eigenvector, stochastic matrix AMS subject classifications. 15-01, 15A18, 15A51 1. Introduction. When Google went online in the late 1990’s, one thing that set it apart from other search engines was that its search result listings always seemed deliver the “good stuff” up front. With other search engines you often had to wade through screen after screen of links to irrelevant web pages that just happened to match the search text. Part of the magic behind Google is its PageRank algorithm, which quantitatively rates the importance of each page on the web, allowing Google to rank the pages and thereby present to the user the more important (and typically most relevant and helpful) pages first. Understanding how to calculate PageRank is essential for anyone designing a web page that they want people to access frequently, since getting listed first in a Google search leads to many people looking at your page. Indeed, due to Google’s prominence as a search engine, its ranking system has had a deep influence on the development and structure of the internet, and on what kinds of information and services get accessed most frequently. Our goal in this paper is to explain one of the core ideas behind how Google calculates web page rankings. This turns out to be a delightful application of standard linear algebra. Search engines such as Google have to do three basic things: 1. Crawl the web and locate all web pages with public access. 2. Index the data from step 1, so that it can be searched efficiently for relevant keywords or phrases. 3. Rate the importance of each page in the database, so that when a user does a search and the subset of pages in the database with the desired information has been found, the more important pages can be presented first. This paper will focus on step 3. In an interconnected web of pages, how can one meaningfully define and quantify the “importance” of any given page? The rated importance of web pages is not the only factor in how links are presented, but it is a significant one. There are also successful ranking algorithms other than PageRank. The interested reader will find a wealth of information about ranking algorithms and search engines, and we list just a few references for getting started (see the extensive bibliography in [9], for example, for a more complete list). For a brief overview of how Google handles the entire process see [6], and for an in-depth treatment of PageRank see [3] and a companion article [9]. Another article with good concrete examples is [5]. For more background on PageRank and explanations of essential principles of web design to maximize a website’s PageRank, go to the websites [4, 11, 14]. To find out more about search engine principles in general and other ranking algorithms, see [2] and [8]. Finally, for an account of some newer approaches to searching the web, see [12] and [13]. 2. Developing a formula to rank pages. ∗THE APPROXIMATE MARKET VALUE OF GOOGLE WHEN THE COMPANY WENT PUBLIC IN 2004. †Department of Mathematics, Rose-Hulman Institute of Technology, Terre Haute, IN 47803; email: kurt.bryan@rose-hulman.edu; phone: 812) 877-8485; fax: (812)877-8883. ‡Mathematics and Computer Science Department, Amherst College, Amherst, MA 01002; email: tleise@amherst.edu; phone: (413)542-5411; fax: (413)542-2550. 1

2 K.BRYAN AND T.LEISE 3 2 FIG.2.1.An erample of a web with only four pages.An arrow from page A to page B indicates a link from page A to page B. 2.l.The basic idea.In what follows we will use the phrase“importance score'”or just“score” for any quantitative rating of a web page's importance.The importance score for any web page will always be a non-negative real number.A core idea in assigning a score to any given web page is that the page's score is derived from the links made to that page from other web pages.The links to a given page are called the backlinks for that page.The web thus becomes a democracy where pages vote for the importance of other pages by linking to them. Suppose the web of interest contains n pages,each page indexed by an integer k,1<k<n.A typical example is illustrated in Figure 2.1,in which an arrow from page A to page B indicates a link from page A to page B.Such a web is an example of a directed graph.I We'll use zk to denote the importance score of page k in the web.The zk is non-negative and z;>zk indicates that page j is more important than page k(so z;=0 indicates page j has the least possible importance score). A very simple approach is to take k as the number of backlinks for page k.In the example in Figure 2.1,we have z1 =2,2 =1,z3 =3,and 4=2,so that page 3 would be the most important, pages 1 and 4 tie for second,and page 2 is least important.A link to page k becomes a vote for page k's importance. This approach ignores an important feature one would expect a ranking algorithm to have, namely,that a link to page k from an important page should boost page k's importance score more than a link from an unimportant page.For example,a link to your homepage directly from Yahoo ought to boost your page's score much more than a link from,say,www.kurtbryan.com(no relation to the author).In the web of Figure 2.1,pages 1 and 4 both have two backlinks:each links to the other,but page 1's second backlink is from the seemingly important page 3,while page 4's second backlink is from the relatively unimportant page 1.As such,perhaps we should rate page 1's importance higher than that of page 4. As a first attempt at incorporating this idea let's compute the score of page j as the sum of the scores of all pages linking to page j.For example,consider the web of Figure 2.1.The score of page 1 would be determined by the relation 1=r3+x4.Since r3 and z4 will depend on i this scheme seems strangely self-referential,but it is the approach we will use,with one more modification.Just as in elections,we don't want a single individual to gain influence merely by casting multiple votes. In the same vein,we seek a scheme in which a web page doesn't gain extra influence simply by linking to lots of other pages.If page j contains nj links,one of which links to page k,then we will boost page k's score by i/nj,rather than by xj.In this scheme each web page gets a total of one vote,weighted by that web page's score,that is evenly divided up among all of its outgoing links.To quantify this for a web of n pages,let LkC{1,2,...,n}denote the set of pages with a link to page k,that is,Lk is the set of page k's backlinks.For each k we require k=tj 防 (2.1) j∈Lk where nj is the number of outgoing links from page j(which must be positive since if j e Lk then 1A graph consists of a set of vertices(in this context,the web pages)and a set of edges.Each edge joins a pair of vertices.The graph is undirected if the edges have no direction.The graph is directed if each edge (in the web context,the links)has a direction,that is,a starting and ending vertex

2 K. BRYAN AND T. LEISE 2 4 1 3 Fig. 2.1. An example of a web with only four pages. An arrow from page A to page B indicates a link from page A to page B. 2.1. The basic idea. In what follows we will use the phrase “importance score” or just “score” for any quantitative rating of a web page’s importance. The importance score for any web page will always be a non-negative real number. A core idea in assigning a score to any given web page is that the page’s score is derived from the links made to that page from other web pages. The links to a given page are called the backlinks for that page. The web thus becomes a democracy where pages vote for the importance of other pages by linking to them. Suppose the web of interest contains n pages, each page indexed by an integer k, 1 ≤ k ≤ n. A typical example is illustrated in Figure 2.1, in which an arrow from page A to page B indicates a link from page A to page B. Such a web is an example of a directed graph.1 We’ll use xk to denote the importance score of page k in the web. The xk is non-negative and xj > xk indicates that page j is more important than page k (so xj = 0 indicates page j has the least possible importance score). A very simple approach is to take xk as the number of backlinks for page k. In the example in Figure 2.1, we have x1 = 2, x2 = 1, x3 = 3, and x4 = 2, so that page 3 would be the most important, pages 1 and 4 tie for second, and page 2 is least important. A link to page k becomes a vote for page k’s importance. This approach ignores an important feature one would expect a ranking algorithm to have, namely, that a link to page k from an important page should boost page k’s importance score more than a link from an unimportant page. For example, a link to your homepage directly from Yahoo ought to boost your page’s score much more than a link from, say, www.kurtbryan.com (no relation to the author). In the web of Figure 2.1, pages 1 and 4 both have two backlinks: each links to the other, but page 1’s second backlink is from the seemingly important page 3, while page 4’s second backlink is from the relatively unimportant page 1. As such, perhaps we should rate page 1’s importance higher than that of page 4. As a first attempt at incorporating this idea let’s compute the score of page j as the sum of the scores of all pages linking to page j. For example, consider the web of Figure 2.1. The score of page 1 would be determined by the relation x1 = x3 + x4. Since x3 and x4 will depend on x1 this scheme seems strangely self-referential, but it is the approach we will use, with one more modification. Just as in elections, we don’t want a single individual to gain influence merely by casting multiple votes. In the same vein, we seek a scheme in which a web page doesn’t gain extra influence simply by linking to lots of other pages. If page j contains nj links, one of which links to page k, then we will boost page k’s score by xj/nj , rather than by xj . In this scheme each web page gets a total of one vote, weighted by that web page’s score, that is evenly divided up among all of its outgoing links. To quantify this for a web of n pages, let Lk ⊂ {1, 2, . . . , n} denote the set of pages with a link to page k, that is, Lk is the set of page k’s backlinks. For each k we require xk = X j∈Lk xj nj , (2.1) where nj is the number of outgoing links from page j (which must be positive since if j ∈ Lk then 1A graph consists of a set of vertices (in this context, the web pages) and a set of edges. Each edge joins a pair of vertices. The graph is undirected if the edges have no direction. The graph is directed if each edge (in the web context, the links) has a direction, that is, a starting and ending vertex

THE S25.000.000.000 EIGENVECTOR 3 2 FIG.2.2.A web of five pages,consisting of two disconnected "subwebs"W1 (pages 1 and 2)and W2 (pages 3, 4,5). page j links to at least page k!).We will assume that a link from a page to itself will not be counted. In this "democracy of the web"you don't get to vote for yourself! Let's apply this approach to the four-page web of Figure 2.1.For page 1 we have x1=z3/1+ x4/2,since pages 3 and 4 are backlinks for page 1 and page 3 contains only one link,while page 4 contains two links (splitting its vote in half).Similarly,x2 =x1/3,3 =1/3+x2/2+x4/2,and x4=z1/3+72/2.These linear equations can be written Ax =x,where x=[r1 z2 3 r4]T and 001 0 (2.2) This transforms the web ranking problem into the "standard"problem of finding an eigenvector for a square matrix!(Recall that the eigenvalues A and eigenvectors x of a matrix A satisfy the equation Ax =Ax,x0 by definition.)We thus seek an eigenvector x with eigenvalue 1 for the matrix A.We will refer to A as the "link matrix"for the given web. It turns out that the link matrix A in equation(2.2)does indeed have eigenvectors with eigen- value 1,namely,all multiples of the vector [12 4 9 6]T (recall that any non-zero multiple of an eigenvector is again an eigenvector).Let's agree to scale these "importance score eigenvectors" so that the components sum to1.in this case we obtain1=号≈0.387，x2=≈0.129， x3-是≈0.290，andx4=7≈0.l94.Note that this ranking differs from that generated by simply counting backlinks.It might seem surprising that page 3,linked to by all other pages,is not the most important.To understand this,note that page 3 links only to page 1 and so casts its entire vote for page 1.This,with the vote of page 2,results in page 1 getting the highest importance score. More generally,the matrix A for any web must have 1 as an eigenvalue if the web in question has no dangling nodes (pages with no outgoing links).To see this,first note that for a general web of n pages formula(2.1)gives rise to a matrix A with Aij =1/nj if page j links to page i,Aij=0 otherwise.The jth column of A then contains nj non-zero entries,each equal to 1/nj,and the column thus sums to 1.This motivates the following definition,used in the study of Markov chains: DEFINITION 2.1.A square matrix is called a column-stochastic matrit if all of its entries are nonnegative and the entries in each column sum to one. The matrix A for a web with no dangling nodes is column-stochastic.We now prove PROPOSITION 1.Every column-stochastic matriz has 1 as an eigenvalue.Proof.Let A be an n x n column-stochastic matrix and let e denote an n dimensional column vector with all entries equal to 1.Recall that A and its transpose AT have the same eigenvalues.Since A is column-stochastic it is easy to see that ATe=e,so that 1 is an eigenvalue for AT and hence for A. In what follows we use V(A)to denote the eigenspace for eigenvalue 1 of a column-stochastic matrix A. 2.2.Shortcomings.Several difficulties arise with using formula(2.1)to rank websites.In this section we discuss two issues:webs with non-unique rankings and webs with dangling nodes. 2.2.1.Non-Unique Rankings.For our rankings it is desirable that the dimension of Vi(A) equal one,so that there is a unique eigenvector x with >iri=1 that we can use for importance scores.This is true in the web of Figure 2.1 and more generally is always true for the special case of

THE $25,000,000,000 EIGENVECTOR 3 5 2 4 1 3 Fig. 2.2. A web of five pages, consisting of two disconnected “subwebs” W1 (pages 1 and 2) and W2 (pages 3, 4, 5). page j links to at least page k!). We will assume that a link from a page to itself will not be counted. In this “democracy of the web” you don’t get to vote for yourself! Let’s apply this approach to the four-page web of Figure 2.1. For page 1 we have x1 = x3/1 + x4/2, since pages 3 and 4 are backlinks for page 1 and page 3 contains only one link, while page 4 contains two links (splitting its vote in half). Similarly, x2 = x1/3, x3 = x1/3 + x2/2 + x4/2, and x4 = x1/3 + x2/2. These linear equations can be written Ax = x, where x = [x1 x2 x3 x4] T and A =     0 0 1 1 2 1 3 0 0 0 1 3 1 2 0 1 2 1 3 1 2 0 0     . (2.2) This transforms the web ranking problem into the “standard” problem of finding an eigenvector for a square matrix! (Recall that the eigenvalues λ and eigenvectors x of a matrix A satisfy the equation Ax = λx, x 6= 0 by definition.) We thus seek an eigenvector x with eigenvalue 1 for the matrix A. We will refer to A as the “link matrix” for the given web. It turns out that the link matrix A in equation (2.2) does indeed have eigenvectors with eigenvalue 1, namely, all multiples of the vector [12 4 9 6]T (recall that any non-zero multiple of an eigenvector is again an eigenvector). Let’s agree to scale these “importance score eigenvectors” so that the components sum to 1. In this case we obtain x1 = 12 31 ≈ 0.387, x2 = 4 31 ≈ 0.129, x3 = 9 31 ≈ 0.290, and x4 = 6 31 ≈ 0.194. Note that this ranking differs from that generated by simply counting backlinks. It might seem surprising that page 3, linked to by all other pages, is not the most important. To understand this, note that page 3 links only to page 1 and so casts its entire vote for page 1. This, with the vote of page 2, results in page 1 getting the highest importance score. More generally, the matrix A for any web must have 1 as an eigenvalue if the web in question has no dangling nodes (pages with no outgoing links). To see this, first note that for a general web of n pages formula (2.1) gives rise to a matrix A with Aij = 1/nj if page j links to page i, Aij = 0 otherwise. The jth column of A then contains nj non-zero entries, each equal to 1/nj , and the column thus sums to 1. This motivates the following definition, used in the study of Markov chains: Definition 2.1. A square matrix is called a column-stochastic matrix if all of its entries are nonnegative and the entries in each column sum to one. The matrix A for a web with no dangling nodes is column-stochastic. We now prove Proposition 1. Every column-stochastic matrix has 1 as an eigenvalue. Proof. Let A be an n×n column-stochastic matrix and let e denote an n dimensional column vector with all entries equal to 1. Recall that A and its transpose AT have the same eigenvalues. Since A is column-stochastic it is easy to see that AT e = e, so that 1 is an eigenvalue for AT and hence for A. In what follows we use V1(A) to denote the eigenspace for eigenvalue 1 of a column-stochastic matrix A. 2.2. Shortcomings. Several difficulties arise with using formula (2.1) to rank websites. In this section we discuss two issues: webs with non-unique rankings and webs with dangling nodes. 2.2.1. Non-Unique Rankings. For our rankings it is desirable that the dimension of V1(A) equal one, so that there is a unique eigenvector x with P i xi = 1 that we can use for importance scores. This is true in the web of Figure 2.1 and more generally is always true for the special case of

4 K.BRYAN AND T.LEISE a strongly connected web(that is,you can get from any page to any other page in a finite number of steps);see Exercise 10 below. Unfortunately,it is not always true that the link matrix A will yield a unique ranking for all webs.Consider the web in Figure 2.2,for which the link matrix is 「01000 10000 0001 0 010 00000 We find here that Vi(A)is two-dimensional;one possible pair of basis vectors isx=[1/2,1/2,0,0,0]T and y=[0,0,1/2,1/2,0]T.But note that any linear combination of these two vectors yields another vector in Vi(A),e.g.,x+iy =[3/8,3/8,1/8,1/8,0]T.It is not clear which,if any,of these eigenvectors we should use for the rankings! It is no coincidence that for the web of Figure 2.2 we find that dim(V(A))>1.It is a consequence of the fact that if a web W,considered as an undirected graph(ignoring which direction each arrows points),consists of r disconnected subwebs Wi,...,Wr,then dim(Vi(A))>r,and hence there is no unique importance score vector xE Vi(A)with >zi=1.This makes intuitive sense:if a web W consists of r disconnected subwebs Wi,...,Wr then one would expect difficulty in finding a common reference frame for comparing the scores of pages in one subweb with those in another subweb. Indeed,it is not hard to see why a web W consisting of r disconnected subwebs forces dim(Vi(A))> r.Suppose a web W has n pages and r component subwebs W1,...,W.Let ni denote the number of pages in Wi.Index the pages in Wi with indices 1 through ni,the pages in W2 with indices n1+1 through n1+n2,the pages in W3 with ni+n2 +1 through n+n2 +n3,etc.In general, let Nnj for i1,with No-0,so Wi contains pages N-1+1 through N.For example, in the web of Figure 2 we can take N=2 and N2=5,so Wi contains pages 1 and 2,W2 contains pages 3,4,and 5.The web in Figure 2.2 is a particular example of the general case,in which the matrix A assumes a block diagonal structure A1 0 0 0 A2 0 A 0 0 00 where Ai denotes the link matrix for Wi.In fact,Wi can be considered as a web in its own right. Each nix ni matrix Ai is column-stochastic,and hence possesses some eigenvector vi ER"with eigenvector 1.For each i between 1 and r construct a vector wiR"which has 0 components for all elements corresponding to blocks other than block i.For example, 0 0 v2 0 0 0 0 Then it is easy to see that the vectors w,1 <i<r,are linearly independent eigenvectors for A

4 K. BRYAN AND T. LEISE a strongly connected web (that is, you can get from any page to any other page in a finite number of steps); see Exercise 10 below. Unfortunately, it is not always true that the link matrix A will yield a unique ranking for all webs. Consider the web in Figure 2.2, for which the link matrix is A =       0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 2 0 0 1 0 1 2 0 0 0 0 0       . We find here that V1(A) is two-dimensional; one possible pair of basis vectors is x = [1/2, 1/2, 0, 0, 0]T and y = [0, 0, 1/2, 1/2, 0]T . But note that any linear combination of these two vectors yields another vector in V1(A), e.g., 3 4 x + 1 4 y = [3/8, 3/8, 1/8, 1/8, 0]T . It is not clear which, if any, of these eigenvectors we should use for the rankings! It is no coincidence that for the web of Figure 2.2 we find that dim(V1(A)) > 1. It is a consequence of the fact that if a web W, considered as an undirected graph (ignoring which direction each arrows points), consists of r disconnected subwebs W1, . . . , Wr, then dim(V1(A)) ≥ r, and hence there is no unique importance score vector x ∈ V1(A) with P i xi = 1. This makes intuitive sense: if a web W consists of r disconnected subwebs W1, . . . , Wr then one would expect difficulty in finding a common reference frame for comparing the scores of pages in one subweb with those in another subweb. Indeed, it is not hard to see why a web W consisting of r disconnected subwebs forces dim(V1(A)) ≥ r. Suppose a web W has n pages and r component subwebs W1, . . . , Wr. Let ni denote the number of pages in Wi . Index the pages in W1 with indices 1 through n1, the pages in W2 with indices n1 + 1 through n1 + n2, the pages in W3 with n1 + n2 + 1 through n1 + n2 + n3, etc. In general, let Ni = Pi j=1 nj for i ≥ 1, with N0 = 0, so Wi contains pages Ni−1 + 1 through Ni . For example, in the web of Figure 2 we can take N1 = 2 and N2 = 5, so W1 contains pages 1 and 2, W2 contains pages 3, 4, and 5. The web in Figure 2.2 is a particular example of the general case, in which the matrix A assumes a block diagonal structure A =      A1 0 . . . 0 0 A2 0 0 0 . . . . . . 0 0 0 0 Ar      , where Ai denotes the link matrix for Wi . In fact, Wi can be considered as a web in its own right. Each ni × ni matrix Ai is column-stochastic, and hence possesses some eigenvector v i ∈ lRni with eigenvector 1. For each i between 1 and r construct a vector wi ∈ lRn which has 0 components for all elements corresponding to blocks other than block i. For example, w1 =   v 1 0 0 . . . 0   , w2 =   0 v 2 0 . . . 0   , . . . Then it is easy to see that the vectors wi , 1 ≤ i ≤ r, are linearly independent eigenvectors for A

THE S25.000.000.000 EIGENVECTOR 5 with eigenvalue 1 because 0 .. 0 Awi=A v 0 0 Thus V(A)has dimension at least r. 2.2.2.Dangling Nodes.Another difficulty may arise when using the matrix A to generate rankings.A web with dangling nodes produces a matrix A which contains one or more columns of all zeros.In this case A is column-substochastic,that is,the column sums of A are all less than or equal to one.Such a matrix must have all eigenvalues less than or equal to 1 in magnitude,but 1 need not actually be an eigenvalue for A.Nevertheless,the pages in a web with dangling nodes can still be ranked use a similar technique.The corresponding substochastic matrix must have a positive eigenvalue A<1 and a corresponding eigenvector x with non-negative entries (called the Perron eigenvector)that can be used to rank the web pages.See Exercise 4 below.We will not further consider the problem of dangling nodes here,however. EXERCISE 1.Suppose the people who own page 3 in the web of Figure 1 are infuriated by the fact that its importance score,computed using formula(2.1),is lower than the score of page 1.In an attempt to boost page 3's score,they create a page 5 that links to page 3;page 3 also links to page 5.Does this boost page 3's score above that of page 1? EXERCISE 2.Construct a web consisting of three or more subwebs and verify that dim(Vi(A)) equals (or exceeds)the number of the components in the web. EXERCISE 3.Add a link from page 5 to page 1 in the web of Figure 2.The resulting web, considered as an undirected graph,is connected.What is the dimension of vi(A)? EXERCISE 4.In the web of Figure 2.1,remove the link from page 3 to page 1.In the resulting web page 3 is now a dangling node.Set up the corresponding substochastic matrix and find its largest positive (Perron)eigenvalue.Find a non-negative Perron eigenvector for this eigenvalue,and scale the vector so that components sum to one.Does the resulting ranking seem reasonable? EXERCISE 5.Prove that in any web the importance score of a page with no backlinks is zero. EXERCISE 6.Implicit in our analysis up to this point is the assertion that the manner in which the pages of a web W are indered has no effect on the importance score assigned to any given page. Prove this,as follows:Let W contains n pages,each page assigned an inder 1 through n,and let A be the resulting link matrir.Suppose we then transpose the indices of pages i and j (so page i is now page j and vice-versa).Let A be the link matrir for the relabelled web. Argue that A=PAP,where P is the elementary matrix obtained by transposing rows i and j of the n x n identity matrix.Note that the operation APA has the effect of swapping rows i and j of A,while A-AP swaps columns i and j.Also,P2 =I,the identity matrir. Suppose that x is an eigenvector for A,so Ax=Xx for some A.Show that y =Px is an eigenvector for A with eigenvalue A. Erplain why this shows that transposing the indices of any two pages leaves the importance scores unchanged,and use this result to argue that any permutation of the page indices leaves the importance scores unchanged. 3.A remedy for dim(V(A))>1.An enormous amount of computing resources are needed to determine an eigenvector for the link matrix corresponding to a web containing billions of pages. It is thus important to know that our algorithm will yield a unique set of sensible web rankings. The analysis above shows that our first attempt to rank web pages leads to difficulties if the web isn't connected.And the worldwide web,treated as an undirected graph,contains many disjoint components;see [9]for some interesting statistics concerning the structure of the web

THE $25,000,000,000 EIGENVECTOR 5 with eigenvalue 1 because Awi = A   0 . . . 0 v i 0 . . . 0   = wi . Thus V1(A) has dimension at least r. 2.2.2. Dangling Nodes. Another difficulty may arise when using the matrix A to generate rankings. A web with dangling nodes produces a matrix A which contains one or more columns of all zeros. In this case A is column-substochastic, that is, the column sums of A are all less than or equal to one. Such a matrix must have all eigenvalues less than or equal to 1 in magnitude, but 1 need not actually be an eigenvalue for A. Nevertheless, the pages in a web with dangling nodes can still be ranked use a similar technique. The corresponding substochastic matrix must have a positive eigenvalue λ ≤ 1 and a corresponding eigenvector x with non-negative entries (called the Perron eigenvector) that can be used to rank the web pages. See Exercise 4 below. We will not further consider the problem of dangling nodes here, however. Exercise 1. Suppose the people who own page 3 in the web of Figure 1 are infuriated by the fact that its importance score, computed using formula (2.1), is lower than the score of page 1. In an attempt to boost page 3’s score, they create a page 5 that links to page 3; page 3 also links to page 5. Does this boost page 3’s score above that of page 1? Exercise 2. Construct a web consisting of three or more subwebs and verify that dim(V1(A)) equals (or exceeds) the number of the components in the web. Exercise 3. Add a link from page 5 to page 1 in the web of Figure 2. The resulting web, considered as an undirected graph, is connected. What is the dimension of V1(A)? Exercise 4. In the web of Figure 2.1, remove the link from page 3 to page 1. In the resulting web page 3 is now a dangling node. Set up the corresponding substochastic matrix and find its largest positive (Perron) eigenvalue. Find a non-negative Perron eigenvector for this eigenvalue, and scale the vector so that components sum to one. Does the resulting ranking seem reasonable? Exercise 5. Prove that in any web the importance score of a page with no backlinks is zero. Exercise 6. Implicit in our analysis up to this point is the assertion that the manner in which the pages of a web W are indexed has no effect on the importance score assigned to any given page. Prove this, as follows: Let W contains n pages, each page assigned an index 1 through n, and let A be the resulting link matrix. Suppose we then transpose the indices of pages i and j (so page i is now page j and vice-versa). Let A˜ be the link matrix for the relabelled web. • Argue that A˜ = PAP, where P is the elementary matrix obtained by transposing rows i and j of the n × n identity matrix. Note that the operation A → PA has the effect of swapping rows i and j of A, while A → AP swaps columns i and j. Also, P2 = I, the identity matrix. • Suppose that x is an eigenvector for A, so Ax = λx for some λ. Show that y = Px is an eigenvector for A˜ with eigenvalue λ. • Explain why this shows that transposing the indices of any two pages leaves the importance scores unchanged, and use this result to argue that any permutation of the page indices leaves the importance scores unchanged. 3. A remedy for dim(V1(A)) > 1. An enormous amount of computing resources are needed to determine an eigenvector for the link matrix corresponding to a web containing billions of pages. It is thus important to know that our algorithm will yield a unique set of sensible web rankings. The analysis above shows that our first attempt to rank web pages leads to difficulties if the web isn’t connected. And the worldwide web, treated as an undirected graph, contains many disjoint components; see [9] for some interesting statistics concerning the structure of the web

点击进入文档下载页（PDF格式）

共11页，试读已结束，阅读完整版请下载

您可能感兴趣的文档

点击购买下载（PDF）

下载及服务说明

购买前请先查看本文档预览页，确认内容后再进行支付；
如遇文件无法下载、无法访问或其它任何问题，可发送电子邮件反馈，核实后将进行文件补发或退款等其它相关操作；
邮箱：

文档浏览记录