Mining Massive Datasets
Lecture 7: Link Analysis
Wu-Jun Li
Department of Computer Science and Engineering
Shanghai Jiao Tong University

Link Analysis Algorithms
▪ PageRank
▪ Hubs and Authorities
▪ Topic-Sensitive PageRank
▪ Spam Detection Algorithms
▪ Other interesting topics we won’t cover
  ▪ Detecting duplicates and mirrors
  ▪ Mining for communities (community detection)
    (Refer to Chapter 10 of the textbook)

Outline
▪ PageRank
▪ Topic-Sensitive PageRank
▪ Hubs and Authorities
▪ Spam Detection

PageRank
Ranking web pages
▪ Web pages are not equally “important”
  ▪ www.joe-schmoe.com vs. www.stanford.edu
▪ Inlinks as votes
  ▪ www.stanford.edu has 23,400 inlinks
  ▪ www.joe-schmoe.com has 1 inlink
▪ Are all inlinks equal?
  ▪ Recursive question!

PageRank
Simple recursive formulation
▪ Each link’s vote is proportional to the importance of its source page
▪ If page P with importance x has n outlinks, each link gets x/n votes
▪ Page P’s own importance is the sum of the votes on its inlinks
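
The voting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page web (A, B, C), the function name `vote_round`, the uniform starting importance, and the choice to simply repeat the voting step 100 times are all hypothetical and not taken from the slides. Each page splits its importance x evenly over its n outlinks, and each page's new importance is the sum of the votes arriving on its inlinks.

```python
from collections import defaultdict


def vote_round(outlinks, importance):
    """Apply the voting rule once: a page with importance x and n outlinks
    sends x/n along each outlink; a page's new importance is the sum of
    the votes it receives on its inlinks."""
    received = defaultdict(float)
    for page, targets in outlinks.items():
        if not targets:
            continue  # assumption: a page without outlinks casts no votes
        share = importance[page] / len(targets)
        for target in targets:
            received[target] += share
    return {page: received[page] for page in outlinks}


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# Assumption: start with equal importance and repeat the vote a fixed number of times.
importance = {page: 1 / len(outlinks) for page in outlinks}
for _ in range(100):
    importance = vote_round(outlinks, importance)

print({page: round(value, 3) for page, value in importance.items()})
# prints {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

On this small graph the values settle at A = 0.4, B = 0.2, C = 0.4, which satisfy the recursive definition: B receives half of A's importance, C receives the other half plus all of B's, and A receives all of C's. Repeating the vote happens to settle here because every page can reach every other page; other graphs may behave differently.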