WHAT WILL WE TALK? 1.Locality-Sensitive Hashing (Shingling+MinHash) 2.Learning to Hash 7
7 1. Locality-Sensitive Hashing (Shingling+ MinHash) 2. Learning to Hash WHAT WILL WE TALK?
Locality-Sensitive Hashing Find Similar Items
Locality-Sensitive Hashing Find Similar Items
Introduction Many Web-mining problems can be expressed as finding "similar"sets: 1.Pages with similar words,e.g.,for classification by topic. 2.NetFlix users with similar tastes in movies,for recommendation systems. 3.Movies with similar sets of fans. 4.Images of related things. 9
9 Many Web-mining problems can be expressed as finding “similar” sets: 1. Pages with similar words, e.g., for classification by topic. 2. NetFlix users with similar tastes in movies, for recommendation systems. 3. Movies with similar sets of fans. 4. Images of related things. Introduction Introduction
CASE STUDY Finding Similar Documents
CASE STUDY Finding Similar Documents
Given a body of documents,e.g.,the Web,find pairs of documents with a lot of text in common, e.g.: -Mirror sites,or approximate mirrors. Application:Don't want to show both in a search -Plagiarism,including large quotations. -Similar news articles at many news sites. Application:Cluster articles by "same story." 11
11 • Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.: – Mirror sites, or approximate mirrors. • Application: Don’t want to show both in a search. – Plagiarism, including large quotations. – Similar news articles at many news sites. • Application: Cluster articles by “same story.” Introduction