Three Essential Techniques for Similar Documents 1.Shingling:convert documents,emails,etc.,to sets. 2.Minhashing convert large sets to short signatures,while preserving similarity. 3.Locality-sensitive hashing focus on pairs of signatures likely to be similar. 12
12 Three Essential Techniques for Similar Documents 1. Shingling : convert documents, emails, etc., to sets. 2. Minhashing : convert large sets to short signatures, while preserving similarity. 3. Locality-sensitive hashing : focus on pairs of signatures likely to be similar. Introduction
The Big Picture Candidate Locality- pairs: Document Shingling Minhash- those pairs sensitive ing Hashing of signatures that we need to test for The set Signatures similarity of strings short integer of length k vectors that that appear represent the in the document sets,and reflect their similarity 13
13 The Big Picture Document The set of strings of length k that appear in the document Signatures : short integer vectors that represent the sets, and reflect their similarity Localitysensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Introduction
Shingles Ak-shingle (or k-gram)for a document is a sequence of k characters that appears in the document. Example:k=2;doc abcab.Set of 2-shingles={ab, bc,ca. -Option:regard shingles as a bag,and count ab twice. Represent a doc by its set of k-shingles. 14
14 Shingles • A k -shingle (or k -gram) for a document is a sequence of k characters that appears in the document. • Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}. – Option: regard shingles as a bag, and count ab twice. • Represent a doc by its set of k-shingles. Shingling
Assumption Documents that have lots of shingles in common have similar text.even if the text appears in different order. Careful:you must pick k large enough,or most documents will have most shingles. -k=5 is OK for short documents:k=10 is better for long documents. 15
15 Assumption • Documents that have lots of shingles in common have similar text, even if the text appears in different order. • Careful: you must pick k large enough, or most documents will have most shingles. – k = 5 is OK for short documents; k = 10 is better for long documents. Shingling
Min-Hashing
Min-Hashing