How to represent a document
• Represent by a string?
  – No semantic meaning
• Represent by a list of sentences?
  – Sentence is just like a short document (recursive definition)
[Example document shown on the slide: the opening of the Wikipedia article "University of Virginia"]
CS@UVa CS 6501: Text Mining
Vector space model
• Represent documents by concept vectors
  – Each concept defines one dimension
  – k concepts define a high-dimensional space
  – Element of vector corresponds to concept weight
    • E.g., d = (x_1, …, x_k), where x_i is the “importance” of concept i in d
• Distance between the vectors in this concept space
  – Relationship among documents
An illustration of VS model
• All documents are projected into this concept space
[Figure: documents D1 to D5 plotted in a three-dimensional concept space with axes Sports, Education, and Finance; the distance between D2 and D4 is marked as |D2 - D4|]
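A minimal sketch of the illustration, assuming the three concept axes named on the slide. The slide gives no coordinates, so every weight below is made up, and Euclidean distance is only one possible reading of |D2 - D4|.

```python
import math

# Concept axes from the illustration; their order fixes the vector dimensions.
CONCEPTS = ("Sports", "Education", "Finance")

# Hypothetical concept weights (not from the slide).
docs = {
    "D1": (0.9, 0.1, 0.0),
    "D2": (0.8, 0.2, 0.1),
    "D3": (0.1, 0.9, 0.2),
    "D4": (0.0, 0.3, 0.9),
    "D5": (0.2, 0.8, 0.1),
}

def euclidean(u, v):
    """Length of the difference vector |u - v|."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(name):
    """Name of the document closest to `name` under Euclidean distance."""
    others = (n for n in docs if n != name)
    return min(others, key=lambda n: euclidean(docs[name], docs[n]))

print("|D2 - D4| =", round(euclidean(docs["D2"], docs["D4"]), 3))
print("closest to D2:", nearest("D2"))
```

With these invented weights, D2 (mostly Sports) lands far from D4 (mostly Finance) and close to D1, which is the kind of relationship among documents the distance is meant to capture.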
What the VS model doesn’t say
• How to define/select the “basic concept”
  – Concepts are assumed to be orthogonal
• How to assign weights
  – Weights indicate how well the concept characterizes the document
• How to define the distance metric
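To see why the choice of distance metric matters, here is a small sketch comparing two common options, Euclidean distance and cosine distance. Neither is prescribed by the slide; the vectors and their names (`query`, `long_doc`, `short_doc`) are hypothetical.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine of the angle between u and v (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical weights over two concepts.
query = (1.0, 1.0)
candidates = {
    "long_doc": (10.0, 10.0),  # same direction as query, much larger weights
    "short_doc": (1.0, 0.0),   # different direction, similar magnitude
}

for label, metric in (("euclidean", euclidean), ("cosine", cosine_distance)):
    closest = min(candidates, key=lambda name: metric(query, candidates[name]))
    print(label, "->", closest)
```

The two metrics pick different nearest documents here: Euclidean distance is sensitive to vector length, while cosine distance looks only at direction. The model itself does not say which behavior is wanted.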
What is a good “Basic Concept”?
• Orthogonal
  – Linearly independent basis vectors
• “Non-overlapping” in meaning
  – No ambiguity
• Weights can be assigned automatically and accurately
• Existing solutions
  – Terms or N-grams, a.k.a., Bag-of-Words
  – Topics (we will come back to this later)
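The Bag-of-Words option can be sketched as follows, with each term or N-gram acting as one dimension and raw counts as weights. Whitespace tokenization, raw counts, and the helper names (`ngrams`, `bag_of_words`, `to_vector`) are simplifying assumptions for illustration, not something the slides specify.

```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, max_n=1):
    """Count all 1-grams up to max_n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def to_vector(counts, vocabulary):
    """One count per vocabulary entry; missing entries count as 0."""
    return [counts[term] for term in vocabulary]

# Toy documents (made up for this example).
docs = ["text mining is fun", "mining text is work"]
bags = [bag_of_words(d, max_n=2) for d in docs]

# Each distinct term or bigram becomes one dimension of the space.
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

for term, x0, x1 in zip(vocabulary, *vectors):
    print(f"{term!r:15} {x0} {x1}")
```

The counts are assigned automatically, which matches one of the criteria above. On the other hand, terms such as synonyms overlap in meaning, so these dimensions are not truly orthogonal as concepts. Adding bigrams lets the representation tell "text mining" apart from "mining text", at the cost of a larger vocabulary.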