Challenges in text mining
• Data collection is “free text”
  – Data is not well-organized
    • Semi-structured or unstructured
  – Natural language text contains ambiguities on many levels
    • Lexical, syntactic, semantic, and pragmatic
  – Learning techniques for processing text typically need annotated training examples
    • Expensive to acquire at scale
• What to mine?
CS@UVa CS6501: Text Mining
Text mining problems we will solve
• Document categorization
  – Adding structures to the text corpus
[Figure: documents, classification, and a taxonomy; labels include Economy, international, Stock Exchange, USA, Asia]
Text mining problems we will solve
• Text clustering
  – Identifying structures in the text corpus
[Figure: cluster map; labels include Molecular, Cellular, Neural, Systems, Biology, Behavior, Health Services; legend entries include NCCAM, NCMHD, NIGMS, NLM]
Text mining problems we will solve
• Topic modeling
  – Identifying structures in the text corpus
[Figure: topics, documents, and topic proportions and assignments, illustrated on the article "Seeking Life's Bare (Genetic) Necessities"]
Text Document Representation & Language Model
1. How to represent a document?
  – Make it computable
2. How to infer the relationship among documents or identify the structure within a document?
  – Knowledge discovery
3. Language Model and N-Grams
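To make items 1 and 3 concrete, here is a minimal Python sketch of one common way to make a document "computable" (a bag-of-words count vector) and of a bigram estimate, the simplest non-trivial N-gram language model. This is an illustration only, not necessarily the representation the course adopts. The whitespace tokenizer, the helper names `tokenize`, `bag_of_words`, and `bigram_probability`, and the sample sentence are all assumptions introduced here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(tokenize(text))


def bigram_probability(tokens, prev, word):
    """Maximum-likelihood estimate of P(word | prev) from bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Number of bigrams whose first word is `prev`.
    prev_count = sum(c for (first, _), c in bigrams.items() if first == prev)
    return bigrams[(prev, word)] / prev_count if prev_count else 0.0


# Hypothetical sample document, used only for illustration.
doc = "text mining is fun and text mining is useful"
tokens = tokenize(doc)
bow = bag_of_words(doc)

print(bow["text"])                                  # count of "text" in the document
print(bigram_probability(tokens, "text", "mining"))  # P(mining | text)
print(bigram_probability(tokens, "is", "fun"))       # P(fun | is)
```

The count vector discards word order, which is exactly the information the bigram estimate partially recovers. Real systems would typically add better tokenization and smoothing for unseen bigrams, both omitted here.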