Context: Topic Models and Word Embeddings
• Topic Modeling (Blei et al., 2003)
[Figure: topics, documents, and topic proportions and assignments, illustrated on the article "Seeking Life's Bare (Genetic) Necessities". Figure source: Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.]
Context: Topic Models and Word Embeddings
• Word embedding
  – Word2vec (Mikolov et al., 13)
  – GloVe (Pennington et al., 14)
  – Matrix factorization (Deerwester, 90; Levy et al., 15)
  – …
[Figure: word2vec architecture (projection layer, softmax classifier) applied to "the cat sits on the mat"; embedding-space examples of Male-Female, Verb tense, and Country-Capital relations. Source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html]
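The Country-Capital relation in the figure is usually read as vector arithmetic: capital minus country gives roughly the same offset for every pair. A minimal sketch of that idea follows. The 2-D vectors are hand-picked toy values (an assumption for illustration, not trained word2vec or GloVe embeddings), and `analogy` is a hypothetical helper name.

```python
import math

# Hand-picked toy vectors (hypothetical, NOT trained embeddings).
# Axis 0 loosely encodes "which country", axis 1 encodes "country vs. capital".
toy_vectors = {
    "Germany": (1.0, 0.0),
    "Berlin": (1.0, 1.0),
    "Vietnam": (2.0, 0.0),
    "Hanoi": (2.0, 1.0),
    "Italy": (3.0, 0.0),
    "Rome": (3.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Offset trick: d is the word closest to vec(b) - vec(a) + vec(c).
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the three query words, as is common practice for analogy tasks.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("Germany", "Berlin", "Vietnam", toy_vectors))
```

With real embeddings the offsets are only approximately parallel, so the nearest neighbor is not guaranteed to be the expected word.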
What's Missing?
• The semantics of entities and their relations
  – "On Feb 10, 2007, Obama announced his candidacy for President of the United States in front of the Old State Capitol located in …"
  – "Bush portrayed himself as a compassionate conservative, implying he was more suitable than other Republicans to go to lead the United States."
• What can context cover?
  – "New York" vs. "New York Times"
• What cannot?
  – "George Washington" vs. "Washington"
  – Higher order relations
[Figure: two paths linking a pair of documents.
Document -[Contains]- Basketball -[Affiliation In]- NBA -[Affiliation In]- Basketball -[Contains]- Document
Document -[Contains]- Basketball - Olympics - Basketball -[Contains]- Document]
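The higher order relation in the figure can be sketched as path counting: two documents are related when their entities reach the same organization. The example below is a minimal sketch under stated assumptions, not the method presented in these slides. The document names, the "Football" entity, and the helper `count_meta_paths` are hypothetical; only Basketball, NBA, and Olympics come from the figure.

```python
from collections import defaultdict

# Toy network (hypothetical data for illustration only).
# "Contains" edges: document -> entities mentioned in it.
contains = {
    "doc1": {"Basketball"},
    "doc2": {"Basketball"},
    "doc3": {"Football"},
}
# "Affiliation In" edges: entity -> organizations it is affiliated with.
affiliation_in = {
    "Basketball": {"NBA", "Olympics"},
    "Football": {"Olympics"},
}


def count_meta_paths(doc_a, doc_b, contains, affiliation_in):
    """Count Document-Entity-Organization-Entity-Document path instances."""

    def orgs_reached(doc):
        # Number of Document-Entity-Organization half-paths ending at each org.
        counts = defaultdict(int)
        for entity in contains.get(doc, ()):
            for org in affiliation_in.get(entity, ()):
                counts[org] += 1
        return counts

    reach_a = orgs_reached(doc_a)
    reach_b = orgs_reached(doc_b)
    # Full paths meet at a shared organization, so multiply the half-path counts.
    return sum(reach_a[org] * reach_b[org] for org in reach_a if org in reach_b)


print(count_meta_paths("doc1", "doc2", contains, affiliation_in))  # via NBA and Olympics
print(count_meta_paths("doc1", "doc3", contains, affiliation_in))  # via Olympics only
```

Note that doc1 and doc3 share no entity, yet they are still connected through Olympics, which is the kind of relation that word context alone does not expose.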
Outline
• Text Analytics: Motivation
  – Two Challenges
    • Representation
    • Labels
• Text Categorization via HIN
  – HIN construction from texts
  – From HIN similarity to clustering and classification
  – World knowledge indirect supervision
• Conclusions and future work
Acquire Labeled Data
• Expert Annotation
  – Costly
• Crowdsourcing
  – Simple tasks
  – Low quality
  – Still costly
• Semi-supervised / transfer learning
  – Domain dependent
• Many diverse domains
• Fast changing domains
• Only big companies can hire a lot of experts