Probabilistic Latent Semantic Analysis

To appear in: Uncertainty in Artificial Intelligence, UAI'99, Stockholm

Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley & International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach with a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus, possibly without further linguistic prior knowledge.

The main challenge a machine learning system has to address is rooted in the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what was intended" or "what was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemy, i.e., a word may have multiple senses and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar meaning; they may, at least in certain contexts, denote the same concept or, in a weaker sense, refer to the same topic.

Latent Semantic Analysis (LSA) [3] is a well-known technique which partially addresses these questions. The key idea is to map high-dimensional count vectors, such as the ones arising in vector-space representations of text documents [12], to a lower dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to find a data mapping which provides information well beyond the lexical level and reveals semantic relations between the entities of interest. Due to its generality, LSA has proven to be a valuable analysis tool with a wide range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory and incomplete.

This paper presents a statistical view of LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

LSA can in principle be applied to any type of count data over a discrete dyadic domain (cf. [7]).
However, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting for the sake of concreteness. Suppose therefore we are given a collection of text
ments=dI, .,dN) with terms from a vocab- (a) ulary W=wl,.wM). By ignoring the sequen tial order in which words occur in a document. one d may summarize the data in a nxM co-occurrence table of counts N=(n(di, wi)ii, where n(d,w)E N denotes how often the term w occurred in document d. In this particular case, N is also called the term- Figure 1: Graphical mo del representation of the as- document matrix and the rows/columns of n are re pect model in the asy mmetric(a)and symmetric(b. ferred to as document/term vectors, respectively. The parameterization vector-space representation [12] of documents will in many cases preserve most of the relevant informati efined by the mixture e g, for tasks like text retrieval based on keywor ds P(d,w)=P(d)P(wld), P(uld)=>P(wl:)(=1d ).(1) 2.2 Latent Semantic Analy sis by SVD Like virtually all statistical latent variable models the aspect model introduces a conditional independence As mentioned in the introduction the key idea of LSA assumption, namely that d and w are independent con is to map documents(and by symmetry terms)to a ditioned on the state of the asso ci ated latent variable vector space of reduced dimensionality, the latent se-(the corresponding graphical model represent ation is mantic space 3. The mapping is restricted to be lin depicted in Figure 1 Since the cardinality of a is ear and is based on a Singular Value Decomposition smaller than the number of documents/words in the (SVD) of the co-occurrence table. One thus starts ttleneck variable in predict with the standard Svd given by N= u2v, where ing words. It is worth noticing that the model can be U and V are orthogonal matrices U(-VV=I equivalently parameterized by(cf Figure 1(b)) and the diagonal matrix 2 contains the singul ar val ues of N. The LSA approximation of N is computed Pd,)=∑P()P(dl)Pml) (2) y setting all but the largest K singul ar values in 2 to zero(=3), which is rank K optimal in the sense which is perfectly symmetric in both entities, doct of the L2-matrix norm. One obtains the approxima- ments and words tion n=U∑vt≈UΣVt=N. Notice that the document-to-do cument inner products based on this 3.2 Mo del Fitting with the EM Algorithm approximation are given by nN=U2U and hence one might think of the rows of U2 as defining coor- The standard procedure for maximum likelihood es inates for do cuments in the latent space. While the timation in latent variable models is the Expectation original high-dimensional vectors are sp arse, the corre Maximiz ation(EM) algorithm 4. EM al ternate two sponding low-dimensional latent vectors will ty pically coupled steps: (i)an expect ation(E)step where poste- not be sp arse. This implies that it is possible to com- rior probabilities are computed for the latent variables, pute meaningful associ ation values between pairs of (ii) an maximization(M)step, where parameters are documents, even if the documents do not have any updated. Standard cal culations(cf [7, 13]) yield the terms in common. The hope is that terms having a E-step equation mon meaning, in particular synony ms, are rough )P(2 ped to the s ame direction in the latent sp ac ∑2szP(x)P(dl2)P(l) as well as the following M-step formulae 3 Probabilistic lsa Pl)∑m(d,)P(l,m),(4) 3.1 The Aspect Mo del P(d|)∝ The starting point for Probabilistic Latent Semantic Analys isis a statistical mo del which has been called P(x)∝ n(d,)P(zd,u).(6) aspect model [7. 
documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. By ignoring the sequential order in which words occur in a document, one may summarize the data in an N x M co-occurrence table of counts N = (n(d_i, w_j))_{ij}, where n(d, w) \in \mathbb{N} denotes how often the term w occurred in document d. In this particular case, N is also called the term-document matrix, and the rows/columns of N are referred to as document/term vectors, respectively. The key assumption is that the simplified 'bag-of-words' or vector-space representation [12] of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.

2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts with the standard SVD given by N = U \Sigma V^t, where U and V are orthogonal matrices with U^t U = V^t V = I, and the diagonal matrix \Sigma contains the singular values of N. The LSA approximation of N is computed by setting all but the largest K singular values in \Sigma to zero (giving \tilde{\Sigma}), which is rank-K optimal in the sense of the L_2 matrix norm. One obtains the approximation \tilde{N} = U \tilde{\Sigma} V^t \approx U \Sigma V^t = N. Notice that the document-to-document inner products based on this approximation are given by \tilde{N} \tilde{N}^t = U \tilde{\Sigma}^2 U^t, and hence one might think of the rows of U \tilde{\Sigma} as defining coordinates for documents in the latent space. While the original high-dimensional vectors are sparse, the corresponding low-dimensional latent vectors will typically not be sparse. This implies that it is possible to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. The hope is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.

3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z \in Z = {z_1, ..., z_K} with each observation.

Figure 1: Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization. [The asymmetric version (a) uses the parameters P(d), P(z|d), P(w|z); the symmetric version (b) uses P(z), P(d|z), P(w|z).]

A joint probability model over D x W is defined by the mixture

P(d, w) = P(d) P(w|d),   P(w|d) = \sum_{z \in Z} P(w|z) P(z|d).   (1)

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that d and w are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1 (a)). Since the cardinality of z is smaller than the number of documents/words in the collection, z acts as a bottleneck variable in predicting words. It is worth noticing that the model can be equivalently parameterized by (cf. Figure 1 (b))

P(d, w) = \sum_{z \in Z} P(z) P(d|z) P(w|z),   (2)

which is perfectly symmetric in both entities, documents and words.

3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4].
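Before turning to the estimation procedure, the following minimal numpy sketch makes the two parameterizations concrete; the toy dimensions, random parameters, and variable names (P_z, P_d_z, P_w_z) are our own illustration, not part of the paper. It assembles the joint distribution of Eq. (2) and checks numerically that it agrees with the asymmetric form of Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(0)
N_DOCS, N_WORDS, K = 5, 12, 3   # toy sizes, chosen only for illustration

# Randomly initialized model parameters, each normalized to a distribution.
P_z   = rng.random(K);            P_z   /= P_z.sum()             # P(z)
P_d_z = rng.random((N_DOCS, K));  P_d_z /= P_d_z.sum(axis=0)     # P(d|z), columns sum to 1
P_w_z = rng.random((N_WORDS, K)); P_w_z /= P_w_z.sum(axis=0)     # P(w|z), columns sum to 1

# Symmetric parameterization, Eq. (2): P(d,w) = sum_z P(z) P(d|z) P(w|z).
P_dw = np.einsum('k,ik,jk->ij', P_z, P_d_z, P_w_z)               # shape (N_DOCS, N_WORDS)

# Asymmetric parameterization, Eq. (1): P(d,w) = P(d) P(w|d),
# with P(w|d) = sum_z P(w|z) P(z|d) and P(z|d) obtained via Bayes' rule.
P_d   = P_dw.sum(axis=1)                                         # P(d)
P_z_d = (P_d_z * P_z).T / (P_d_z * P_z).sum(axis=1)              # P(z|d), shape (K, N_DOCS)
P_w_d = P_w_z @ P_z_d                                            # P(w|d), shape (N_WORDS, N_DOCS)

assert np.allclose(P_dw, (P_d * P_w_d).T)   # both parameterizations give the same joint
assert np.isclose(P_dw.sum(), 1.0)          # a properly normalized distribution
```

The check succeeds because P(z|d) in Eq. (1) equals P(z)P(d|z)/P(d) by Bayes' rule, so both factorizations define the same joint distribution.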
EM alternates two coupled steps: (i) an expectation (E) step, where posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, where parameters are updated. Standard calculations (cf. [7, 13]) yield the E-step equation

P(z|d, w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z'} P(z') P(d|z') P(w|z')},   (3)

as well as the following M-step formulae:

P(w|z) \propto \sum_{d \in D} n(d, w) P(z|d, w),   (4)

P(d|z) \propto \sum_{w \in W} n(d, w) P(z|d, w),   (5)

P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w) P(z|d, w).   (6)

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.
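The E- and M-steps above translate almost literally into code. The following sketch of a single EM iteration is our own transcription of Eqs. (3)-(6); it materializes the dense (N, M, K) posterior array for clarity, whereas a practical implementation would exploit the sparsity of the count matrix:

```python
import numpy as np

def plsa_em_step(counts, P_z, P_d_z, P_w_z):
    """One EM iteration for the aspect model, Eqs. (3)-(6).

    counts : (N, M) array of n(d, w)
    P_z    : (K,)   current P(z)
    P_d_z  : (N, K) current P(d|z)
    P_w_z  : (M, K) current P(w|z)
    """
    # E-step, Eq. (3): P(z|d,w) proportional to P(z) P(d|z) P(w|z), normalized over z.
    joint = np.einsum('k,ik,jk->ijk', P_z, P_d_z, P_w_z)   # shape (N, M, K)
    post = joint / joint.sum(axis=2, keepdims=True)        # P(z|d,w)

    # M-step, Eqs. (4)-(6): re-estimate parameters from weighted expected counts.
    weighted = counts[:, :, None] * post                   # n(d,w) P(z|d,w)
    P_w_z_new = weighted.sum(axis=0)                       # (M, K), Eq. (4) up to normalization
    P_d_z_new = weighted.sum(axis=1)                       # (N, K), Eq. (5) up to normalization
    P_z_new = weighted.sum(axis=(0, 1))                    # (K,),   Eq. (6) up to normalization

    return (P_z_new / P_z_new.sum(),
            P_d_z_new / P_d_z_new.sum(axis=0),
            P_w_z_new / P_w_z_new.sum(axis=0))
```

Starting from random normalized parameters and iterating this step increases the log-likelihood \sum_{d,w} n(d,w) \log P(d,w) until a local maximum is reached.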
Figure 2: Sketch of the probability sub-simplex spanned by the aspect model. [The figure shows the full probability simplex, the sub-simplex spanned by the factors P(w|z_1), P(w|z_2), P(w|z_3), and the KL-divergence projection of an empirical distribution P(w|d) onto this embedded sub-simplex.]

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(.|z) over the vocabulary, which we call factors. They can be represented as points on the (M-1)-dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a sub-simplex of dimension L <= K-1. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is at most K-1, as opposed to M-1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define matrices \hat{U} = (P(d_i|z_k))_{i,k}, \hat{V} = (P(w_j|z_k))_{j,k}, and \hat{\Sigma} = diag(P(z_k))_k. The joint probability model P can then be written as the matrix product P = \hat{U} \hat{\Sigma} \hat{V}^t. Comparing this with SVD, one can make the following observations: (i) outer products between rows of \hat{U} and \hat{V} reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components in the aspect model, and (iii) the mixing proportions in PLSA substitute for the singular values. The crucial difference between PLSA and LSA, however, is the objective function utilized to determine the optimal decomposition/approximation. In LSA, this is the L_2 or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages; for example, the mixture approximation P of the co-occurrence table is a well-defined probability distribution, and the factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution, and \tilde{N} may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
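To make the representational difference tangible, here is a small numerical sketch (toy random counts; all variable names are ours): the rank-K SVD approximation \tilde{N} is optimal under the Frobenius norm but in general contains negative entries and does not sum to one, whereas the PLSA product \hat{U}\hat{\Sigma}\hat{V}^t is a proper joint distribution for any valid parameter setting.

```python
import numpy as np

rng = np.random.default_rng(1)
N_mat = rng.poisson(0.5, size=(30, 80)).astype(float)   # toy term-document counts
K = 5

# LSA: rank-K truncated SVD, optimal under the L2/Frobenius norm.
U, s, Vt = np.linalg.svd(N_mat, full_matrices=False)
N_tilde = U[:, :K] * s[:K] @ Vt[:K, :]
print("LSA: min entry of N~ =", N_tilde.min())    # typically negative
print("LSA: sum of entries  =", N_tilde.sum())    # not 1, not a distribution

# PLSA: the matrix product of Eq. (2) with any valid (normalized) parameters
# is nonnegative and sums to one.
P_z = np.full(K, 1.0 / K)
P_d_z = rng.dirichlet(np.ones(30), size=K).T      # (30, K), columns sum to 1
P_w_z = rng.dirichlet(np.ones(80), size=K).T      # (80, K), columns sum to 1
P = P_d_z @ np.diag(P_z) @ P_w_z.T
print("PLSA: min entry =", P.min(), " sum =", P.sum())
```

This only illustrates the difference in representation; it does not by itself say which approximation predicts the data better.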
A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method which is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point, which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability to generate the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in either topic factor: (i) 'Segment' refers to an image region in the first and to a phonetic segment in the second factor. (ii) 'Matrix' denotes a rectangular table of numbers and also a material in which something is embedded or enclosed. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum.
(iv) 'Power' is used in the context of radiating objects in astronomy, but also in electrical engineering.

"segment 1": imag, SEGMENT, texture, color, tissue, brain, slice, cluster, mri, volume
"segment 2": speaker, speech, recogni, signal, train, hmm, source, speakerind., SEGMENT, sound
"matrix 1": robust, MATRIX, eigenvalu, uncertainti, plane, linear, condition, perturb, root, suffici
"matrix 2": manufactur, cell, part, MATRIX, cellular, famili, design, machinepart, format, group
"line 1": constraint, LINE, match, locat, imag, geometr, impos, segment, fundament, recogn
"line 2": alpha, redshift, LINE, galaxi, quasar, absorp, high, ssup, densiti, veloc
"power 1": POWER, spectrum, omega, mpc, hsup, larg, redshift, galaxi, standard, model
"power 2": load, memori, vlsi, POWER, systolic, input, complex, arrai, present, implement

Figure 3: Eight selected factors from a 128-factor decomposition. The displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order.

Document 1, P{z_k | d_1, w_j = 'segment'} = (0.951, 0.0001, ...), P{w_j = 'segment' | d_1} = 0.06:
SEGMENT medic imag challeng problem field imag analysi diagnost base proper SEGMENT digit imag SEGMENT medic imag need applic involv estim boundari object classif tissu abnorm shape analysi contour detec textur SEGMENT despit exist techniqu SEGMENT specif medic imag remain crucial problem [...]

Document 2, P{z_k | d_2, w_j = 'segment'} = (0.025, 0.867, ...), P{w_j = 'segment' | d_2} = 0.010:
consid signal origin sequenc sourc specif problem SEGMENT signal relat SEGMENT sourc address issu wide applic field report describ resolu method ergod hidden markov model hmm hmm state correspond signal sourc signal sourc sequenc determin decod procedur viterbi algorithm forward algorithm observ sequenc baumwelch train estim hmm paramet train materi applic multipl signal sourc identif problem experi perform unknown speaker identif [...]

Figure 4: Abstracts of two exemplary documents from the CLUSTER collection along with latent class posterior probabilities P{z | d, w = 'segment'} and word probabilities P{w = 'segment' | d}.

Figure 4 shows the abstracts of two exemplary documents which have been pre-processed by a standard stop-word list and a stemmer. The posterior probabilities for the classes given the different occurrences of 'segment' indicate how likely it is for each of the factors in the first pair of Figure 3 to have generated this observation. We have also displayed the estimates of the conditional word probabilities P{w = 'segment' | d_{1,2}}. One can see that the correct meaning of the word 'segment' is identified in both cases. This implies that although 'segment' occurs frequently in both documents, the overlap in the factored representation is low, since 'segment' is identified as a polysemous word (relative to the chosen resolution level) which, dependent on the context, is explained by different factors.

3.5 Aspects versus Clusters

It is worth comparing the aspect model with statistical clustering models (cf. also [7]). In clustering models for documents, one typically associates a latent class variable with each document in the collection. Most closely related to our approach is the distributional clustering model [10, 7], which can be thought of as an unsupervised version of a naive Bayes' classifier. It can be shown that the conditional word probability of a probabilistic clustering model is given by

P(w|d) = \sum_{z \in Z} P{c(d) = z} P(w|z),   (7)

where P{c(d) = z} is the posterior probability of document d having latent class z.
It is a simple implication of Bayes' rule that these posterior probabilities will concentrate their probability mass on a certain value z with an increasing number of observations (i.e., with the length of the document). This means that although (1) and (7) are algebraically equivalent, they are conceptually very different and yield in fact different results. The aspect model assumes that document-specific distributions are a convex combination of aspects, while the clustering model assumes there is just one cluster-specific distribution which is inherited by all documents in the cluster. (In the distributional clustering model it is only the posterior uncertainty of the cluster assignments that induces some averaging over the class-conditional word distributions P(w|z).) Thus in clustering models the class-conditionals P(w|z) have
to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can very well be used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., it may even assign them zero probability), because these other words can be taken care of by other factors.

3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

So far we have focused on maximum likelihood estimation to fit a model from a given document collection. Although the likelihood, or equivalently the perplexity (i.e., the log-averaged inverse probability on unseen data), is the quantity we believe to be crucial in assessing the quality of a model, one clearly has to distinguish between the performance on the training data and on unseen test data. To derive conditions under which generalization on unseen data can be guaranteed is actually the fundamental problem of statistical learning theory. Here, we propose a generalization of maximum likelihood for mixture models which is known as annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

The starting point of TEM is a derivation of the E-step based on an optimization principle. As has been pointed out in [9], the EM procedure in latent variable models can be obtained by minimizing a common objective function, the (Helmholtz) free energy, which for the aspect model is given by

F_\beta = -\beta \sum_{d,w} n(d, w) \sum_z \tilde{P}(z; d, w) \log P(d, w|z) P(z) + \sum_{d,w} n(d, w) \sum_z \tilde{P}(z; d, w) \log \tilde{P}(z; d, w).   (8)

Here \tilde{P}(z; d, w) are variational parameters which define a conditional distribution over Z, and \beta is a parameter which, in analogy to physical systems, is called the inverse computational temperature. Notice that the first contribution in (8) is the negative expected log-likelihood scaled by \beta. Thus in the case of \tilde{P}(z; d, w) = P(z|d, w), minimizing F with respect to the parameters defining P(d, w|z) P(z) amounts to the standard M-step in EM. In fact, it is straightforward to verify that the posteriors are obtained by minimizing F with respect to \tilde{P} at \beta = 1. In general, \tilde{P} is determined by

\tilde{P}(z; d, w) = \frac{[P(z) P(d|z) P(w|z)]^\beta}{\sum_{z'} [P(z') P(d|z') P(w|z')]^\beta}.   (9)

This shows that the effect of the entropy at \beta < 1 is to dampen the posterior probabilities such that they will become closer to the uniform distribution with decreasing \beta.

Figure 5: Perplexity results as a function of the latent space dimensionality for (a) the MED data (rank 1033) and (b) the LOB data (rank 1674). Plotted results are for LSA (dashed-dotted curve) and PLSA (trained by TEM = solid curve, trained by early stopping EM = dotted curve). The upper baseline is the unigram model corresponding to marginal independence. The star at the right end of the PLSA curve denotes the perplexity of the largest trained aspect models (K = 2048).

Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases \beta until the performance on held-out data deteriorates.
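In implementation terms, TEM changes only the E-step: the numerator terms of Eq. (3) are raised to the power \beta before renormalizing, exactly as in Eq. (9). A minimal sketch (function and variable names are ours), reusing the array layout of the EM sketch above:

```python
import numpy as np

def tempered_posteriors(P_z, P_d_z, P_w_z, beta):
    """Tempered E-step, Eq. (9): P~(z; d, w) proportional to [P(z) P(d|z) P(w|z)]^beta.

    beta = 1 recovers the standard posteriors of Eq. (3); beta < 1 dampens them
    towards the uniform distribution over the K latent classes.
    """
    joint = np.einsum('k,ik,jk->ijk', P_z, P_d_z, P_w_z) ** beta   # shape (N, M, K)
    return joint / joint.sum(axis=2, keepdims=True)
```

The M-step formulae (4)-(6) are unchanged; only the posteriors fed into them are replaced by these tempered ones, with \beta lowered according to the schedule given below.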
Compared to annealing, this inverse strategy may accelerate the model fitting procedure significantly (e.g., by a factor of roughly 10-50), and we have not found the test set performance of 'heated' models to be worse than the one achieved by carefully 'annealed' models. The TEM algorithm can thus be implemented in the following way:

1. Set \beta = 1 and perform EM with early stopping.
2. Decrease \beta (\beta \leftarrow \eta \beta with \eta < 1) and perform one TEM iteration.
3. As long as the performance on held-out data improves (non-negligibly), continue TEM iterations at this value of \beta; otherwise goto step 2.
4. Perform stopping on \beta, i.e., stop when decreasing \beta does not yield further improvements.

4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity minimization for a document-specific unigram model and noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA