Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of the term-document matrix (typical rank 100-300)
• General idea
– Map documents (and terms) to a low-dimensional representation.
– Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in this latent semantic space.
CCF-ADL at Zhengzhou University, June 25-27, 2010
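The low-rank mapping can be sketched with a truncated SVD in NumPy. This is only a minimal illustration: the toy matrix, the helper name `lsi_decompose`, and the choice k = 2 are assumptions made for the example (real collections would use a rank in the 100-300 range mentioned on the slide).

```python
import numpy as np

def lsi_decompose(A, k):
    """Rank-k truncated SVD of a term-document matrix A (terms x documents)."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return T[:, :k], s[:k], Dt[:k, :].T   # T_k, singular values, D_k

# Toy term-document counts (rows: terms, columns: documents); made-up data.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

k = 2
T_k, s_k, D_k = lsi_decompose(A, k)
A_k = T_k @ np.diag(s_k) @ D_k.T          # rank-k approximation of A

# Documents in the latent space: one k-dimensional row per document.
doc_vecs = D_k * s_k                      # same as D_k @ diag(s_k)
doc_sim = doc_vecs @ doc_vecs.T           # inner products between documents
```

Here `doc_sim[i, j]` is the inner product between documents i and j in the latent space, which is the similarity the slide describes.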
What it is
• From the original term-document matrix $A_r$, we compute its approximation $A_k = T \Sigma D^T$.
• In $A_k$, each row corresponds to a term and each column corresponds to a document.
• The difference is that documents now live in a new space whose dimension is $k \ll r$.
• How do we compare two terms? (using $D^T D = I$)
$A_k A_k^T = T \Sigma D^T D \Sigma^T T^T = (T\Sigma)(T\Sigma)^T$
• How do we compare two documents? (using $T^T T = I$)
$A_k^T A_k = D \Sigma^T T^T T \Sigma D^T = (\Sigma D^T)^T (\Sigma D^T)$
• How do we compare a term and a document?
$A_k[i, j]$
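The three comparisons can be checked numerically. The sketch below assumes a random 6 x 5 matrix and k = 3 purely for illustration; the variable names (`term_vecs`, `doc_vecs`, and so on) are hypothetical and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))                    # hypothetical 6 terms x 5 documents
k = 3

T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T
A_k = T_k @ S_k @ D_k.T                   # rank-k approximation

term_vecs = T_k @ S_k                     # rows: terms in the latent space
doc_vecs = D_k @ S_k                      # rows: documents, i.e. (S_k D_k^T)^T

term_term = term_vecs @ term_vecs.T       # equals A_k A_k^T
doc_doc = doc_vecs @ doc_vecs.T           # equals A_k^T A_k
term_doc = A_k                            # entry [i, j] compares term i with document j

print(np.allclose(term_term, A_k @ A_k.T))
print(np.allclose(doc_doc, A_k.T @ A_k))
```

Both checks rely on the orthonormal columns of the two singular-vector matrices, which is why the products collapse to k-dimensional inner products.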
LSI Term matrix T
• T matrix
– Holds each term's vector in the LSI space.
– In the original matrix, term vectors are d-dimensional; in T they are much smaller.
– Each dimension is a group of terms that tend to "co-occur" with this term in the same documents
• synonyms, contextually-related words, variant endings
– $T\Sigma$ is used to compute term similarity.
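As a small illustration of term similarity in the latent space, the sketch below compares rows of the scaled term matrix on made-up data. The vocabulary, the counts, k = 2, and the helper `term_cosine` are all assumptions for the example; cosine similarity is used here as a normalized variant of the inner product, not as something the slides prescribe.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Made-up term-document counts: "car" and "automobile" never share a document,
# but both co-occur with "engine".
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

k = 2
T, s, _ = np.linalg.svd(A, full_matrices=False)
term_vecs = T[:, :k] * s[:k]              # rows of T_k Sigma_k

def term_cosine(a, b):
    """Cosine similarity between two terms in the k-dimensional latent space."""
    u, v = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw_overlap = float(A[0] @ A[1])          # car vs. automobile in the original space
sim_synonyms = term_cosine("car", "automobile")
sim_unrelated = term_cosine("car", "flower")
```

In this toy setup the raw rows for "car" and "automobile" are orthogonal, yet their latent vectors point in essentially the same direction because both co-occur with "engine", while "car" and "flower" stay unrelated.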