Language Models
Xidian University, School of Computer Science and Technology · Yueshen Xu · 3/15/2020
▣ Unigram Language Model == Zero-order Markov Chain
  $p(\mathbf{w} \mid M) = \prod_{w_i \in s} p(w_i \mid M)$
  ◆ Bag of Words (BoW), one-hot encoding
▣ Bigram Language Model == First-order Markov Chain
  $p(\mathbf{w} \mid M) = \prod_{w_i \in s} p(w_i \mid w_{i-1}, M)$
▣ N-gram Language Model == (N-1)-order Markov Chain
[Figure: plate diagram with node w and plates N and M]
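The unigram and bigram factorizations can be made concrete with a minimal sketch, assuming maximum-likelihood estimates from a toy corpus. The corpus, the `<s>` start token, and all function names are illustrative assumptions and do not come from the slides. The sketch shows that the unigram score ignores word order (the bag-of-words view), while the bigram score depends on it.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Estimate p(w | M) as relative word frequency over the corpus."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def train_bigram(sentences):
    """Estimate p(w_i | w_{i-1}, M) from adjacent word pairs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for s in sentences:
        padded = ["<s>"] + s  # assumed start-of-sentence token
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}

def unigram_prob(sentence, model):
    """Zero-order Markov chain: product of independent word probabilities."""
    return math.prod(model.get(w, 0.0) for w in sentence)

def bigram_prob(sentence, model):
    """First-order Markov chain: each word is conditioned on the previous one."""
    padded = ["<s>"] + sentence
    return math.prod(model.get(pair, 0.0) for pair in zip(padded, padded[1:]))

# Toy corpus (hypothetical data for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni = train_unigram(corpus)
bi = train_bigram(corpus)

# Same words, different order.
p_uni_a = unigram_prob(["the", "cat", "sat"], uni)
p_uni_b = unigram_prob(["sat", "cat", "the"], uni)
p_bi_a = bigram_prob(["the", "cat", "sat"], bi)
p_bi_b = bigram_prob(["sat", "cat", "the"], bi)
```

No smoothing is applied here, so any unseen word or word pair drives the whole product to zero. That is why the reordered sentence gets a bigram probability of zero while its unigram probability is unchanged.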
Language Models
▣ Mixture-unigram Language Model
  ■ Mixture language model
  $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
  ■ Recall the Gaussian mixture distribution
[Figure: plate diagram with nodes z and w, plates N and M]
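The mixture formula can be evaluated directly once the component probabilities are known. The following is a minimal sketch, assuming two hand-set components; the component names, the probability tables, and the function names are hypothetical and chosen only for illustration. As in a Gaussian mixture, a single latent component z is drawn per document, and every word of that document is then generated from that component.

```python
import math

# Hypothetical parameters: prior p(z) and per-component word distributions p(w | z).
p_z = {"sports": 0.5, "finance": 0.5}
p_w_given_z = {
    "sports": {"game": 0.5, "team": 0.4, "bank": 0.1},
    "finance": {"game": 0.1, "team": 0.1, "bank": 0.8},
}

def mixture_unigram_prob(words, p_z, p_w_given_z):
    """p(w) = sum_z p(z) * prod_n p(w_n | z)."""
    total = 0.0
    for z, prior in p_z.items():
        likelihood = math.prod(p_w_given_z[z].get(w, 0.0) for w in words)
        total += prior * likelihood
    return total

def component_posterior(words, p_z, p_w_given_z):
    """p(z | w) by Bayes' rule, the analogue of a GMM responsibility."""
    joint = {
        z: prior * math.prod(p_w_given_z[z].get(w, 0.0) for w in words)
        for z, prior in p_z.items()
    }
    evidence = sum(joint.values())
    return {z: j / evidence for z, j in joint.items()}

doc = ["game", "team"]
p_doc = mixture_unigram_prob(doc, p_z, p_w_given_z)
post = component_posterior(doc, p_z, p_w_given_z)
```

For this document the first component contributes 0.5 × 0.5 × 0.4 and the second 0.5 × 0.1 × 0.1, so almost all of the posterior mass falls on the first component.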
TF-IDF
▣ TF: Term Frequency
  $\mathrm{tf}_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$
  ◆ Term i, document j; $n_{ij}$ is the count of i in j
▣ IDF: Inverse Document Frequency
  $\mathrm{idf}_i = \log \frac{N}{1 + |\{d \in D : t_i \in d\}|}$
  ◆ N documents in the corpus
▣ TF-IDF
  $\text{tf-idf}(t_i, d_j, D) = \mathrm{tf}_{ij} \times \mathrm{idf}_i$
  ◆ $\mathrm{tf}_{ij}$: How important … in this document
  ◆ $\mathrm{idf}_i$: How important … in this corpus
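The three formulas translate almost line for line into code. This is a minimal sketch, assuming documents are already tokenized into word lists; the toy corpus and the function names are illustrative assumptions, and the natural logarithm is assumed since the slide does not specify a base.

```python
import math
from collections import Counter

def tf(term, doc):
    """tf_ij = n_ij / sum_k n_kj: count of the term over the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """idf_i = log(N / (1 + number of documents containing the term))."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_containing))

def tf_idf(term, doc, corpus):
    """tf-idf(t_i, d_j, D) = tf_ij * idf_i."""
    return tf(term, doc) * idf(term, corpus)

# Toy corpus (hypothetical data for illustration only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
    ["stocks", "fell"],
]

score_the = tf_idf("the", corpus[0], corpus)        # frequent across the corpus
score_cat = tf_idf("cat", corpus[0], corpus)        # appears in 2 of 4 documents
score_stocks = tf_idf("stocks", corpus[3], corpus)  # appears in 1 of 4 documents
```

With the 1 in the denominator, a term that occurs in N - 1 documents gets an IDF of exactly zero, and a term that occurs in every document gets a slightly negative IDF. Implementations that need non-negative weights usually adjust the formula.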
Latent Semantic Analysis and Matrix Factorization
▣ Shortcomings of the Vector Space Model (VSM)
  ■ VSM: Word - Document
  ■ LSI: Word - Concept - Document
  ■ Concept (also: Aspect, Topic, Latent Factor)
Latent Semantic Analysis and Matrix Factorization
▣ Challenges
  ■ Compare documents in the same concept space
  ■ Cross-language comparison
  ■ Synonyms / near-synonyms, e.g. buy - purchase, user - consumer
  ■ Polysemous words, e.g. book - book, draw - draw
▣ Core idea
  ■ Dimensionality reduction of the word-document co-occurrence matrix
  ■ Construct a latent semantic space
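One common way to realize this core idea is a truncated SVD of the word-document matrix, which is the approach usually referred to as LSI. The following is a minimal sketch under that assumption, using NumPy; the five-term vocabulary, the four toy documents, and the choice of k = 2 are hypothetical. It reuses the buy - purchase synonym pair to show that two documents with different surface words can coincide in the latent space.

```python
import numpy as np

# Toy word-document count matrix (rows: terms, columns: documents).
terms = ["buy", "purchase", "price", "draw", "paint"]
A = np.array([
    [1, 0, 0, 0],  # buy
    [0, 1, 0, 0],  # purchase
    [1, 1, 0, 0],  # price
    [0, 0, 1, 0],  # draw
    [0, 0, 1, 1],  # paint
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Truncated SVD: keep the k largest singular directions as the concept space.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

# Project documents (and later the query) into the k-dimensional latent space.
docs_latent = U_k.T @ A  # shape (k, n_docs)

# Documents 0 and 1 share only "price" in the raw space.
raw_sim = cosine(A[:, 0], A[:, 1])
latent_sim = cosine(docs_latent[:, 0], docs_latent[:, 1])

# A one-word query "buy" has no term in common with document 1 ("purchase", "price").
query = np.zeros(len(terms))
query[terms.index("buy")] = 1.0
query_latent = U_k.T @ query
raw_query_sim = cosine(query, A[:, 1])
latent_query_sim = cosine(query_latent, docs_latent[:, 1])
```

In the raw vector space the query and document 1 are orthogonal, because "buy" and "purchase" are distinct dimensions. After projection both load on the same latent direction, since the two words co-occur with "price". The signs of singular vectors are arbitrary, so only similarities between projected vectors are meaningful, not the individual coordinates.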