Knowledge Management with Documents
Qiang Yang, HKUST
Thanks: Professor Dik Lee, HKUST
Keyword Extraction
◼ Goal:
  ◼ given N documents, each consisting of words,
  ◼ extract the most significant subset of words → keywords
◼ Example
  ◼ [All the students are taking exams] → [student, take, exam]
◼ Keyword Extraction Process
  ◼ remove stop words
  ◼ stem remaining terms
  ◼ collapse terms using thesaurus
  ◼ build inverted index
  ◼ extract key words - build key word index
  ◼ extract key phrases - build key phrase index
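The first steps of the process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is the sample list from these slides plus "all" (added so the example works), `toy_stem` is a hypothetical stand-in for a real stemmer, and the thesaurus and key phrase steps are left out.

```python
import re
from collections import defaultdict

# Assumed stop list: the sample entries from the slides plus "all".
STOP_WORDS = {"a", "about", "again", "all", "are", "the", "to", "of"}
VOWELS = set("aeiou")


def toy_stem(word):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Restore a dropped final "e" on a three-letter
        # consonant-vowel-consonant stem (tak -> take).
        if (len(stem) == 3 and stem[0] not in VOWELS and stem[1] in VOWELS
                and stem[2] not in VOWELS and stem[2] not in "wxy"):
            stem += "e"
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def extract_terms(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in extract_terms(text):
            index[term].add(doc_id)
    return index


print(extract_terms("All the students are taking exams"))
```

A production system would replace `toy_stem` with an established stemming algorithm; the toy rules here are only enough to reproduce the slide's example.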
Stop Words and Stemming
◼ From a given Stop Word List
  ◼ [a, about, again, are, the, to, of, …]
  ◼ Remove them from the documents
◼ Or, determine stop words
  ◼ Given a large enough corpus of common English
  ◼ Sort the list of words in decreasing order of their occurrence frequency in the corpus
  ◼ Zipf's law: Frequency * rank ≈ constant
    ◼ most frequent words tend to be short
    ◼ most frequent 20% of words account for 60% of usage
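When no stop list is given, the frequency-sorting idea above might look like the following sketch. The function names and the `top_k` cutoff are assumptions made for illustration; the slides do not say how many of the top-ranked words to treat as stop words.

```python
import re
from collections import Counter


def frequency_ranked(corpus_text):
    """Return (word, count) pairs in decreasing order of frequency."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    return counts.most_common()


def candidate_stop_words(corpus_text, top_k):
    """Take the top_k most frequent words as stop word candidates (top_k is assumed)."""
    return [word for word, _ in frequency_ranked(corpus_text)[:top_k]]


sample = "the cat and the dog and the bird"
print(candidate_stop_words(sample, 2))
```

On a realistic corpus the candidates would still need a manual review, since a frequent word is not always a meaningless one.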
Zipf's Law: An illustration

Rank (R)   Term   Frequency (F)   R*F / 10**6
1          the    69,971          0.070
2          of     36,411          0.073
3          and    28,852          0.086
4          to     26,149          0.104
5          a      23,237          0.116
6          in     21,341          0.128
7          that   10,595          0.074
8          is     10,009          0.081
9          was    9,816           0.088
10         he     9,543           0.095
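The last column can be recomputed from the rank and frequency columns. The sketch below copies the ten rows as they appear in the table; the name `scaled_products` is hypothetical, and results may differ from the printed column in the third decimal place depending on rounding.

```python
# (rank, term, frequency) rows copied from the table above.
ROWS = [
    (1, "the", 69971),
    (2, "of", 36411),
    (3, "and", 28852),
    (4, "to", 26149),
    (5, "a", 23237),
    (6, "in", 21341),
    (7, "that", 10595),
    (8, "is", 10009),
    (9, "was", 9816),
    (10, "he", 9543),
]


def scaled_products(rows):
    """Return (term, R*F / 10**6) for each row."""
    return [(term, rank * freq / 10**6) for rank, term, freq in rows]


for term, product in scaled_products(ROWS):
    print(f"{term:>5} {product:.3f}")
```

The products stay within roughly a factor of two of each other while the frequencies themselves span a factor of about seven, which is the sense in which Frequency * rank is approximately constant.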
Resolving Power of Word
[Figure: presumed resolving power of significant words, plotted over words in decreasing frequency order. Labeled regions: non-significant high-frequency terms, non-significant low-frequency terms.]
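One way to act on this picture is to keep only the terms whose frequency falls between an upper and a lower cutoff. The sketch below assumes both cutoffs are chosen by hand; `significant_terms`, `low_cutoff`, and `high_cutoff` are hypothetical names, and the slides do not prescribe specific values.

```python
from collections import Counter


def significant_terms(counts, low_cutoff, high_cutoff):
    """Keep terms whose frequency lies within [low_cutoff, high_cutoff].

    Terms above high_cutoff are treated as non-significant high-frequency
    terms, and terms below low_cutoff as non-significant low-frequency terms.
    """
    ranked = counts.most_common()  # decreasing frequency order
    return [term for term, freq in ranked if low_cutoff <= freq <= high_cutoff]


counts = Counter({"the": 100, "index": 10, "zipf": 8, "xylophone": 1})
print(significant_terms(counts, low_cutoff=2, high_cutoff=50))
```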