Making Full Use of Chinese Speech Corpora

Thomas Fang Zheng
Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Tsinghua University
http://sp.cs.tsinghua.edu.cn/
Beijing d-Ear Technologies Co., Ltd.
http://www.d-Ear.com

Oct. 2, 2003
O-COCOSDA, Oct. 1-3, 2003, Sentosa, Singapore
Your Partner in the Century of Speech

Outline
❑ Purpose of speech corpora
❑ Factors to be considered in data creation
❑ Data creation
❑ Data transcription
❑ Learning from corpora
❑ Chinese Corpus Consortium (CCC)
Purpose of Speech Corpora

Item | Description | Percentage
1. Speech/speaker recognition | system development, evaluation, sentence comprehension and summarization, speech recognition, speaker recognition | 73%
2. Speech synthesis | system development, prosodic analysis | 11%
3. Acoustic analysis | acoustic analysis, speech coding | 9%
4. Sentence analysis | syntactic and semantic analysis | 5%
5. Speech/language education | speech and language education | 2%
Factors to be considered in data creation (1)

❑ The language
  ❖ Language: e.g., Chinese or English
  ❖ Dialectal background (e.g., for Chinese):
    ▪ Putonghua or standard Chinese (普通话);
    ▪ Mandarin (官话, Northern China);
    ▪ Wu (吴语, Southern Jiangsu, Zhejiang, and Shanghai);
    ▪ Yue (粤语, Guangdong, Hong Kong, Nanning Guangxi);
    ▪ Min (闽南话, Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan);
    ▪ Hakka (客家话, Meixian Guangdong, Hsin-Chu Taiwan);
    ▪ Xiang (湘, Hunan);
    ▪ Gan (赣, Jiangxi);
    ▪ Hui (徽, Anhui); and
    ▪ Jin (晋, Shanxi).
  ❖ Specific to Chinese:
    ▪ Simplified Chinese
    ▪ Traditional Chinese
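One way to carry the language factors above through a corpus is to record them as per-recording metadata. The following is a minimal sketch only, not a format taken from the talk: the names `LanguageInfo`, `Script`, and `DIALECTS` are hypothetical, and the dialect labels are simply those listed on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Script(Enum):
    """Written form of the transcription (only relevant for Chinese)."""
    SIMPLIFIED = "Simplified Chinese"
    TRADITIONAL = "Traditional Chinese"


# Dialect labels copied from the slide; the tuple name is an assumption.
DIALECTS = (
    "Putonghua", "Mandarin", "Wu", "Yue", "Min",
    "Hakka", "Xiang", "Gan", "Hui", "Jin",
)


@dataclass(frozen=True)
class LanguageInfo:
    """Hypothetical per-recording language metadata."""
    language: str                    # e.g. "Chinese" or "English"
    dialect: Optional[str] = None    # one of DIALECTS when language is Chinese
    script: Optional[Script] = None  # script used in the transcription

    def __post_init__(self) -> None:
        # Reject dialect labels that are not in the slide's list.
        if self.language == "Chinese" and self.dialect not in DIALECTS:
            raise ValueError(f"unknown Chinese dialect label: {self.dialect!r}")


# Example record: a Wu-dialect recording transcribed in Simplified Chinese.
example = LanguageInfo(language="Chinese", dialect="Wu", script=Script.SIMPLIFIED)
```

Keeping the dialect and the script as separate fields reflects the slide's distinction between the spoken variety and the written form used for transcription.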