Mandarin Pronunciation Variation Modeling
Thomas Fang Zheng
Center of Speech Technology, State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University
fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/
NCMMSC’01, 20-22 Nov 2001, Shenzhen, China
Slide 2: Motivation
❑ In spontaneous speech, the pronunciation of individual words varies. There are often
  ❖ sound changes, and
  ❖ phone changes.
  Changes include insertion, deletion, and substitution.
  ❖ For Chinese, there is in addition
    ➢ an accent problem even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects)
    ➢ colloquialism, grammar, and style
❑ Goal: modelling the pronunciation variations
  ❖ Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change to
  ❖ Finding solutions to pronunciation modelling, both theoretically and practically
Slide 3: Overview

| Authors | Paper | Source | Database | Method | WER |
|---|---|---|---|---|---|
| T. Fukada, Y. Sagisaka (ATR, Japan) | Automatic generation of a pronunciation dictionary based on a pronunciation network | EuroSpeech’97 | Japanese Spontaneous | ANN Prediction | 75.54%, 67.44% |
| M-K Liu, Bo Xu (NLPR, China) | Mandarin accent adaptation based on CI/CD pronunciation modeling | ICASSP’2000 | Shanghai Accent (Intel) | Confusion Matrix | 45.13%, 40.24% |
| M. Saraclar (CLSP, JHU), H. Nock (CUED, Cambridge, UK) | Pronunciation modeling by sharing Gaussian densities across phonetic models | EuroSpeech’99 | Switchboard | Gaussian Sharing | 50.10%, 48.70% |
| K. Ma, G. Zavaliagkos (GTE / BBN, USA) | Pronunciation modeling for large vocabulary conversational speech recognition | ICSLP’98 | Switchboard, Callhome | Lexical Adaptation | 54.60%, 53.49% |
| M. Riley (AT&T Labs), W. Byrne (CLSP, JHU) | Stochastic pronunciation modelling from hand-labelled phonetic corpora | Speech Communication, 1999 (29) | TIMIT + ICSI | Decision Tree | 44.66%, 44.05% |
| D. Povey, P.C. Woodland (CUED, Cambridge, UK) | Improved discriminative training techniques for large vocabulary continuous speech recognition | ICASSP’2001 | NAB, Switchboard | Discriminant Training | 46.60%, 44.30% |
| T. Hain, P.C. Woodland (CUED, Cambridge, UK) | New features in the CU-HTK system for transcription of conversational telephone speech | ICASSP’2001 | NIST Hub5E (Telephone) | VTLN, MMIE | 51.60%, 47.00% |
Slide 4: Necessity of establishing a new annotated spontaneous speech corpus
❑ The existing databases (incl. Broadcast News, CallHome, CallFriend, …) do not cover all the phenomena of spoken Chinese
  ❖ Sound changes: voiced, unvoiced, nasalization, …
  ❖ Phone changes: retroflexed, OOV-phoneme, …
❑ The existing databases do not contain pronunciation variation information for use in bootstrap training
❑ A Chinese Annotated Spontaneous Speech (CASS) Corpus was established before WS00 on LSP at JHU
  ❖ Completely spontaneous (discourses, lectures, ...)
  ❖ Considerable background noise, accent background, ...
  ❖ Recorded onto tapes and then digitized
Slide 5: Chinese Annotated Spontaneous Speech (CASS) Corpus
❑ CASS w/ Five-Tier Transcription
  ❖ Character Level: base form
  ❖ Syllable (or Pinyin) Level (w/ tone): base form
  ❖ Initial/Final (IF) Level: w/ time boundary for base form
  ❖ SAMPA-C Level: surface form
  ❖ Miscellaneous Level: used for garbage modeling
    ➢ Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese
  ❖ Example (the sentence means roughly "let us get to know a few more people")

Character     | 我  | 们    | 多     | 认     | 识    | 点      | 人
Syllable      | wo3 | men0  | duo1   | ren4   | shi0  | dian3   | ren2
CASS Syllable | wo3 | men0  | duo1   | ren4   | shi0  | dianr3  | ren2
IF            | uo  | m @_n | t uo   | z` @_n | s` i` | t iE_n  | z` @_n
GIF           | uo  | @_n   | t_v uo | z` @_n | s`_v  | t_v ia` | z` @_n

Misc: noise< noise> mum< mum>
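
The relation between the IF tier (base form) and the GIF tier (surface form) in the example above can be made concrete with a small alignment sketch. The slides do not say how the two tiers were aligned, and the CASS annotation itself was produced by transcribers, so everything here is an illustrative assumption: the function name `align` is hypothetical, and a plain edit-distance alignment with unit costs stands in for whatever procedure was actually used. It labels each phone as a match, substitution, deletion, or insertion, which are the three kinds of change named on Slide 2 plus the unchanged case.

```python
def align(base, surface):
    """Align two phone sequences with unit-cost edit distance.

    Returns a list of (operation, base_phone, surface_phone) tuples, where
    operation is "match", "substitution", "deletion", or "insertion".
    Assumption: all edits cost 1; a real system would likely use
    phonetically motivated costs.
    """
    n, m = len(base), len(surface)
    # cost[i][j] = edit distance between base[:i] and surface[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (base[i - 1] != surface[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the end to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (base[i - 1] != surface[j - 1])):
            op = "match" if base[i - 1] == surface[j - 1] else "substitution"
            ops.append((op, base[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", base[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, surface[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # IF and GIF tiers copied from the Slide 5 example.
    if_tier = "uo m @_n t uo z` @_n s` i` t iE_n z` @_n".split()
    gif_tier = "uo @_n t_v uo z` @_n s`_v t_v ia` z` @_n".split()
    for op, b, s in align(if_tier, gif_tier):
        if op != "match":
            print(op, b, "->", s)
```

When several alignments have the same cost, the traceback picks one of them, so the pairing of individual phones (for example, which of `s`` and `i`` is treated as deleted) is not guaranteed to match a phonetician's judgment. The counts of each operation type are more reliable than the individual pairings.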