Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

Mark Hasegawa-Johnson
jhasegaw@uiuc.edu

Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
• Acoustic events with similar importance in all languages, and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments

Where do these things happen?
• Syllable Onset ≈ Consonant Release
• Syllable Nucleus ≈ Vowel Center
• Syllable Coda ≈ Consonant Closure

I(phone; acoustics) experiment: Hasegawa-Johnson, 2000
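The mutual-information criterion on this slide can be estimated directly from phone-aligned data. Below is a minimal sketch, not the experiment from Hasegawa-Johnson (2000): it discretizes one scalar acoustic measurement (standing in for the energy in a single time-frequency cell) and computes I(phone; acoustics(t,f)) from the joint histogram. Scanning such an estimate over all (t,f) cells relative to phone boundaries is what locates the maxima that define landmarks. The toy data and all names here are illustrative.

```python
import numpy as np

def mutual_information(phones, acoustics, n_bins=16):
    """Estimate I(phone; acoustic value) in bits from paired samples.

    phones    : integer phone label per frame, shape (n_frames,)
    acoustics : scalar acoustic measurement per frame (e.g. log energy
                in one time-frequency cell), shape (n_frames,)
    """
    # Discretize the continuous acoustic value into histogram bins.
    bins = np.digitize(acoustics, np.histogram_bin_edges(acoustics, n_bins))
    joint = np.zeros((phones.max() + 1, bins.max() + 1))
    for p, b in zip(phones, bins):
        joint[p, b] += 1
    joint /= joint.sum()                       # joint pmf P(phone, bin)
    p_phone = joint.sum(axis=1, keepdims=True) # marginal P(phone)
    p_bin = joint.sum(axis=0, keepdims=True)   # marginal P(bin)
    nz = joint > 0                             # avoid log(0)
    return float((joint[nz] * np.log2(joint[nz] / (p_phone @ p_bin)[nz])).sum())

# Toy stand-in data: 3 phone classes whose "energy" distributions differ,
# so the estimated mutual information should be clearly above zero.
rng = np.random.default_rng(0)
phones = rng.integers(0, 3, size=5000)
acoustics = rng.normal(loc=phones.astype(float), scale=0.8)
print(f"I(phone; acoustics) ~ {mutual_information(phones, acoustics):.2f} bits")
```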
Landmark-Based Speech Recognition

Lattice hypothesis (words, times, scores): … backed up …

Pronunciation variants:
… backed up …
… backt up …
… back up …
… backt ihp …
… wackt ihp …

Syllable structure: ONSET NUCLEUS CODA, ONSET NUCLEUS CODA (one triple per syllable, aligned to the word hypothesis)
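This figure can be read as a rescoring step: each word edge in the lattice carries a time span and an HMM score, and the landmark/syllable-structure models score competing pronunciation variants for that span. The sketch below is a schematic of that combination only, not the WS04 code; `Edge`, `rescore`, the interpolation weight, and all scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    word: str      # lattice word hypothesis
    start: float   # start time (s)
    end: float     # end time (s)
    score: float   # original HMM log-probability

def rescore(edge, variant_scores, weight=1.0):
    """Combine the lattice score with the best landmark-based
    pronunciation-variant score for this word and time span.

    variant_scores: {pronunciation string: log-probability}, as produced
    by some landmark/syllable-structure scorer (hypothetical here).
    """
    best_variant = max(variant_scores, key=variant_scores.get)
    return edge.score + weight * variant_scores[best_variant], best_variant

# Toy numbers: "backed up" with three competing pronunciation variants.
edge = Edge("backed up", 0.20, 0.80, score=-42.0)
variants = {"backt up": -3.1, "back up": -4.5, "backt ihp": -2.7}
new_score, variant = rescore(edge, variants, weight=0.5)
print(variant, new_score)   # backt ihp -43.35
```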
Talk Outline

Overview
1. Acoustic Modeling
– Speech data and acoustic features
– Landmark detection
– Estimation of real-valued “distinctive features” using support vector machines (SVM) (see the sketch after this outline)
2. Pronunciation Modeling
– A Dynamic Bayesian Network (DBN) implementation of Articulatory Phonology
– A discriminative pronunciation model implemented using Maximum Entropy (MaxEnt)
3. Technological Evaluation
– Rescoring of word lattice output from an HMM-based recognizer
– Errors that we fixed: channel noise, laughter, etcetera
– New errors that we caused: pronunciation models trained on 3 hours can’t compete with triphone models trained on 3000 hours
– Future Plans
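The last step of item 1 maps acoustics to graded rather than binary distinctive-feature values. A minimal illustration of that idea, assuming scikit-learn and toy vectors in place of the WS04 acoustic observations: the SVM's signed distance to its decision boundary is taken as the real-valued feature score.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for frame-level acoustic feature vectors labeled with one
# binary distinctive feature (e.g. +/- sonorant); real inputs would be
# spectral/cepstral observations around a candidate landmark.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (200, 12)),
               rng.normal(+1.0, 1.0, (200, 12))])
y = np.array([0] * 200 + [1] * 200)   # 0 = [-feature], 1 = [+feature]

# An RBF-kernel SVM; its signed distance to the decision boundary serves
# as a real-valued (soft) distinctive-feature score rather than a hard
# binary decision.
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
soft_scores = svm.decision_function(X[:5])
print(np.round(soft_scores, 2))   # graded +/- feature evidence per frame
```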
Overview

• History
– Research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04

• Scientific Goal
– To use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments

• Technological Goal
– Long-term: to create a better speech recognizer
– Short-term: lattice rescoring, applied to word lattices produced by SRI’s NN/HMM hybrid