Overview of Systems to be Described
[Block diagram; recoverable labels, grouped by component]
• Acoustic features: MFCC (5ms & 1ms frame period), Formants, Phonetic & Auditory Model Parameters
  – concatenate 4-15 frames
• Acoustic Model: SVMs
  – p(landmark|SVM)
• First-Pass ASR Word Lattice
  – word label, start & end times
  – p(MFCC,PLP|word), p(word|words)
• Pronunciation Model (DBN or MaxEnt)
  – p(SVM|word)
• Rescoring: Log-Linear Score Combination
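The log-linear score combination in the rescoring stage can be illustrated as a weighted sum of log-domain scores per lattice hypothesis. The following is a minimal sketch, not the system's actual implementation: the stream names (`acoustic` for log p(MFCC,PLP|word), `language` for log p(word|words), `landmark` for log p(SVM|word)), the weights, and the score values are all hypothetical.

```python
def log_linear_score(log_scores, weights):
    """Weighted sum of log-domain scores for one hypothesis."""
    return sum(weights[name] * log_scores[name] for name in weights)


def rescore(hypotheses, weights):
    """Return the hypotheses sorted by combined score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: log_linear_score(h["log_scores"], weights),
        reverse=True,
    )


# Hypothetical stream weights and log scores, for illustration only.
weights = {"acoustic": 1.0, "language": 0.8, "landmark": 0.5}
hypotheses = [
    {"words": "hyp A",
     "log_scores": {"acoustic": -10.0, "language": -4.0, "landmark": -6.0}},
    {"words": "hyp B",
     "log_scores": {"acoustic": -11.0, "language": -3.0, "landmark": -2.0}},
]

best = rescore(hypotheses, weights)[0]
```

In this toy example the landmark stream changes the ranking: "hyp A" has the better acoustic score, but "hyp B" wins after combination. In practice the stream weights would be tuned on held-out data.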
I. Acoustic Modeling
• Goal: Learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.
• Methods:
  – Large input vector space (many acoustic feature types)
  – Regularized binary classifiers (SVMs)
  – SVM outputs “smoothed” using dynamic programming
  – SVM outputs converted to posterior probability estimates once/5ms using histogram
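The last method, converting SVM outputs to posterior estimates with a histogram, might look roughly like the sketch below. The slides do not give the binning or smoothing details, so the bin edges, the add-one smoothing, and the toy scores and labels are assumptions made for illustration.

```python
from bisect import bisect_right


def fit_histogram(scores, labels, edges):
    """Estimate P(label = 1 | bin) for each bin of SVM discriminant scores.

    `edges` are sorted interior bin boundaries, giving len(edges) + 1 bins.
    Add-one smoothing (an assumption) gives empty bins a posterior of 0.5.
    """
    n_bins = len(edges) + 1
    positives = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, labels):
        b = bisect_right(edges, score)
        totals[b] += 1
        positives[b] += label
    return [(p + 1) / (t + 2) for p, t in zip(positives, totals)]


def posterior(score, edges, table):
    """Look up the posterior estimate for a new SVM score."""
    return table[bisect_right(edges, score)]


# Hypothetical held-out SVM scores with 0/1 reference labels.
edges = [-1.0, 0.0, 1.0]
scores = [-2.0, -1.5, -0.5, -0.2, 0.3, 0.4, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

table = fit_histogram(scores, labels, edges)
p = posterior(1.2, edges, table)
```

Applied to one SVM output per 5ms frame, this lookup yields one posterior estimate per frame.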
Speech Databases

Database           Size     Phonetic Transcr.   Word Lattices
NTIMIT             14hrs    manual              -
WS96&97            3.5hrs   manual              -
SWB1 WS04 subset   12hrs    auto-SRI            BBN
Eval01             10hrs    -                   BBN & SRI
RT03 Dev           6hrs     -                   SRI
RT03 Eval          6hrs     -                   SRI
Acoustic and Auditory Features
• MFCCs, 25ms window (standard ASR features)
• Spectral shape: energy, spectral tilt, and spectral compactness, once/millisecond
• Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
• Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
• Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
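According to the system overview, per-frame features such as these are concatenated over 4-15 frames before they reach the SVMs. The sketch below shows one way such frame stacking could be done. The roughly centered window and the edge handling (repeating the first or last frame) are assumptions, and the two-dimensional toy frames stand in for real feature vectors.

```python
def stack_frames(frames, n):
    """Concatenate n consecutive frames into one vector per frame position.

    The window is roughly centered on frame t (an assumption); indices that
    fall outside the utterance are clamped, repeating the edge frame.
    """
    left = (n - 1) // 2
    right = n - 1 - left
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for k in range(t - left, t + right + 1):
            vec.extend(frames[min(max(k, 0), last)])
        stacked.append(vec)
    return stacked


# Hypothetical 2-dimensional feature frames, for illustration only.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
stacked = stack_frames(frames, 4)
```

Each stacked vector has `n` times the per-frame dimension, which is how the classifier input becomes a large vector space.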
What are Distinctive Features? What are Landmarks?
• Distinctive feature =
  – a binary partition of the phonemes (Jakobson, 1952)
  – … that compactly describes pronunciation variability (Halle)
  – … and correlates with distinct acoustic cues (Stevens)
• Landmark = Change in the value of a Manner Feature
  – [+sonorant] to [–sonorant], [–sonorant] to [+sonorant]
  – 5 manner features: [sonorant, consonantal, continuant, syllabic, silence]
• Place and Voicing features: SVMs are only trained at landmarks
  – Primary articulator: lips, tongue blade, or tongue body
  – Features of primary articulator: anterior, strident
  – Features of secondary articulator: nasal, voiced
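The definition of a landmark as a change in the value of a manner feature can be made concrete with a short sketch. It assumes a frame-level binary track for one manner feature (here [sonorant]) sampled every 5ms. The function name, the track values, and the convention of placing the landmark at the first frame with the new value are hypothetical.

```python
def find_landmarks(feature_track, frame_period_ms=5.0):
    """Return (time_ms, old_value, new_value) for each change in a binary
    manner-feature track; the landmark is placed at the first frame that
    carries the new value (an assumed convention)."""
    landmarks = []
    for t in range(1, len(feature_track)):
        prev, cur = feature_track[t - 1], feature_track[t]
        if prev != cur:
            landmarks.append((t * frame_period_ms, prev, cur))
    return landmarks


# Hypothetical [sonorant] decisions, one per 5ms frame: 1 = [+sonorant].
sonorant = [1, 1, 1, 0, 0, 1, 1]
landmarks = find_landmarks(sonorant)
```

Here the track yields two landmarks: a [+sonorant] to [–sonorant] change at 15ms and the reverse change at 25ms. Place and voicing classifiers would then be evaluated only at such times.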