How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
[Figure labels: Acoustic waveform, Acoustic signal, Speech recognition]
1/28/2021 HUMAN COMPUTER INTERACTION
Outline
Introduction
Speech recognition based on HMM
• Acoustic processing
• Acoustic modeling: Hidden Markov Model
• Language modeling
• Statistical approach
Acoustic processing
A wave for the words “speech lab” looks like:
[Figure: waveform of “speech lab”, segmented as s, p, ee, ch, l, a, b, with a close-up of the “l” to “a” transition]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/
Acoustic sampling
• 10 ms frame (ms = millisecond = 1/1000 second)
• ~25 ms window around frame to smooth signal processing
[Figure: 25 ms windows advancing in 10 ms steps over the raw waveform samples (integer amplitudes, e.g. -986, -792, -692, -614), producing a1, a2, a3, . . .]
Result: Acoustic Feature Vectors
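The framing step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name frame_signal and its parameters are hypothetical, each window is taken to start at its frame boundary rather than being centered on it, and the output is raw sample windows, not yet the feature vectors a1, a2, a3 that a recognizer would compute from them.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=10, window_ms=25):
    """Cut a sample sequence into overlapping windows, one per frame step.

    Assumption: each window starts at its frame boundary; the slide only
    says the window lies "around" the frame.
    """
    step = sample_rate_hz * frame_ms // 1000     # samples per 10 ms frame
    width = sample_rate_hz * window_ms // 1000   # samples per 25 ms window
    frames = []
    # Stop once a full window no longer fits in the remaining samples.
    for start in range(0, len(samples) - width + 1, step):
        frames.append(samples[start:start + width])
    return frames


# Hypothetical input: one second of silence sampled at 16 kHz.
frames = frame_signal([0] * 16000, 16000)
print(len(frames), len(frames[0]))  # prints: 98 400
```

With a 10 ms step and a 25 ms window, consecutive windows overlap by 15 ms, which is what smooths the transition from one frame to the next.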
Spectral analysis
Frequency gives pitch; amplitude gives volume
• sampling at ~8 kHz (telephone) or ~16 kHz (microphone) (kHz = 1000 cycles/sec)
Fourier transform of wave yields a spectrogram
• darkness indicates energy at each frequency
• hundreds to thousands of frequency samples
[Figure: waveform and spectrogram of “speech lab”, segmented as s, p, ee, ch, l, a, b]
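A minimal sketch of how a spectrogram could be computed from framed audio, using only the Python standard library. The names hamming, magnitude_spectrum and spectrogram are hypothetical, the Hamming taper is an assumed choice of window function (the slides do not name one), and the naive O(n^2) DFT stands in for the fast Fourier transform a real system would use.

```python
import cmath
import math


def hamming(n):
    """Hamming taper (an assumed choice) to reduce edge effects in each window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for frequency bins 0 .. n/2 of one window."""
    n = len(frame)
    tapered = [s * w for s, w in zip(frame, hamming(n))]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(tapered[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, sample_rate_hz, frame_ms=10, window_ms=25):
    """One magnitude spectrum per 10 ms step, each over a 25 ms window."""
    step = sample_rate_hz * frame_ms // 1000
    width = sample_rate_hz * window_ms // 1000
    return [magnitude_spectrum(samples[s:s + width])
            for s in range(0, len(samples) - width + 1, step)]


# Hypothetical test signal: 0.1 s of a 1000 Hz sine sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(800)]
spec = spectrogram(tone, rate)

width = rate * 25 // 1000  # 200 samples per window
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin * rate / width)  # prints: 1000.0
```

Each row of spec corresponds to one column of the spectrogram image, and a larger magnitude corresponds to a darker pixel at that frequency. At 8 kHz a 25 ms window gives 101 frequency bins spaced 40 Hz apart; longer windows or zero-padding would give more.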