Duke University

Department of Electrical and Computer Engineering

Exploring Different Models of Speech Production:
Applications in Speech Coding and Speaker Recognition

Peter Brende & William Chan

Advisor: Dr. Gary Ybarra


    Over the past 50 years, the field of speech processing has advanced considerably. The traditional and most commonly used model of speech production consists of a source and an all-pole LTI filter. The filter models the vocal tract of the individual, while the source models the vocal cords and glottis. It is often assumed that the source generates an excitation sequence consisting of a periodic pulse train, white noise, or a combination of the two. This simple source-filter model has served as the foundation for the vast majority of speech coding, speech recognition, and speaker recognition systems.
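
    As a rough illustration of the source-filter model described above, the following sketch drives an all-pole filter with a periodic pulse train. The sampling rate (8 kHz), pitch period (80 samples), and the single two-pole resonance at 500 Hz are hypothetical values chosen only for the example; a realistic vocal tract model would use a higher filter order.

```python
import math

def pulse_train(length, period):
    """Voiced excitation: a unit impulse once every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def all_pole_filter(excitation, a):
    """Apply H(z) = 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p).

    Difference equation: y[n] = x[n] - sum_k a[k-1] * y[n-k].
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# Hypothetical single resonance: pole radius 0.95 at 500 Hz, 8 kHz sampling.
r = 0.95
theta = 2 * math.pi * 500 / 8000
a = [-2 * r * math.cos(theta), r * r]

# 400 samples of "voiced speech" with an 80-sample pitch period (100 Hz).
speech = all_pole_filter(pulse_train(400, 80), a)
```

    Replacing the pulse train with white noise (or mixing the two) gives the unvoiced and mixed excitation cases mentioned above.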

    The starting point for nearly all speech coders is to inverse filter the speech signal to generate the estimated 'excitation' sequence or 'residual' signal. The main differences between speech coders lie in the way that this residual signal is encoded. We will evaluate two very different speech coders with respect to the quality of the reproduced speech and the bit-rate of the encoded data.
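
    The inverse-filtering step can be sketched as follows. This is a minimal example that assumes the predictor is estimated with the autocorrelation method and the Levinson-Durbin recursion, which is one common choice and not necessarily the one used by the coders evaluated here. The toy frame, the predictor order, and all function names are illustrative assumptions.

```python
import math

def autocorrelation(s, order):
    """Autocorrelation lags r[0..order] of the frame s."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) with a[0] = 1, so that the inverse filter is
    A(z) = sum_k a[k] z^-k, and err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def inverse_filter(s, a):
    """Residual e[n] = sum_k a[k] * s[n-k] (samples before n = 0 taken as 0)."""
    return [sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(s))]

# Toy voiced frame: the impulse response of a two-pole resonator
# (radius 0.95, 500 Hz at 8 kHz sampling), restarted every 80 samples.
rho = 0.95
theta = 2 * math.pi * 500 / 8000
frame = [rho ** (n % 80) * math.sin(theta * (n % 80 + 1)) / math.sin(theta)
         for n in range(400)]

a, err = levinson_durbin(autocorrelation(frame, 2), 2)
residual = inverse_filter(frame, a)

frame_energy = sum(v * v for v in frame)
residual_energy = sum(v * v for v in residual)
```

    The residual carries less energy than the frame itself, which is what makes it cheaper to encode; how that residual is then quantized is where coders differ.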

    In speech and speaker recognition, the frequency response of the all-pole filter, or a transformation thereof, is most often used in the classification process. In fact, the mel-cepstrum, which exploits the psychoacoustics of pitch perception, has become the most widely accepted feature set for recognition tasks. A major problem with the mel-cepstrum and other spectral features is that they are rather susceptible to distortions from the transmission channel. A robust time-domain feature would therefore be very beneficial.

    Another problem with the traditional feature sets is that the vocal tract is not a perfect LTI filter. The large net airflow through the vocal tract produces vortices (eddies), which can become secondary sources of sound. The nature of these secondary sources likely varies from speaker to speaker and may provide additional information about the speaker's identity beyond what the spectral features reveal.

    Under the traditional model, the Teager energy (an instantaneous measure of energy) of the speech signal exhibits a sharp increase, corresponding to the glottal pulse, once per pitch period and then decays exponentially. We will examine the Teager energy patterns for several vowel sounds and for a number of different speakers. Using simple pattern modeling and classification techniques, we will investigate whether the Teager energy patterns are better suited as an indicator of the speech sound or of the speaker. Finally, we will consider ways in which Teager energy features can be incorporated into existing speaker recognition systems.
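
    The behavior described above can be checked on a toy signal. The sketch below assumes the standard discrete-time form of the Teager energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1], and applies it to a hypothetical single damped resonance that is restarted once per pitch period; the resonance parameters and pitch period are illustrative only.

```python
import math

def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The returned list covers n = 1 .. len(x) - 2, so psi[i] belongs to
    sample n = i + 1.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Sanity check on a pure tone A*cos(omega*n): the operator returns the
# constant A^2 * sin(omega)^2, which depends on amplitude and frequency.
A, omega = 2.0, 0.3
tone = [A * math.cos(omega * n) for n in range(50)]
tone_energy = teager_energy(tone)

# Toy "traditional model" signal: a damped resonance (radius 0.95, 500 Hz
# at 8 kHz sampling) re-excited every 80 samples by a glottal pulse.
rho = 0.95
theta = 2 * math.pi * 500 / 8000
period = 80
voiced = [rho ** (n % period) * math.cos(theta * (n % period))
          for n in range(3 * period)]
psi = teager_energy(voiced)

# Within one pitch period the energy decays by a factor rho^2 per sample;
# at n = 80 (psi[79]) it jumps back up with the next pulse.
decay_ratio = psi[11] / psi[10]
jump_ratio = psi[79] / psi[77]
```

    For this idealized signal the Teager energy inside a pitch period equals rho^(2n) * sin(theta)^2, so any departure from a clean exponential decay in real speech is a candidate feature of the kind this project sets out to examine.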