Formant tracking of noise-corrupted speech signals based on auditory modeling
Abstract
In many situations a speech signal is corrupted by additive noise, which reduces its intelligibility. Digital algorithms have been developed to process such noise-corrupted signals and enhance their intelligibility. Some of these algorithms employ features extracted from the noise-corrupted signal to aid in the processing. One such set of features consists of the speech "formants." Formants are peaks, or resonances, in the spectrum of the speech signal. Their frequency, strength, and bandwidth as functions of time are very important in determining how the spoken information is perceived by the listener. An accurate "formant tracker" (an algorithm that identifies the formant frequencies as a function of time) is therefore critical to any speech enhancement system based on formant information. Many formant trackers have been proposed, but all lose accuracy to a greater or lesser extent in the presence of noise, particularly at high noise levels. In this dissertation, a new class of formant tracking algorithms based on an auditory model is proposed. The particular auditory model employed was proposed by Ghitza (1986, 1988, 1992, 1993, 1994). It consists of various mathematical stages that mimic the performance of the human auditory system. Since humans are rather good at understanding speech, even to some extent in noisy situations, a formant tracker based on an auditory model is expected to have the potential to outperform other formant trackers in the presence of high levels of noise. Various aspects of the auditory model are studied to determine which combination of model features and parameters is most useful in extracting formant information from a speech signal. A formant tracking algorithm is then developed and implemented using the auditory model.
This formant tracker is evaluated and compared (in terms of the percentage of missed formants and the root-mean-square error of the formant frequencies) with two other standard formant trackers on a database of noise-corrupted speech utterances for which accurate formant information is known. The auditory formant tracker is shown to outperform the other formant trackers in high-noise situations, especially for the first formant and for male speakers. Informal listening tests in high-noise situations, employing an existing formant-based processing system, also suggest improved performance of the auditory formant tracker over the other formant trackers considered.
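The two evaluation measures named above can be illustrated with a minimal Python sketch. The abstract does not define when a formant counts as "missed," so the rule used here is an assumption: a frame is missed when the tracker returns no estimate or the estimate lies more than a tolerance from the reference. The function name `score_formant_track`, the parameter `miss_tolerance_hz`, its 300 Hz default, and the sample values are all hypothetical and are not taken from the dissertation.

```python
import math
from typing import Optional, Sequence, Tuple


def score_formant_track(
    reference_hz: Sequence[float],
    estimated_hz: Sequence[Optional[float]],
    miss_tolerance_hz: float = 300.0,  # hypothetical threshold, not from the source
) -> Tuple[float, float]:
    """Return (percent_missed, rmse_hz) for one formant track.

    Assumption: a frame is missed when the tracker gives no estimate (None)
    or the estimate is more than miss_tolerance_hz from the reference.
    The RMSE is computed over the remaining (detected) frames only.
    """
    if len(reference_hz) != len(estimated_hz):
        raise ValueError("tracks must have the same number of frames")
    if not reference_hz:
        raise ValueError("tracks must contain at least one frame")

    squared_errors = []
    missed = 0
    for ref, est in zip(reference_hz, estimated_hz):
        if est is None or abs(est - ref) > miss_tolerance_hz:
            missed += 1
        else:
            squared_errors.append((est - ref) ** 2)

    percent_missed = 100.0 * missed / len(reference_hz)
    if squared_errors:
        rmse_hz = math.sqrt(sum(squared_errors) / len(squared_errors))
    else:
        rmse_hz = float("nan")  # no detected frames, so the error is undefined
    return percent_missed, rmse_hz


# Illustrative first-formant track over four frames (made-up values).
reference = [500.0, 520.0, 540.0, 560.0]
estimate = [510.0, None, 530.0, 1200.0]
percent_missed, rmse_hz = score_formant_track(reference, estimate)
print(percent_missed, rmse_hz)
```

Scoring each formant separately in this way would allow per-formant comparisons such as the first-formant result reported above. Other definitions of a missed formant are equally plausible and would change both numbers.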
This paper has been withdrawn.