Estimation of cepstral coefficients for robust speech recognition

Kevin M Indrebo, Marquette University

Abstract

This dissertation introduces a new approach to estimation of the features used in an automatic speech recognition system operating in noisy environments, namely mel-frequency cepstral coefficients. A major challenge in the development of an estimator for these features is the nonlinear interaction between a speech signal and the corrupting ambient noise. Previous estimation methods have attempted to deal with this issue with the use of a low order Taylor series expansion, which results in a rough approximation of the true distortion interaction between the speech and noise signal components, and the estimators must typically be iterative, as it is the speech features themselves that are used as expansion points. The new estimation approach, named the additive cepstral distortion model minimum mean-square error estimator, uses a novel distortion model to avoid the necessity of a Taylor series expansion, allowing for a direct solution. Like many previous approaches, the estimator introduced in this dissertation uses a prior distribution model of the speech features. In previous work, this distribution is limited in specificity, as a single global model is trained over an entire set of speech data. The estimation approach introduced in this work extends this method to incorporate contextual information into the prior model, leading to a more specific distribution and subsequently better estimates of the features. An introduction to automatic speech recognition is presented, and a historical review of relevant feature estimation research is given. The new estimation approach is developed, along with a method for implementing more specific prior distribution modeling, and the new feature estimation approach is evaluated on two standard robust speech recognition datasets.

This paper has been withdrawn.