Acoustic model and adaptation for automatic speech recognition and animal vocalization classification

Jidong Tao, Marquette University

Abstract

Automatic speech recognition (ASR) converts human speech to readable text. Acoustic model adaptation, also called speaker adaptation, is one of the most promising techniques in ASR for improving recognition accuracy. Adaptation works by tuning a general-purpose acoustic model to the specific speaker using the system. Speaker adaptation methods can be categorized into Bayesian-based, transformation-based, and model-combination-based approaches. Model-combination-based speaker adaptation has been shown to have an advantage over traditional Bayesian-based and transformation-based adaptation methods when the amount of adaptation speech is as small as a few seconds. However, model-combination-based rapid speaker adaptation has not been widely used in practical applications, since it requires large amounts of speaker-dependent (SD) training data from multiple speakers. This research proposes a new technique, eigen-clustering, to eliminate the need for large quantities of speaker-labeled training utterances, so that model-combination-based adaptation can start from much less expensive speaker-independent (SI) data. Based on principal component analysis (PCA), this technique constructs an eigenspace from the individual utterances in the training set. The proposed adaptation method not only improves human speech recognition directly, but can also potentially contribute to animal vocalization analysis and behavioral studies. Application to the field of bioacoustics is especially meaningful because the amount of collected animal vocalization data is often limited, so fast adaptation methods are naturally suitable.
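The core PCA step described above can be sketched as follows. This is a minimal illustration only, assuming each training utterance has been summarized as a fixed-length feature vector (a common choice in eigenvoice-style methods); the variable names, dimensions, and the use of synthetic data are assumptions for illustration, not details from the original work.

```python
import numpy as np

# Assumption for illustration: each of the training utterances has been
# reduced to a fixed-length vector (e.g., stacked means of frame-level
# acoustic features). Synthetic data stands in for real utterances here.
rng = np.random.default_rng(0)
n_utterances, dim = 50, 20
utterance_vectors = rng.normal(size=(n_utterances, dim))

# PCA via the SVD of the mean-centered data matrix: the right singular
# vectors are the principal axes spanning the eigenspace.
mean_vector = utterance_vectors.mean(axis=0)
centered = utterance_vectors - mean_vector
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Retain the top-k components as the eigenspace basis.
k = 5
eigenspace = vt[:k]  # shape (k, dim); rows are orthonormal

# A new utterance (e.g., a few seconds of adaptation speech) is then
# represented compactly by its coordinates in this low-dimensional space.
new_utterance = rng.normal(size=dim)
coords = eigenspace @ (new_utterance - mean_vector)  # shape (k,)
```

The low-dimensional coordinates, rather than the raw utterance vectors, are what a rapid adaptation scheme would estimate from a few seconds of speech, which is why so little adaptation data suffices.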

This paper has been withdrawn.