Development of Kinematic Templates for Automatic Pronunciation Assessment Using Acoustic-to-Articulatory Inversion
Date of Award
Master of Science (MS)
Electrical and Computer Engineering
Computer-aided pronunciation training (CAPT) is a subcategory of computer-aided language learning (CALL) that deals with the correction of mispronunciation during language learning. For a CAPT system to be effective, it must provide useful and informative feedback that is comprehensive, qualitative, quantitative, and corrective. While the majority of modern systems address the first 3 aspects of feedback, most of these systems do not provide corrective feedback. As part of the National Science Foundation (NSF) funded study “RI: Small: Speaker Independent Acoustic-Articulator Inversion for Pronunciation Assessment”, the Marquette Speech and Swallowing Lab and Marquette Speech and Signal Processing Lab are conducting a pilot study on the feasibility of the use of acoustic-to-articulatory inversion for CAPT. In order to evaluate the results of a speaker’s acoustic-to-articulatory inversion to determine pronunciation accuracy, kinematic templates are required. The templates would represent the vowels, consonant clusters, and stress characteristics of a typical American English (AE) speaker in the midsagittal plane. The Marquette University electromagnetic articulography Mandarin-accented English (EMA-MAE) database, which contains acoustic and kinematic speech data for 40 speakers (20 of which are native AE speakers), provides the data used to form the kinematic templates. The objective of this work is the development and implementation of these templates. The data provided in the EMA-MAE database is analyzed in detail, and the information obtained from the analysis is used to develop the kinematic templates. The vowel templates are designed as sets of concentric confidence ellipses, which specify (in the midsagittal plane) the ranges of tongue and lip positions corresponding to correct pronunciation. These ranges were defined using the typical articulator positioning of all English speakers of the EMA-MAE database. The data from these English speakers were also used to model the magnitude, speed history, movement pattern, and duration (MSTD) features of each consonant cluster in the EMA-MAE corpus. Cluster templates were designed as set of average MSTD parameters across English speakers for each cluster. Finally, English stress characteristics were similarly modeled as a set of average magnitude, speed, and duration parameters across English speakers. The kinematic templates developed in this work, while still in early stages, form the groundwork for assessment of features returned by the acoustic-to-articulatory inversion system. This in turn allows for assessment of articulatory inversion as a pronunciation training tool.