Realistic Visual Speech Synthesis Based on AAM Features and an Articulatory DBN Model with Constrained Asynchrony
Authors: P. Wu, D. Jiang, H. Zhang and H. Sahli
Publication Year: 2011
Pages: 59-64
Abstract: This paper presents a novel photo-realistic visual speech synthesis method based on an audio-visual articulatory dynamic Bayesian network model with constrained asynchrony (AF_AVDBN). Conditional probability distributions are defined to control the asynchronies between articulatory features such as the lips, tongue, and glottis/velum. Perceptual linear prediction (PLP) features from the audio speech and active appearance model (AAM) features from mouth images of the visual speech are adopted to train the AF_AVDBN model on continuous speech. An EM-based optimal visual feature learning algorithm is derived given the input auditory speech and the trained AF_AVDBN parameters. Finally, photo-realistic mouth images are synthesized from the learned AAM features. Objective evaluations show that the visual features learned with AF_AVDBN track the real parameters much more closely than those from the SA_DBN and SS_DBN models. Subjective evaluation results show that very high quality mouth animations can be obtained with the AF_AVDBN models. By considering the asynchronies between articulatory features in AF_AVDBN (as well as those between audio and visual states in SA_DBN), good synchrony between the audio speech and the mouth animations is achieved. Moreover, the mouth animations from AF_AVDBN are much more accurate than those from SA_DBN and SS_DBN, because AF_AVDBN captures the dynamic movements of the articulatory features and thus models the pronunciation process more precisely.
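
To make the final synthesis step concrete, below is a minimal sketch (in Python/NumPy) of how a mouth image could be reconstructed from a learned AAM feature vector under the standard linear appearance model, image = mean appearance + appearance modes x parameters. The array names, the image resolution, and the appearance-only simplification (no shape warping) are illustrative assumptions, not details taken from the paper.

import numpy as np

def synthesize_mouth_image(mean_app, app_modes, aam_params, image_shape=(64, 96, 3)):
    """Reconstruct a mouth image from AAM appearance parameters using the
    standard linear model: appearance = mean + modes @ params.

    mean_app   : (P,)   mean appearance vector over the training mouth images
    app_modes  : (P, k) appearance eigenvectors (PCA basis)
    aam_params : (k,)   learned AAM feature vector for one video frame
    """
    appearance = mean_app + app_modes @ aam_params
    # Clamp to the valid pixel range and restore the 2-D image layout.
    return np.clip(appearance, 0.0, 255.0).reshape(image_shape).astype(np.uint8)

if __name__ == "__main__":
    # Toy demonstration with a random basis; a real system would use the
    # PCA basis estimated from training mouth images.
    h, w, c, k = 64, 96, 3, 20
    rng = np.random.default_rng(0)
    mean_app = rng.uniform(0, 255, size=h * w * c)
    app_modes = rng.normal(scale=5.0, size=(h * w * c, k))
    aam_params = rng.normal(size=k)  # would come from the EM inference step
    frame = synthesize_mouth_image(mean_app, app_modes, aam_params, (h, w, c))
    print(frame.shape, frame.dtype)  # (64, 96, 3) uint8

In a full AAM pipeline, the shape parameters would additionally warp the reconstructed appearance onto the frame-specific mouth shape before compositing it into the face image; the sketch keeps only the appearance reconstruction, which is the part driven directly by the learned features.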