Audiovisual text-to-speech synthesis

Introduction

Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Much recent interest goes to data-driven 2D photorealistic synthesis, where the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. We propose a synthesis technique that creates both the target auditory and the target visual speech from the same audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video [1]. This strategy results in a multimodal output signal that displays a high level of audiovisual correlation, which is crucial for the perceived quality of the synthetic speech signal [2-3]. In additional research, we trained an active appearance model (AAM) [4] to transform the visual speech contained in the database into trajectories of model parameters. Using these model parameters, accurate segment selection and balanced concatenation smoothing can be achieved [5-6].

Fig. 1: Functional block scheme of the multimodal speech synthesis.
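
The unit selection at the core of this system can be viewed as a dynamic-programming (Viterbi) search over candidate audiovisual segments, minimizing a weighted sum of target costs and join costs, where the join cost is measured jointly on the audio features and on the AAM parameters of the video. The following Python sketch is only a minimal illustration of that idea; the segment representation, the cost definitions and the weights are assumptions made for the example and not the actual implementation of [1] and [5-6].

# Minimal sketch of multimodal (audiovisual) unit selection. Each candidate
# segment is assumed to carry both its audio features and its AAM parameter
# trajectory, so selected segments keep their original audio/video combination.
# Illustrative only; not the implementation of [1].
import numpy as np

def target_cost(candidate, target):
    # Mismatch between a candidate segment and the target specification
    # (e.g. phonetic identity, duration, prosodic context).
    return float(np.sum(np.abs(candidate["context"] - target["context"])))

def join_cost(prev, cur, w_audio=1.0, w_visual=1.0):
    # Discontinuity at the concatenation point, measured on the audio
    # features and on the AAM parameters of the video.
    d_audio = np.linalg.norm(prev["audio_end"] - cur["audio_start"])
    d_visual = np.linalg.norm(prev["aam_end"] - cur["aam_start"])
    return w_audio * d_audio + w_visual * d_visual

def select_units(targets, candidates, w_target=1.0, w_join=1.0):
    # candidates[i] is the list of database segments matching target i.
    n = len(targets)
    best = [dict() for _ in range(n)]  # best[i][j] = (total cost, backpointer)
    for j, cand in enumerate(candidates[0]):
        best[0][j] = (w_target * target_cost(cand, targets[0]), None)
    for i in range(1, n):
        for j, cand in enumerate(candidates[i]):
            tc = w_target * target_cost(cand, targets[i])
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(prev, cand), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            best[i][j] = (cost + tc, back)
    # Backtrack from the cheapest final candidate.
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))

Because every selected segment contains an original combination of audio and video, the concatenated output preserves the audiovisual correlation discussed above.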

Sample syntheses: Dutch

Both auditory and audiovisual speech synthesis have been the subject of many research projects over the years. Unfortunately, in recent years very little research has focused on synthesis for the Dutch language, and for audiovisual synthesis in particular hardly any system or resource is available. In [7] we describe the creation of a new, extensive Dutch speech database containing audiovisual recordings of a single speaker. The database is constructed such that it can be employed in both auditory and audiovisual speech synthesis systems. The speech was recorded in our recording studio located at the university campus [8]. The database consists of 1199 audiovisual sentences (138 min) from the open domain and 536 audiovisual sentences (52 min) from the limited domain of weather forecasts. We applied this database in both our auditory and our audiovisual speech synthesis frameworks. Below we show some audiovisual demo syntheses, both from the open domain and from the limited domain of weather forecasts. Note that we will regularly update these samples, since many optimizations can still be added to the synthesis, such as more detailed AAM modeling and an automatically optimized set of selection cost weights. A reliable free xvid/h264 decoder can be found here.
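
One way the automatically optimized selection cost weights mentioned above could be obtained is by resynthesizing held-out database sentences under different weight settings and keeping the setting whose synthetic AAM trajectories stay closest to the recorded ones. The grid-search sketch below is purely hypothetical and only illustrates this idea; synthesize(text, w_target, w_join) stands in for a full synthesis run and is not an existing function.

# Hypothetical sketch of tuning the selection cost weights by grid search:
# resynthesize held-out sentences for each weight setting and keep the
# setting whose synthetic AAM trajectories are closest to the recordings.
import itertools
import numpy as np

def trajectory_error(synth_aam, natural_aam):
    # Mean per-frame distance between synthetic and recorded AAM parameters
    # (both trajectories are assumed to be time-aligned).
    return float(np.mean(np.linalg.norm(synth_aam - natural_aam, axis=1)))

def tune_weights(held_out, synthesize, grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    # held_out: list of dicts with the sentence text and its recorded
    # AAM trajectory, e.g. {"text": ..., "aam": ...}.
    best = None
    for w_target, w_join in itertools.product(grid, grid):
        err = np.mean([
            trajectory_error(synthesize(s["text"], w_target, w_join), s["aam"])
            for s in held_out
        ])
        if best is None or err < best[0]:
            best = (err, w_target, w_join)
    return best  # (error, w_target, w_join)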

Some general demos of the AVTTS system

Note: don't play the samples below directly in your browser; use 'save as' to download the files and play them in a media player with the video size set to 100% (no scaling).

Limited domain

“In het binnenland waait een matige oost- tot noordoostenwind.” ("Inland, a moderate east to northeasterly wind is blowing.")

dutch_demo1_xvid / dutch_demo1_h264


“Aan zee waait die soms vrij krachtig uit het noordoosten met pieken tot vijftig kilometer per uur.” ("At the coast it sometimes blows quite strongly from the northeast, with peaks of up to fifty kilometres per hour.")

dutch_demo2_xvid / dutch_demo2_h264


“De maxima variëren van 20 graden aan zee, 23 graden op de Hoge Venen en tot plaatselijk 26 graden elders in het land.” ("Maximum temperatures range from 20 degrees at the coast and 23 degrees in the High Fens up to locally 26 degrees elsewhere in the country.")

dutch_demo3_xvid / dutch_demo3_h264


“Morgen is het meestal zwaarbewolkt met regen in het binnenland.” ("Tomorrow it will be mostly overcast, with rain inland.")

dutch_demo4_xvid / dutch_demo4_h264

Open domain

“De spooktrein kwam met een grote schok tot stilstand.” ("The ghost train came to a standstill with a great jolt.")

dutch_open_demo1_xvid / dutch_open_demo1_h264


“De spraak die u nu kunt horen en kunt zien is samengesteld uit kleine segmenten originele spraak.” ("The speech you can now hear and see is composed of small segments of original speech.")

dutch_open_demo2_xvid / dutch_open_demo2_h264


“Zeven zotten zwemden zeven zomerse zondagen zonder zwembroek. Daarbij krabt de kat de krollen van de trap en poetst de koetsier plots zijn postkoets met postkoetspoets.” (a typical Dutch tongue twister: "Seven fools swam seven summery Sundays without swimming trunks. In addition, the cat scratches the curls off the stairs and the coachman suddenly polishes his stagecoach with stagecoach polish.")

dutch_open_demo3_xvid / dutch_open_demo3_h264

Sample syntheses: English

To create these sentences we used the audiovisual speech database from the LIPS2008 visual speech synthesis challenge [9]. This dataset contains about 250 English sentences, giving a total of 20 minutes of continuous speech. The database consists of good-quality visual data and was designed for visual speech synthesis. It is, however, sub-optimal for auditory speech synthesis, since the pronunciation of the recorded audio could be clearer and closer to standard UK English. Furthermore, to obtain high-quality audio synthesis, the auditory speech database used by the system should contain at least 1-2 hours of continuous speech. Nevertheless, it is a very useful dataset for testing, evaluating and comparing our AVTTS system, as demonstrated by the synthesized sentences shown hereafter. A reliable free xvid/h264 decoder can be found here.

Note: don't play the samples below directly in your browser; use 'save as' to download the files and play them in a media player with the video size set to 100% (no scaling).

“The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.”

english_open_demo1_xvid / english_open_demo1_h264


“That's why darling, it's incredible, that someone so unforgettable, thinks that I am, unforgettable too.”

english_open_demo2_xvid / english_open_demo2_h264


“The good thing about being an engineer is the fact that simple problems are solved quite quickly. The downside is that most of the time, an engineer sees in everything a problem.”

english_open_demo3_xvid / english_open_demo3_h264

Viseme-based synthesis

The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well established. These viseme labels are typically determined using a many-to-one phoneme-to-viseme mapping. However, due to visual co-articulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme. In our latest research [10] it was found that neither standardized nor speaker-specific many-to-one viseme labels could improve the quality of concatenative visual speech synthesis. Therefore, a novel technique to define a many-to-many phoneme-to-viseme mapping scheme was investigated, which makes use of both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels describe the visual speech information more accurately than both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that the use of these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis using limited speech databases. Furthermore, the resulting synthetic visual speech was found, both objectively and subjectively, to be of higher quality when the many-to-many visemes are used to describe both the speech database and the synthesis targets.
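
The k-means variant of such a many-to-many mapping can be illustrated as follows: if each phoneme instance in the aligned database is represented by, say, the AAM parameters of its central frame, clustering those instances lets occurrences of the same phoneme fall into different clusters (visemes), which directly yields a many-to-many mapping. The sketch below is only a simplified illustration under these assumptions; the feature representation, clustering setup and number of visemes used in [10] may differ.

# Simplified sketch of a many-to-many phoneme-to-viseme mapping obtained
# with k-means, assuming each phoneme instance is represented by the AAM
# parameter vector of its central frame. Instances of one phoneme may end
# up in several clusters, so a phoneme can map to several visemes.
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

def many_to_many_mapping(instances, n_visemes=20):
    # instances: list of (phoneme_label, aam_vector) pairs taken from the
    # phonetically aligned speech database.
    features = np.stack([aam for _, aam in instances])
    viseme_ids = KMeans(n_clusters=n_visemes, n_init=10).fit_predict(features)
    mapping = defaultdict(set)
    for (phoneme, _), viseme in zip(instances, viseme_ids):
        mapping[phoneme].add(int(viseme))
    return dict(mapping)  # e.g. {'p': {3, 7}, 'b': {3, 7}, 'a': {12}, ...}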

Here we show some demo syntheses using viseme-based and phoneme-based visual speech synthesis strategies. The text input for these samples consisted of original text transcripts from the database, and the input auditory speech was original database audio. During synthesis, the sentence being synthesized was each time excluded from the database (a leave-one-out approach).
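
The leave-one-out protocol used for these samples can be sketched as follows; build_database and synthesize are hypothetical helpers standing in for the actual corpus preparation and synthesis steps.

# Sketch of the leave-one-out resynthesis described above: each test
# sentence is removed from the selection database before it is synthesized,
# so the system can never simply copy that sentence verbatim.
# build_database and synthesize are hypothetical helper functions.
def leave_one_out(sentences, build_database, synthesize):
    outputs = []
    for i, held_out in enumerate(sentences):
        database = build_database([s for j, s in enumerate(sentences) if j != i])
        outputs.append(synthesize(held_out["transcript"], held_out["audio"], database))
    return outputs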

Note: don't play the samples below directly in your browser; use 'save as' to download the files and play them in a media player with the video size set to 100% (no scaling).

A) Using the full Dutch speech database (1700 sentences)

“Ons land mag zich ook dit jaar weer verheugen in het behoud van vrede.” ("This year, too, our country can again rejoice in the preservation of peace.")

-phoneme-based: h264

-viseme-based: h264

“Over de reden van zijn ontslag blijven grote vraagtekens.” ("Large question marks remain over the reason for his dismissal.")

-phoneme-based: h264

-viseme-based: h264

B) Using a small subset from the database (33 sentences)

“Ik heb ondertussen een wolk van een dochter gekregen die mij meer dan genoeg uitdaging bezorgt.” ("In the meantime I have been given a wonderful daughter who provides me with more than enough challenge.")

-phoneme-based: h264

-viseme-based: h264

“Uiteraard is dit een mijlpaal in de geschiedenis van de onderneming.” ("Naturally, this is a milestone in the history of the company.")

-phoneme-based: h264

-viseme-based: h264


References

[1] Mattheyses, W., Latacz, L., Verhelst, W. and Sahli, H., "Multimodal Unit Selection for 2D Audiovisual Text-to-speech Synthesis", Lecture Notes in Computer Science, pp.125-136, Springer, 2008

[2] Mattheyses, W., Latacz, L. and Verhelst, W., "On the importance of audiovisual coherence for the perceived quality of synthesized visual speech", EURASIP Journal on Audio, Speech, and Music Processing, SI: Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation, 2009

[3] Mattheyses, W., Latacz, L. and Verhelst, W., "Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques", International Conference on Auditory-Visual Speech Processing, pp.47-52, 2009

[4] Cootes, T.F., Edwards, G.J. and Taylor, C.J., "Active Appearance Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.23(6), pp.681-685, 2001

[5] Mattheyses, W., Latacz, L. and Verhelst, W., "Active Appearance Models for Photorealistic Visual Speech Synthesis", Interspeech, pp.1113-1116, 2010

[6] Mattheyses, W., Latacz, L. and Verhelst, W., "Optimized Photorealistic Audiovisual Speech Synthesis Using Active Appearance Modeling", International Conference on Auditory-visual Speech Processing, pp.148-153, 2010

[7] Mattheyses, W., Latacz, L. and Verhelst, W., "Auditory and Photo-realistic Audiovisual Speech Synthesis for Dutch", International Conference on Auditory-visual Speech Processing, pp.53-58, 2011

[8] AV Lab, the audio-visual laboratory of ETRO, Online: Nosey Elephant Studios

[9] Theobald, B., Fagel, S., Bailly, G. and Elisei, F., "LIPS2008: Visual Speech Synthesis Challenge", Interspeech, pp.1875-1878, 2008

[10] Mattheyses, W., Latacz, L. and Verhelst, W., "Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis", Speech Communication, Vol.55(7-8), pp.857-876, 2013
