ETRO Events

PhD Defense
A Multimodal Approach to Audiovisual Text-to-Speech Synthesis

Presenter

Mr Wesley Mattheyses - ETRO-VUB

Abstract

Oral speech has always been the most important means of communication between humans. When a message is conveyed using oral speech, it is encoded in two separate signals: an auditory speech signal and a visual speech signal. The auditory speech signal consists of a series of speech sounds produced by the human speech production system; different sounds are generated by varying the parameters of this system. Since some of the human articulators are visible to an observer (e.g., the lips, the teeth and the tongue), the movements of these visible articulators during speech define the visual speech signal. It is well known that an optimal conveyance of the message is possible only when the receiver perceives both the auditory and the visual speech signals.

Over the last decades, the development of advanced computer systems has led to the current situation in which the vast majority of appliances, from industrial machinery to small household devices, are computer-controlled. Consequently, people nowadays interact with computer systems countless times in everyday situations. Since the ultimate goal is to make this interaction feel completely natural and familiar, the most suitable way to interact with a machine is by means of oral speech. As in speech communication between humans, the most appropriate human-machine interaction consists of audiovisual speech.

In order to transfer a spoken message from the machine to the user, the device has to contain a so-called audiovisual speech synthesizer. This is a system capable of generating a novel audiovisual speech signal, mostly from text input (so-called audiovisual text-to-speech (AVTTS) synthesis). Audiovisual speech synthesis has been a popular research topic during the last decade. The synthetic auditory speech mode created by the synthesizer consists of a waveform that resembles as closely as possible an original acoustic speech signal uttered by a human. The synthetic visual speech signal displays a virtual speaker exhibiting the speech gestures that match the auditory speech information. The great majority of AVTTS synthesizers perform the synthesis in separate stages: first, the auditory and the visual speech signals are synthesized consecutively and often completely independently, after which both synthetic speech modes are synchronized and multiplexed. Unfortunately, this strategy cannot optimize the audiovisual coherence of the output signal. This motivates the development of a single-phase AVTTS synthesis approach, in which both speech modes are generated simultaneously, making it possible to maximize the coherence between the two synthetic speech signals.
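As a rough illustration of this architectural difference, the Python sketch below contrasts the two layouts; every name in it is an illustrative placeholder, not an interface of the actual system.

def two_phase_avtts(text):
    """Conventional layout: each mode is produced by an independent
    synthesizer and only synchronized afterwards, so the audiovisual
    coherence of the result is never jointly optimized."""
    audio = f"audio({text})"   # acoustic TTS, computed in isolation
    video = f"video({text})"   # visual TTS, computed in isolation
    return (audio, video)      # multiplexed only at the very end

def single_phase_avtts(text):
    """Single-phase layout: one selection step returns joint audiovisual
    segments, so the recorded coherence between the modes carries over."""
    segments = [f"av_segment({word})" for word in text.split()]
    return segments            # audio and video are never separated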

In this work, such a single-phase AVTTS synthesis technique was developed; it constructs the desired speech signal by concatenating audiovisual speech segments selected from a database of original audiovisual speech recordings from a single speaker. By selecting segments that contain an original combination of auditory and visual speech information, the original coherence between both speech modes is carried over as much as possible to the synthetic speech signal. Obviously, the simultaneous synthesis of the auditory and the visual speech entails additional difficulties in optimizing the individual quality of each synthetic speech mode. Nevertheless, subjective perception experiments led to the conclusion that optimizing the audiovisual coherence is indeed necessary for a high-quality perception of the synthetic audiovisual speech signal.
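The selection step can be pictured as a unit-selection search in which the join cost spans both modes at once. The following is a minimal, self-contained Python sketch under simplified assumptions (toy one-dimensional features, illustrative cost values); it is not the thesis's actual cost model.

from dataclasses import dataclass

@dataclass(frozen=True)
class AVUnit:
    phoneme: str        # phonetic label of the stored segment
    audio_feat: float   # toy acoustic feature at the segment boundary
    visual_feat: float  # toy visual feature (e.g., mouth opening)

def target_cost(unit, target):
    # mismatch between the candidate segment and the requested phoneme
    return 0.0 if unit.phoneme == target else 5.0

def join_cost(a, b):
    # concatenation discontinuity measured in BOTH modes at once, so the
    # search favours splices that are smooth in audio AND video together
    return abs(a.audio_feat - b.audio_feat) + abs(a.visual_feat - b.visual_feat)

def select_av_units(targets, database, w_join=1.0):
    """Viterbi-style search for the cheapest audiovisual segment sequence."""
    layers = []
    for i, tgt in enumerate(targets):
        layer = {}
        for unit in database:
            t = target_cost(unit, tgt)
            if i == 0:
                layer[unit] = (t, None)
            else:
                prev = min(layers[-1],
                           key=lambda p: layers[-1][p][0] + w_join * join_cost(p, unit))
                layer[unit] = (layers[-1][prev][0] + w_join * join_cost(prev, unit) + t, prev)
        layers.append(layer)
    # backtrack from the cheapest unit in the final layer
    unit = min(layers[-1], key=lambda u: layers[-1][u][0])
    path = []
    for layer in reversed(layers):
        path.append(unit)
        unit = layer[unit][1]
    return list(reversed(path))

db = [AVUnit("h", 0.1, 0.2), AVUnit("e", 0.3, 0.5),
      AVUnit("l", 0.2, 0.4), AVUnit("o", 0.6, 0.9)]
print([u.phoneme for u in select_av_units(list("helo"), db)])

In the real system the selected units carry waveform and video data rather than toy scalars, but the coupling of both modes inside one join cost is the essential point.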

In the next part of the work, it was investigated how the quality of the speech synthesized by the AVTTS system could be enhanced. First, the individual quality of the synthetic visual speech mode was improved while ensuring that the audiovisual coherence was not affected. To this end, the original visual speech from the database was parameterized using an Active Appearance Model. This parameterization enables many optimizations, such as a normalization of the original speech data and a smoothing of the synthetic visual speech without affecting the visual articulation strength. Next, the attainable synthesis quality was enhanced by providing the synthesizer with an improved speech database. For this purpose, a new extensive Dutch audiovisual speech database was constructed, containing novel high-quality acoustic and visual recordings of original speech from a single speaker.
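To illustrate how a parameterized visual track can be smoothed without flattening the articulation, here is a small self-contained Python sketch; the moving-average filter and the range-preserving rescaling are assumptions chosen for the demonstration, not the smoothing scheme used in the thesis.

import numpy as np

def smooth_visual_trajectory(params, kernel_size=5):
    """Moving-average smoothing of a parameter track (frames x parameters),
    followed by a per-parameter rescaling so the dynamic range -- a proxy
    for articulation strength -- is preserved after filtering."""
    kernel = np.ones(kernel_size) / kernel_size
    out = np.column_stack([np.convolve(params[:, j], kernel, mode="same")
                           for j in range(params.shape[1])])
    for j in range(params.shape[1]):
        orig, new = np.ptp(params[:, j]), np.ptp(out[:, j])
        if new > 0:
            mean = out[:, j].mean()
            # undo the range compression introduced by the filter
            out[:, j] = mean + (out[:, j] - mean) * (orig / new)
    return out

# toy trajectory: one noisy mouth-opening parameter over 50 frames
t = np.linspace(0, 4 * np.pi, 50)
noisy = (np.sin(t) + 0.2 * np.random.randn(50)).reshape(-1, 1)
smoothed = smooth_visual_trajectory(noisy)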

In the final part of this work, it was investigated how the AVTTS synthesis techniques could be adapted to create a novel visual speech signal matching an original auditory speech signal. For visual-only synthesis, the speech information can be described by means of either phoneme or viseme labels. The synthesis quality attainable using phonemes was compared with the quality attained using both standardized and speaker-dependent many-to-one phoneme-to-viseme mappings. In addition, novel context-dependent many-to-many phoneme-to-viseme mapping strategies were investigated and evaluated for synthesis. It was found that these novel viseme labels describe the visual speech information more accurately than phonemes do, and that they enhance the attainable synthesis quality when only a limited amount of original speech data is available.
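To make the mapping variants concrete, the sketch below pairs a many-to-one table with a context-dependent many-to-many lookup; the viseme classes are generic textbook-style groupings, not the standardized or speaker-dependent classes evaluated in this work.

# Illustrative viseme classes grouping phonemes that look alike on the lips.
MANY_TO_ONE = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar", "s": "V_alveolar",
}

def context_dependent_viseme(prev_ph, ph, next_ph, triphone_table):
    """Many-to-many mapping: the same phoneme may yield different visemes
    depending on its neighbours (coarticulation); unseen triphone contexts
    fall back to the context-free many-to-one class."""
    return triphone_table.get((prev_ph, ph, next_ph),
                              MANY_TO_ONE.get(ph, "V_neutral"))

# /t/ before a rounded vowel may look different than /t/ elsewhere
table = {("a", "t", "o"): "V_alveolar_rounded"}
print(context_dependent_viseme("a", "t", "o", table))  # V_alveolar_rounded
print(context_dependent_viseme("a", "t", "a", table))  # V_alveolar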

Logistics

Date: 18.06.2013

Time: 16:00

Location: Room D.2.01, Building D


Contact

ETRO Department

Tel: +32 2 629 29 30
