Article

Analysis by Synthesis Assessment of Speech Emotion Perception in Different Languages (en)

Day / Time: 21.03.2024, 09:00-09:20
Room: Neuer Saal
Type: Talk (structured session)
Abstract: With recent advances in neural Text-to-Speech (TTS) technology, the high naturalness and controllability of synthetic voices enable diverse industrial and public use cases, e.g., in the automotive and medical industries, public transportation, and emergency broadcast systems. While current technology can produce very natural and intelligible speech, expressive TTS still poses open challenges, e.g., with regard to synthesizing emotional speech. This study investigates how utterances synthesized in various emotional states are perceived by human listeners. In addition, different languages are examined to identify the effect of the listeners’ (decoders’) linguistic background on perception. As the main contribution, a perceptual listening test is performed in which participants listen to both natural and synthesized emotional speech and rate it according to several criteria, including the naturalness, valence, and arousal of the voice. The correlation between the emotions and the perception test results can be used as an indicator of how emotions are decoded and assessed by listeners across diverse languages. The insights from this study will be beneficial for synthesizing appropriate emotional speech, allowing expressiveness to be adjusted according to users’ needs.