Evaluation of Prosody Feature Normalization in Text-to-Speech Synthesis (en)
Abstract:
Recent neural text-to-speech models achieve high speech quality. Beyond quality, it is often desirable to control the prosodic features (pitch trajectory, speech rate, and energy) of the synthesized speech. Prior text-to-speech systems, such as FastTacotron, proposed adding prosody prediction modules. These modules estimate prosodic features, which the acoustic model then uses to predict mel-spectrograms exhibiting the desired prosodic characteristics. A benefit of such predictors is that the prosody can be changed by modifying the module's output. In our work, we train multi-speaker English acoustic models with prosody prediction modules. For the evaluation, we investigate to what extent modifications of the prosodic features are reflected in the synthesized speech and within which ranges such modifications are possible. Additionally, we analyze how pitch and energy features are entangled. In previous approaches, prosodic features are often normalized. With a multi-speaker model, there are several options for the normalization: the features can be left unnormalized, normalized using separate statistics for each speaker, or normalized using statistics derived from the complete dataset. As our main contribution, we investigate how these normalization methods compare with respect to the controllability of pitch in the synthesized audio samples.
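The three normalization options can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `zscore` and `normalize_pitch`, the mode labels, and the choice of z-score normalization (subtract the mean, divide by the standard deviation) are assumptions made for illustration, and the input is assumed to contain voiced pitch values only.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        raise ValueError("cannot normalize values with zero variance")
    return [(v - mean) / std for v in values]


def normalize_pitch(f0, speaker_ids, mode):
    """Normalize pitch values (hypothetical helper, illustration only).

    mode="none":    return the values unchanged
    mode="global":  use statistics of the complete dataset
    mode="speaker": use separate statistics for each speaker
    """
    if len(f0) != len(speaker_ids):
        raise ValueError("f0 and speaker_ids must have the same length")
    if mode == "none":
        return list(f0)
    if mode == "global":
        return zscore(f0)
    if mode == "speaker":
        # Collect the positions belonging to each speaker.
        positions = defaultdict(list)
        for idx, spk in enumerate(speaker_ids):
            positions[spk].append(idx)
        out = [0.0] * len(f0)
        for indices in positions.values():
            normalized = zscore([f0[i] for i in indices])
            for idx, value in zip(indices, normalized):
                out[idx] = value
        return out
    raise ValueError(f"unknown mode: {mode}")


# Toy data: two speakers with different pitch ranges (values in Hz).
f0 = [100.0, 120.0, 200.0, 240.0]
speakers = ["a", "a", "b", "b"]
per_speaker = normalize_pitch(f0, speakers, "speaker")
global_norm = normalize_pitch(f0, speakers, "global")
```

With per-speaker statistics, both toy speakers map to the same normalized range, whereas with dataset-wide statistics the lower-pitched speaker stays below the higher-pitched one. A fixed offset applied to a normalized pitch value therefore corresponds to a different change in Hz depending on the normalization variant.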