Article

Lyrics Transcription with Whisper - Evaluating a Large Automatic Speech Recognition Model with Music (en)

Day / Time: 20.03.2024, 17:20-17:40
Room: Neuer Saal
Type: Talk (structured session)
Abstract: Whisper is a state-of-the-art open-source automatic speech recognition (ASR) system based on a transformer model and trained in a supervised way on 680,000 hours of speech. In this contribution, we evaluate Whisper for lyrics transcription, a task that differs considerably from speech transcription because singing varies more than speech in salience, rhythm, and fundamental frequency. To investigate the abilities and limitations of the system across a wide range of music styles and recording conditions, we transcribe lyrics for two datasets with reference annotations: the "Schubert Winterreise" dataset (comprising nine commercial recordings of a 24-song cycle by Franz Schubert) and the "Larynx Microphone Singer-Songwriter" dataset (a collection of twelve multi-track recordings of pop song covers with guitar accompaniment). Analyzing the transcription results across songs, versions, and recording conditions leads to several observations. In particular, Whisper achieves a notably low word error rate (WER) on both datasets, demonstrating its suitability for a task it has not been explicitly trained on. However, the WER depends on various factors, including signal quality and voice salience, and some sporadic transcription artifacts (e.g., copyright notices) give insight into the presence and effects of music in the training data.
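For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) between the reference lyrics and the transcription, divided by the number of reference words. The function names `normalize` and `word_error_rate` and the text normalization (lowercasing, stripping punctuation) are assumptions made for illustration; the abstract does not state which normalization or tooling the authors used.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation (an assumed normalization, not the authors')."""
    text = text.lower()
    # Keep letters (including umlauts), digits, whitespace, and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "Fremd bin ich eingezogen"
    hypothesis = "fremd bin ich eingezogen."
    print(word_error_rate(reference, hypothesis))
```

Because inserted words count as errors, a spurious artifact appended to an otherwise correct transcription (such as a copyright notice) raises the WER, and the value can exceed 1 when the hypothesis is much longer than the reference.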