TY - GEN
T1 - A Comprehensive Exploration of Fine-Tuning WavLM for Enhancing Speech Emotion Recognition
AU - Ali, Fadel
AU - Arymurthy, Aniati Murni
AU - Prasojo, Radityo Eko
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Speech Emotion Recognition (SER) is a pivotal area in Human-Computer Interaction (HCI) with numerous applications. Traditional SER models rely on supervised learning but face challenges due to limited labeled data and the subjective nature of emotions. Self-supervised learning (SSL) offers an alternative by leveraging unlabeled audio data. WavLM, a large-scale SSL audio model, has shown promise in various speech processing tasks. This paper investigates the fine-tuning of WavLM for SER, analyzing the impact of fine-tuning different segments of layers on performance. We conducted experiments on the IEMOCAP and RAVDESS datasets, comparing various fine-tuned WavLM models with Wav2Vec 2.0 and HuBERT as SSL baselines. Results reveal that WavLM outperforms the SSL baselines, with the all-layer fine-tuned WavLM achieving state-of-the-art results. Interestingly, fine-tuning only the top layers significantly enhances performance, suggesting their role in encoding paralinguistic information. However, fine-tuning all layers remains superior. These findings shed light on optimizing SSL audio models for SER and highlight WavLM's potential in emotion recognition.
KW - audio model
KW - fine-tuning
KW - self-supervised learning
KW - speech emotion recognition
KW - WavLM
UR - http://www.scopus.com/inward/record.url?scp=85190069968&partnerID=8YFLogxK
U2 - 10.1109/ISRITI60336.2023.10467733
DO - 10.1109/ISRITI60336.2023.10467733
M3 - Conference contribution
AN - SCOPUS:85190069968
T3 - 6th International Seminar on Research of Information Technology and Intelligent Systems, ISRITI 2023 - Proceeding
SP - 295
EP - 300
BT - 6th International Seminar on Research of Information Technology and Intelligent Systems, ISRITI 2023 - Proceeding
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Seminar on Research of Information Technology and Intelligent Systems, ISRITI 2023
Y2 - 11 December 2023
ER -