TY - GEN
T1 - Machine Speech Chain with Emotion Recognition
AU - Naufal, Akeyla Pradia
AU - Lestari, Dessi Puji
AU - Purwarianti, Ayu
AU - Azizah, Kurniawati
AU - Tanaya, Dipta
AU - Sakti, Sakriani
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Developing natural speech recognition and speech synthesis systems requires speech data that authentically represents real emotions. However, this type of data is often challenging to obtain. The machine speech chain offers a solution to this challenge by using unpaired data to continue training models initially trained with paired data. Given the relative abundance of unpaired data compared to paired data, the machine speech chain can be instrumental in recognizing emotions in speech when training data is limited. This study investigates the application of the machine speech chain to speech emotion recognition and to speech recognition of emotional speech. Our findings indicate that a model trained with 50% paired neutral-emotion speech data and 22% paired non-neutral emotional speech data shows a reduction in Character Error Rate (CER) from 37.55% to 34.52% when further trained with unpaired neutral-emotion speech data. The CER decreases further, to 33.75%, when the model is additionally trained with combined unpaired speech data. The accuracy of recognizing non-neutral emotions ranged from 2.18% to 53.51%, while the F1 score fluctuated, increasing by up to 20.6% and decreasing by up to 23.4%. These results suggest that the model is biased towards the majority class, as reflected by the values of the two metrics.
AB - Developing natural speech recognition and speech synthesis systems requires speech data that authentically represents real emotions. However, this type of data is often challenging to obtain. The machine speech chain offers a solution to this challenge by using unpaired data to continue training models initially trained with paired data. Given the relative abundance of unpaired data compared to paired data, the machine speech chain can be instrumental in recognizing emotions in speech when training data is limited. This study investigates the application of the machine speech chain to speech emotion recognition and to speech recognition of emotional speech. Our findings indicate that a model trained with 50% paired neutral-emotion speech data and 22% paired non-neutral emotional speech data shows a reduction in Character Error Rate (CER) from 37.55% to 34.52% when further trained with unpaired neutral-emotion speech data. The CER decreases further, to 33.75%, when the model is additionally trained with combined unpaired speech data. The accuracy of recognizing non-neutral emotions ranged from 2.18% to 53.51%, while the F1 score fluctuated, increasing by up to 20.6% and decreasing by up to 23.4%. These results suggest that the model is biased towards the majority class, as reflected by the values of the two metrics.
KW - machine speech chain
KW - speech emotion recognition
KW - speech recognition
KW - unpaired data
UR - http://www.scopus.com/inward/record.url?scp=85214672490&partnerID=8YFLogxK
U2 - 10.1109/ICAICTA63815.2024.10763258
DO - 10.1109/ICAICTA63815.2024.10763258
M3 - Conference contribution
AN - SCOPUS:85214672490
T3 - 2024 11th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2024
BT - 2024 11th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 11th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2024
Y2 - 28 September 2024 through 30 September 2024
ER -