TY - JOUR
T1 - Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
AU - Azizah, Kurniawati
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep neural network based text-to-speech (TTS) technology has brought advances in speech synthesis approaching the quality of human speech. Zero-shot voice cloning TTS is a system that accepts text and a few seconds of a sample of the target speaker's voice as input and produces speech waveforms similar to the target speaker's voice. Some of the latest zero-shot voice cloning TTS studies still focus on normal human voices. However, this technology still has limitations for individuals with speech disorders such as dysphonia. We observe that our baseline zero-shot TTS model applied to the dysphonia domain still performs poorly on the following aspects: speaker similarity, intelligibility or clarity of speech, and speech sound quality. This research develops 24 zero-shot voice cloning TTS models to determine which models can improve the baseline model's performance in the dysphonia domain. We propose four categories of changes to the baseline model architecture and settings: text sequence input level (grapheme, phoneme, or a combination of grapheme and phoneme), speaker embedding type (speaker encoder or speaker model), speaker embedding position (at the TTS encoder only, or at both the TTS encoder and decoder), and loss function (without or with speaker consistency loss). The experimental results show that the best model uses the following configuration: grapheme-phoneme-level text sequences, a speaker model as the speaker embedding, the speaker embedding placed at the TTS encoder only, and speaker consistency loss added to the frame-level speech loss. Compared to the baseline model, our proposed best model improves speaker cosine similarity (COS), speech intelligibility (CER), and speech sound quality (MOS) in the domain of dysphonia speech disorders by 0.197, 0.55%, and 0.244, respectively. Compared with the original voices of speakers with dysphonia, the best model also increases speech intelligibility and speech sound quality by 13.45% and 0.22, respectively.
AB - Deep neural network based text-to-speech (TTS) technology has brought advances in speech synthesis approaching the quality of human speech. Zero-shot voice cloning TTS is a system that accepts text and a few seconds of a sample of the target speaker's voice as input and produces speech waveforms similar to the target speaker's voice. Some of the latest zero-shot voice cloning TTS studies still focus on normal human voices. However, this technology still has limitations for individuals with speech disorders such as dysphonia. We observe that our baseline zero-shot TTS model applied to the dysphonia domain still performs poorly on the following aspects: speaker similarity, intelligibility or clarity of speech, and speech sound quality. This research develops 24 zero-shot voice cloning TTS models to determine which models can improve the baseline model's performance in the dysphonia domain. We propose four categories of changes to the baseline model architecture and settings: text sequence input level (grapheme, phoneme, or a combination of grapheme and phoneme), speaker embedding type (speaker encoder or speaker model), speaker embedding position (at the TTS encoder only, or at both the TTS encoder and decoder), and loss function (without or with speaker consistency loss). The experimental results show that the best model uses the following configuration: grapheme-phoneme-level text sequences, a speaker model as the speaker embedding, the speaker embedding placed at the TTS encoder only, and speaker consistency loss added to the frame-level speech loss. Compared to the baseline model, our proposed best model improves speaker cosine similarity (COS), speech intelligibility (CER), and speech sound quality (MOS) in the domain of dysphonia speech disorders by 0.197, 0.55%, and 0.244, respectively. Compared with the original voices of speakers with dysphonia, the best model also increases speech intelligibility and speech sound quality by 13.45% and 0.22, respectively.
KW - Deep neural network
KW - dysphonia
KW - speaker consistency loss
KW - speaker encoder
KW - speaker model
KW - text-to-speech
KW - voice cloning
KW - zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85192200067&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3396377
DO - 10.1109/ACCESS.2024.3396377
M3 - Article
AN - SCOPUS:85192200067
SN - 2169-3536
VL - 12
SP - 63528
EP - 63547
JO - IEEE Access
JF - IEEE Access
ER -