TY - GEN
T1 - Generating Speech with Prosodic Prominence based on SSL-Visually Grounded Models
AU - Ika Hartanti, Bella Septina
AU - Tanaya, Dipta
AU - Azizah, Kurniawati
AU - Lestari, Dessi Puji
AU - Purwarianti, Ayu
AU - Sakti, Sakriani
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Despite many existing works that address expressive speech synthesis with a desired prosody, few have focused on generating speech with prosody prominence. Most previous studies addressing this issue generate speech from given text labels with a contrastive focus emphasizing a specific word. In contrast, this paper investigates whether we can control prosody based on the contrastive focus that appears in images. Given an image and its caption, our system first discovers spoken terms associated with objects or situations in natural images based on a self-supervised visually grounded model. It then generates speech with prosody prominence based on the contrastive focus of these spoken terms in a way that best describes the images. The framework can perform the task with or without text annotation, making it applicable to untranscribed, unsegmented speech utterances in unknown languages.
AB - Despite many existing works that address expressive speech synthesis with a desired prosody, few have focused on generating speech with prosody prominence. Most previous studies addressing this issue generate speech from given text labels with a contrastive focus emphasizing a specific word. In contrast, this paper investigates whether we can control prosody based on the contrastive focus that appears in images. Given an image and its caption, our system first discovers spoken terms associated with objects or situations in natural images based on a self-supervised visually grounded model. It then generates speech with prosody prominence based on the contrastive focus of these spoken terms in a way that best describes the images. The framework can perform the task with or without text annotation, making it applicable to untranscribed, unsegmented speech utterances in unknown languages.
KW - multimodal learning
KW - prosody prominence
KW - self-supervised visual grounded models
KW - text-to-speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85190534577&partnerID=8YFLogxK
U2 - 10.1109/O-COCOSDA60357.2023.10482965
DO - 10.1109/O-COCOSDA60357.2023.10482965
M3 - Conference contribution
AN - SCOPUS:85190534577
T3 - Proceedings of 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023
BT - Proceedings of 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023
Y2 - 4 December 2023 through 6 December 2023
ER -