Generating Speech with Prosodic Prominence based on SSL-Visually Grounded Models

Bella Septina Ika Hartanti, Dipta Tanaya, Kurniawati Azizah, Dessi Puji Lestari, Ayu Purwarianti, Sakriani Sakti

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Although many existing works address expressive speech synthesis with a desired prosody, few focus on generating speech with prosody prominence. Most previous studies addressing this issue generate speech from given text labels with a contrastive focus emphasizing a specific word. In contrast, this paper investigates whether we can control prosody based on the contrastive focus that appears in images. Given an image and its caption, our system first discovers spoken terms associated with objects or situations in natural images based on a self-supervised visually grounded model. Then it generates speech with prosody prominence based on the contrastive focus of these spoken terms in a way that best describes the images. The framework can perform the task with or without text annotation, making it applicable to untranscribed, unsegmented speech utterances in unknown languages.

Original language: English
Title of host publication: Proceedings of 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350344028
Publication status: Published - 2023
Event: 26th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023 - Delhi, India
Duration: 4 Dec 2023 to 6 Dec 2023

Publication series

Name: Proceedings of 2023 26th Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023

Conference

Conference: 26th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2023
Country/Territory: India
City: Delhi
Period: 4/12/23 to 6/12/23

Keywords

  • multimodal learning
  • prosody prominence
  • self-supervised visual grounded models
  • text-to-speech synthesis
