Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

Kurniawati Azizah, Wisnu Jatmiko

Research output: Contribution to journal › Article › peer-review

Abstract

Deep neural network (DNN)-based systems generally require large amounts of training data, so they suffer from data scarcity in low-resource languages. Recent studies have succeeded in building zero-shot multi-speaker DNN-based text-to-speech (TTS) for high-resource languages, but their performance on unseen speakers remains unsatisfactory. This study addresses two main problems: overcoming data scarcity in DNN-based TTS for low-resource languages and improving the performance of zero-shot speaker adaptation for unseen speakers. We propose a novel multi-stage transfer learning strategy using partial network-based deep transfer learning to overcome the low-resource problem, utilizing a pre-trained monolingual single-speaker TTS model and a d-vector speaker encoder on a high-resource language as the source domain. To improve the performance of zero-shot speaker adaptation, we propose a new TTS model that incorporates explicit style control from the target speaker for TTS conditioning and an utterance-level speaker reconstruction loss during TTS training. We use publicly available speech datasets for our experiments. We show that the proposed training strategy effectively trains TTS models using a limited amount of training data in low-resource target languages. Models trained with the proposed transfer learning produce intelligible, natural-sounding speech, whereas models trained with standard training fail to synthesize understandable speech. We also demonstrate that our proposed style encoder network and speaker reconstruction loss significantly improve speaker similarity in the zero-shot speaker adaptation task compared to the baseline model. Overall, the proposed TTS model and training strategy increase the speaker cosine similarity of speech synthesized for unseen test speakers by 0.468 and 0.266 in native and foreign languages, respectively.
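The utterance-level speaker reconstruction loss described in the abstract can be illustrated with a minimal sketch. The idea, as the abstract presents it, is to compare a speaker embedding (d-vector) extracted from the target speaker's reference utterance with one extracted from the synthesized utterance; the function names and the exact loss formulation below are assumptions for illustration, not the paper's actual implementation, which may combine this term differently with the TTS training objective.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (d-vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_reconstruction_loss(dvec_reference: np.ndarray,
                                dvec_synthesized: np.ndarray) -> float:
    """Hypothetical utterance-level loss: 0 when the synthesized
    utterance's d-vector matches the reference speaker's d-vector,
    growing as the embeddings diverge."""
    return 1.0 - cosine_similarity(dvec_reference, dvec_synthesized)

# Toy example: identical embeddings give zero loss.
ref = np.array([0.6, 0.8])
syn = np.array([0.6, 0.8])
print(speaker_reconstruction_loss(ref, syn))
```

In a real training loop this scalar would be added, typically with a weighting coefficient, to the spectrogram reconstruction loss so that the synthesizer is penalized for drifting away from the target speaker's voice.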

Original language: English
Pages (from-to): 5895-5911
Number of pages: 17
Journal: IEEE Access
Volume: 10
DOIs
Publication status: Published - 7 Jan 2022

Keywords

  • adaptation models
  • data models
  • deep learning
  • deep neural network
  • low-resource
  • multi-speaker
  • multilingual
  • partial network-based deep transfer learning
  • speaker reconstruction loss
  • style control
  • task analysis
  • text-to-speech
  • training
  • training data
  • transfer learning
  • zero-shot speaker adaptation

