NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

42 Citations (Scopus)

Abstract

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is widely available only for high-resource languages such as English and Mandarin Chinese, and remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages of Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes sentiment and machine translation datasets, and bilingual lexicons. We provide extensive analysis and describe challenges for creating such resources. Our hope is that this work will spark more NLP research on Indonesian and other underrepresented languages.

Original language: English
Title of host publication: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 815-834
Number of pages: 20
ISBN (Electronic): 9781959429449
Publication status: Published - 2023
Event: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 to 6 May 2023

Conference

Conference: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory: Croatia
City: Dubrovnik
Period: 2/05/23 to 6/05/23
