Analysis and Mitigation of Religion Bias in Indonesian Natural Language Processing Datasets

Research output: Contribution to journal > Article > peer-review

Abstract

Previous studies have shown that various religious identities are misrepresented in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been documented to harm those identities in applications such as automated content moderation, and so must be mitigated. In this paper, we analyze, for the first time, several Indonesian NLP datasets to determine whether they contain unwanted bias and to assess the effects of debiasing them. We find that two of the three datasets analyzed contain unwanted bias, whose effects trickle down to downstream performance in the form of allocation and representation harm. Debiasing at the dataset level, applied in response to the biases discovered, consistently yields positive results for the respective dataset. At the downstream performance level, however, the results vary greatly depending on the dataset and embedding used to train the model. In particular, the same debiasing technique can decrease bias for one combination of dataset and embedding yet increase it for another, particularly in the case of representation harm.
Original language: English
Pages (from-to): 845-857
Journal: Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
Volume: 7
Issue number: 4
Publication status: Published - 12 Aug 2023
