TY - JOUR
T1 - Analysis and Mitigation of Religion Bias in Indonesian Natural Language Processing Datasets
AU - Saptawijaya, Ari
PY - 2023/8/12
Y1 - 2023/8/12
N2 - Previous studies have shown the existence of misrepresentation of various religious identities in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been recorded to inflict harm against those identities, for example in automated content moderation, and as such must be mitigated. In this paper, we analyze, for the first time, several Indonesian NLP datasets to determine whether they contain unwanted bias and to study the effects of debiasing on them. We find that two of the three datasets analyzed in this study contain unwanted bias, whose effects trickle down to downstream performance in the form of allocation and representation harm. Debiasing at the dataset level, applied in response to the biases previously discovered, is consistently positive for the respective dataset. However, its effects on downstream performance vary greatly depending on the dataset and embedding used to train the model. In particular, the same debiasing technique can decrease bias on one combination of dataset and embedding, yet increase bias on another, particularly in the case of representation harm.
AB - Previous studies have shown the existence of misrepresentation of various religious identities in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been recorded to inflict harm against those identities, for example in automated content moderation, and as such must be mitigated. In this paper, we analyze, for the first time, several Indonesian NLP datasets to determine whether they contain unwanted bias and to study the effects of debiasing on them. We find that two of the three datasets analyzed in this study contain unwanted bias, whose effects trickle down to downstream performance in the form of allocation and representation harm. Debiasing at the dataset level, applied in response to the biases previously discovered, is consistently positive for the respective dataset. However, its effects on downstream performance vary greatly depending on the dataset and embedding used to train the model. In particular, the same debiasing technique can decrease bias on one combination of dataset and embedding, yet increase bias on another, particularly in the case of representation harm.
UR - http://www.jurnal.iaii.or.id/index.php/RESTI/article/view/5035
U2 - 10.29207/resti.v7i4.5035
DO - 10.29207/resti.v7i4.5035
M3 - Article
SN - 2580-0760
VL - 7
SP - 845
EP - 857
JO - Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
JF - Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
IS - 4
ER -