TY - GEN
T1 - Multi-label Classification for Hate Speech and Abusive Language in Indonesian-Local Languages
AU - Asti, Ajeng Dwi
AU - Budi, Indra
AU - Ibrohim, Muhammad Okky
N1 - Funding Information:
The authors gratefully thank Universitas Indonesia for the International Publication Grants (PUTI Q2) No. NKB-1475/UN2.RST/HKP.05.00/2020.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Each instance of hate speech has a target, category, and level that need to be detected to help the authorities prioritize which hate speech cases to handle first. Various studies have been conducted in Indonesia on abusive language and hate speech and their targets, categories, and levels, but only in Indonesian and English. However, the many local languages in Indonesia open up opportunities for hate speech to be expressed in those languages. This study aims to compare some of the best machine learning algorithms, transformation methods, and feature extraction techniques for classifying abusive language and hate speech and their targets, categories, and levels, using Twitter data in Indonesian and local languages. This study uses the five local languages in Indonesia with the most speakers: Javanese, Sundanese, Madurese, Minangkabau, and Musi (Palembang). The algorithms used are Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), and Random Forest Decision Tree (RFDT), with Binary Relevance (BR), Classifier Chains (CC), and Label Powerset (LP) as transformation methods. The term weighting used in this study is TF-IDF with word n-gram and character n-gram features. The results showed that the SVM algorithm with the CC transformation method and unigram feature extraction gave the highest F1-scores: 66.33% for Javanese and 65.68% for Sundanese. For the Madurese, Minangkabau, and Musi data, the best F1-scores were obtained using the RFDT algorithm with the CC transformation method and unigram feature extraction: 76.37%, 80.75%, and 77.34%, respectively.
KW - hate speech
KW - Indonesian local language
KW - multi-label classification
KW - Twitter
UR - http://www.scopus.com/inward/record.url?scp=85123853559&partnerID=8YFLogxK
U2 - 10.1109/ICACSIS53237.2021.9631316
DO - 10.1109/ICACSIS53237.2021.9631316
M3 - Conference contribution
AN - SCOPUS:85123853559
T3 - 2021 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2021
BT - 2021 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2021
Y2 - 23 October 2021 through 26 October 2021
ER -