Hate speech detection in the Indonesian language: A dataset and preliminary study

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, Yudo Ekanata

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

150 Citations (Scopus)

Abstract

The objective of our work is to detect hate speech in the Indonesian language. As far as we know, research on this subject is still very rare. The only prior work we found created a dataset for hate speech against religion, but the quality of that dataset is inadequate. Our research aimed to create a new dataset that covers hate speech in general, including hatred for religion, race, ethnicity, and gender. In addition, we conducted a preliminary study using a machine learning approach, which so far is the most frequently used approach for classifying text. We compared the performance of several features and machine learning algorithms for hate speech detection. The features extracted were word n-grams with n=1 and n=2, character n-grams with n=3 and n=4, and negative sentiment. The classification was performed using Naïve Bayes, Support Vector Machine, Bayesian Logistic Regression, and Random Forest Decision Tree. An F-measure of 93.5% was achieved when using the word n-gram feature with the Random Forest Decision Tree algorithm. Results also show that the word n-gram feature outperformed the character n-gram feature.
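The n-gram features named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the `w1:`/`c3:` key prefixes are assumptions made for illustration, and the negative-sentiment feature and the classifiers are omitted.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams from whitespace-split, lowercased text (assumed tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams over the lowercased text, spaces included."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def extract_features(text):
    """Count word n-grams (n=1, 2) and character n-grams (n=3, 4) in one bag."""
    features = Counter()
    for n in (1, 2):
        features.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in (3, 4):
        features.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return features


if __name__ == "__main__":
    # Hypothetical neutral sample sentence, not taken from the paper's dataset.
    print(extract_features("saya suka makan nasi"))
```

The resulting counts could be vectorized and passed to any of the classifiers listed in the abstract; how the paper weighted or combined these features is not stated here.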

Original language: English
Title of host publication: 2017 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 233-237
Number of pages: 5
ISBN (Electronic): 9781538631720
Publication status: Published - 2 Jul 2017
Event: 9th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017 - Jakarta, Indonesia
Duration: 28 Oct 2017 to 29 Oct 2017

Publication series

Name: 2017 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
Volume: 2018-January

Conference

Conference: 9th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
Country/Territory: Indonesia
City: Jakarta
Period: 28/10/17 to 29/10/17

Keywords

  • building dataset
  • classification
  • hate speech detection
  • machine learning
