A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media

Muhammad Okky Ibrohim, Indra Budi

Research output: Contribution to journal › Conference article › peer-review

80 Citations (Scopus)


Abusive language is an expression (spoken or written) that contains abusive or dirty words or phrases, whether in the context of jokes, vulgar sexual conversation, or cursing someone. Nowadays many people on the internet (netizens) write and post abusive language on social media such as Facebook, Line, and Twitter. Detecting abusive language in social media is a difficult problem because it cannot be solved by word matching alone. This paper presents a preliminary study of abusive language detection in Indonesian social media and discusses the challenges in developing a system for Indonesian abusive language detection, especially in social media. We also report an experiment on abusive language detection in Indonesian tweets using a machine learning approach with simple word n-gram and character n-gram features. We use Naive Bayes, Support Vector Machine, and Random Forest Decision Tree classifiers to identify whether a tweet is not abusive language, abusive but not offensive, or offensive language. The experimental results show that the Naive Bayes classifier with the combination of word unigram + bigram features gives the best result, an F1-score of 70.06%. However, when tweets are classified into only two labels (not abusive language and abusive language), every classifier we used gives a higher result (an F1-score of more than 83% for every classifier). The dataset used in this experiment is available to other researchers interested in improving on this study.
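To make the approach in the abstract concrete, the following is a minimal sketch of three-way tweet classification with word unigram + bigram features and a Naive Bayes classifier. It is not the authors' implementation: the function and class names, the add-one smoothing, the whitespace tokenisation, and the toy English training sentences (with `BADWORD` standing in for any abusive term) are all assumptions made for illustration, and the paper's dataset, preprocessing, and character n-gram features are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_values=(1, 2)):
    """Return word unigrams and bigrams from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an illustrative choice)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = word_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in word_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; BADWORD is a placeholder for an abusive term.
train = [
    ("have a nice day friend", "not_abusive"),
    ("the weather is nice today", "not_abusive"),
    ("BADWORD i forgot my keys again", "abusive_not_offensive"),
    ("BADWORD this traffic is slow", "abusive_not_offensive"),
    ("you are a BADWORD", "offensive"),
    ("you BADWORD go away", "offensive"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])

print(model.predict("you are a BADWORD person"))
print(model.predict("BADWORD my keys"))
print(model.predict("nice weather today"))
```

The toy data also hints at why word matching alone is insufficient: the same placeholder term appears in both the "abusive but not offensive" and "offensive" classes, so the surrounding n-grams (for example "you are") have to carry the distinction.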

Original language: English
Pages (from-to): 222-229
Number of pages: 8
Journal: Procedia Computer Science
Publication status: Published - 2018
Event: 3rd International Conference on Computer Science and Computational Intelligence, ICCSCI 2018 - Tangerang, Indonesia
Duration: 7 Sept 2018 to 8 Sept 2018


  • abusive language
  • machine learning
  • twitter

