A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media

Muhammad Okky Ibrohim, Indra Budi

Research output: Contribution to journal, Conference article, peer-review

28 Citations (Scopus)

Abstract

Abusive language is an expression (oral or written) that contains abusive or dirty words or phrases, whether in the context of jokes, vulgar sexual conversation, or cursing someone. Nowadays many internet users (netizens) write and post abusive language on social media such as Facebook, Line, and Twitter. Detecting abusive language in social media is difficult because it cannot be solved by word matching alone. This paper presents a preliminary study of abusive language detection in Indonesian social media and discusses the challenges of developing such a detection system. We also report an experiment on abusive language detection in Indonesian tweets using a machine learning approach with simple word n-gram and character n-gram features. We use Naive Bayes, Support Vector Machine, and Random Forest Decision Tree classifiers to label each tweet as not abusive language, abusive but not offensive, or offensive language. The experimental results show that the Naive Bayes classifier with combined word unigram + bigram features gives the best result, an F1-score of 70.06%. However, when tweets are classified into only two labels (not abusive language and abusive language), every classifier we used gives a higher result (an F1-score above 83%). The dataset used in this experiment is available to other researchers interested in improving on this study.
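The feature and classifier setup described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the multinomial variant of Naive Bayes, the add-one smoothing, the whitespace tokenizer, the label names, and the toy English training texts (with `BADWORD` as a placeholder token) are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_values=(1, 2)):
    """Return word n-grams (unigrams + bigrams by default) of a text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenizer
    return [" ".join(tokens[i:i + n])
            for n in n_values
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_values=(3,)):
    """Return character n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for n in n_values for i in range(len(s) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing (assumed variant)."""

    def __init__(self, featurizer=word_ngrams):
        self.featurizer = featurizer

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = self.featurizer(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for feat in self.featurizer(text):
                if feat in self.vocab:  # skip features unseen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def to_binary(label):
    """Collapse the three labels into the two-label setting."""
    return "not_abusive" if label == "not_abusive" else "abusive"


# Hypothetical toy data; BADWORD stands in for an abusive term.
train_texts = [
    "nice weather today",
    "wow BADWORD that goal was great",
    "you are a BADWORD idiot",
    "have a nice day",
    "BADWORD this traffic is crazy",
    "you BADWORD go away",
]
train_labels = [
    "not_abusive", "abusive_not_offensive", "offensive",
    "not_abusive", "abusive_not_offensive", "offensive",
]

model = MultinomialNB(featurizer=word_ngrams).fit(train_texts, train_labels)
print(model.predict("have a nice weekend"))
print(to_binary(model.predict("you BADWORD")))
```

Passing `featurizer=char_ngrams` switches the same classifier to character n-gram features. The toy example also hints at why word matching alone falls short: the placeholder term appears in both the abusive-but-not-offensive and the offensive examples, so the surrounding n-grams have to carry the distinction.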

Original language: English
Pages (from-to): 222-229
Number of pages: 8
Journal: Procedia Computer Science
Volume: 135
Publication status: Published - 1 Jan 2018
Event: 3rd International Conference on Computer Science and Computational Intelligence, ICCSCI 2018 - Tangerang, Indonesia
Duration: 7 Sep 2018 to 8 Sep 2018

Keywords

  • abusive language
  • machine learning
  • twitter
