Building Indonesian local language detection tools using Wikipedia data

Puji Martadinata, Bayu Distiawan Trisedya, Hisar Maruli Manurung, Mirna Adriani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)


The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The vast diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts written in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments comparing three popular methods for building language identification tools: N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and a social media data set that we constructed for testing, show that the statistical method obtains the best result among the methods compared.
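To make the N-gram idea mentioned in the abstract concrete, the following is a minimal sketch of one common variant: each language is represented by a profile of character trigram counts, and a post is assigned to the language whose profile is closest by cosine similarity. This is not the authors' implementation, and the abstract does not specify which N-gram variant they used. The function names, the language codes, and the toy sentences are hypothetical illustrations; the sentences are not taken from the paper's training or test data and may not be fully idiomatic.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def train(samples: dict, n: int = 3) -> dict:
    """Build one n-gram profile per language from raw training text."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}


def identify(text: str, profiles: dict, n: int = 3) -> str:
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy sentences for illustration only (assumption: not the
# paper's data). A real system would train on far larger corpora.
TOY_SAMPLES = {
    "indonesian": "saya tidak tahu apa yang terjadi di sana",
    "javanese": "aku ora ngerti apa sing kedadeyan ing kana",
    "sundanese": "abdi henteu terang naon anu kajantenan di ditu",
    "minangkabau": "ambo indak tahu apo nan tajadi di sinan",
}

PROFILES = train(TOY_SAMPLES)
```

A call such as `identify("ora ngerti", PROFILES)` returns the label of the closest toy profile. With training text this small the result is only illustrative; the accuracy comparison reported in the abstract depends on the authors' own data and models.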

Original language: English
Title of host publication: Worldwide Language Service Infrastructure - 2nd International Workshop, WLSI 2015, Revised Selected Papers
Editors: Donghui Lin, Yohei Murakami
Publisher: Springer Verlag
Number of pages: 11
ISBN (Print): 9783319314679
Publication status: Published - 2016
Event: 2nd International Workshop on Worldwide Language Service Infrastructure, WLSI 2015 - Kyoto, Japan
Duration: 22 Jan 2015 to 23 Jan 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 2nd International Workshop on Worldwide Language Service Infrastructure, WLSI 2015


  • Language identification
  • Language model
  • N-gram
  • Statistical method
  • Twitter
  • Wikipedia

