TY - GEN
T1 - Building Indonesian local language detection tools using Wikipedia data
AU - Martadinata, Puji
AU - Trisedya, Bayu Distiawan
AU - Manurung, Hisar Maruli
AU - Adriani, Mirna
N1 - Publisher Copyright: © Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
AB - The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The vast diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments to compare three popular methods for building language identification tools, namely N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and social media data that we constructed for testing, show that the statistical method obtains the best result among the methods compared.
KW - Language identification
KW - Language model
KW - N-gram
KW - Statistical method
KW - Twitter
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=84961712303&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-31468-6_8
DO - 10.1007/978-3-319-31468-6_8
M3 - Conference contribution
AN - SCOPUS:84961712303
SN - 9783319314679
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 113
EP - 123
BT - Worldwide Language Service Infrastructure - 2nd International Workshop, WLSI 2015, Revised Selected Papers
A2 - Lin, Donghui
A2 - Murakami, Yohei
PB - Springer Verlag
T2 - 2nd International Workshop on Worldwide Language Service Infrastructure, WLSI 2015
Y2 - 22 January 2015 through 23 January 2015
ER -