Normalization of Indonesian-English Code-Mixed Twitter Data

Anab Maulana Barik, Rahmad Mahendra, Mirna Adriani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

38 Citations (Scopus)

Abstract

Twitter is an excellent source of data for NLP research, as it offers a tremendous amount of textual data. However, processing tweets to extract meaningful information is very challenging, for at least two reasons: (i) the use of nonstandard words and an informal writing style, and (ii) code-mixing, i.e., the combination of multiple languages in a single tweet. Most previous work has addressed these two issues as separate, isolated tasks. In this study, we work on the normalization task for code-mixed Twitter data, specifically Indonesian-English. We propose a pipeline that consists of four modules: tokenization, language identification, lexical normalization, and translation. Another contribution is a gold standard of Indonesian-English code-mixed data for each module.
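
The four modules named in the abstract can be chained as in the minimal sketch below. It is only an illustration of the pipeline order, not the authors' implementation: every lexicon entry, function name, and language tag ('en', 'id', 'other') is hypothetical, each module is reduced to a dictionary lookup, and the choice to translate English tokens into Indonesian is an assumption, since the abstract does not state the translation direction.

```python
import re

# Toy lexicons invented for illustration; not taken from the paper's data.
LANGUAGE_OF = {"pls": "en", "meeting": "en", "tomorrow": "en",
               "gue": "id", "bgt": "id", "capek": "id"}
NORMAL_FORM = {"pls": "please", "gue": "saya", "bgt": "banget"}
EN_TO_ID = {"please": "tolong", "meeting": "rapat", "tomorrow": "besok"}


def tokenize(tweet):
    """Module 1: split a tweet into lowercase word and punctuation tokens."""
    return re.findall(r"[@#]?\w+|[^\w\s]", tweet.lower())


def identify_language(tokens):
    """Module 2: tag each token 'en', 'id', or 'other' by lexicon lookup."""
    return [(tok, LANGUAGE_OF.get(tok, "other")) for tok in tokens]


def normalize(tagged):
    """Module 3: map nonstandard tokens to a standard form, keeping the tag."""
    return [(NORMAL_FORM.get(tok, tok), lang) for tok, lang in tagged]


def translate(tagged):
    """Module 4 (assumed direction): replace English tokens with Indonesian."""
    return [EN_TO_ID.get(tok, tok) if lang == "en" else tok
            for tok, lang in tagged]


def run_pipeline(tweet):
    """Apply the four modules in the order listed in the abstract."""
    return translate(normalize(identify_language(tokenize(tweet))))


if __name__ == "__main__":
    print(run_pipeline("Gue capek bgt, meeting tomorrow pls!"))
```

Language identification runs before normalization here, so the toy language lexicon has to contain the nonstandard spellings themselves; a real system would presumably use trained models rather than lookups for each step.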

Original language: English
Title of host publication: W-NUT@EMNLP 2019 - 5th Workshop on Noisy User-Generated Text, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 417-424
Number of pages: 8
ISBN (Electronic): 9781950737840
Publication status: Published - 2019
Event: 5th Workshop on Noisy User-Generated Text, W-NUT@EMNLP 2019 - Hong Kong, China
Duration: 4 Nov 2019 → …

Publication series

Name: W-NUT@EMNLP 2019 - 5th Workshop on Noisy User-Generated Text, Proceedings

Conference

Conference: 5th Workshop on Noisy User-Generated Text, W-NUT@EMNLP 2019
Country/Territory: China
City: Hong Kong
Period: 4/11/19 → …
