TY - GEN
T1 - Normalization of Indonesian-English code-mixed Twitter data
AU - Barik, Anab Maulana
AU - Mahendra, Rahmad
AU - Adriani, Mirna
N1 - Funding Information:
The authors acknowledge the support of Universitas Indonesia through Hibah PITTA B 2019 Pengolahan Teks dan Musik pada Sistem Community Question Answering dan Temporal Information Retrieval.
Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
N2 - Twitter is an excellent source of data for NLP research, as it offers a tremendous amount of textual data. However, processing tweets to extract meaningful information is very challenging, for at least two reasons: (i) the use of nonstandard words and informal writing, and (ii) code-mixing, i.e., the combination of multiple languages in a single tweet conversation. Most previous work has addressed the two issues in isolation, as separate tasks. In this study, we work on the normalization of code-mixed Twitter data, specifically Indonesian-English. We propose a pipeline that consists of four modules: tokenization, language identification, lexical normalization, and translation. Another contribution is a gold standard of Indonesian-English code-mixed data for each module.
UR - http://www.scopus.com/inward/record.url?scp=85095495571&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85095495571
T3 - W-NUT@EMNLP 2019 - 5th Workshop on Noisy User-Generated Text, Proceedings
SP - 417
EP - 424
BT - W-NUT@EMNLP 2019 - 5th Workshop on Noisy User-Generated Text, Proceedings
PB - Association for Computational Linguistics (ACL)
T2 - 5th Workshop on Noisy User-Generated Text, W-NUT@EMNLP 2019
Y2 - 4 November 2019
ER -