Statistical Machine Translation Approach for Lexical Normalization on Indonesian Text

Ajmal Kurnia, Evi Yulianti

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Citations (Scopus)

Abstract

Lexical normalization is an important task to perform on noisy data, such as social media posts, before using the data for further analysis. We examine the potential of Statistical Machine Translation (SMT) for normalizing Indonesian text, using translation units at both the phrase and character levels. We also use an external corpus to generate additional language model data, and pre-normalization rules, to enhance the SMT system. The results show that the SMT systems at both the phrase and character levels outperform various baselines in Word Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This research also demonstrates that using an external language model and applying pre-normalization rules can further enhance the effectiveness of SMT systems in normalizing Indonesian text.
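As a rough illustration of two ideas named in the abstract, the sketch below shows one common way to split text into character-level units and one standard way to compute WER (word-level edit distance divided by reference length). It is not taken from the paper: the function names, the `_` word-boundary marker, and the Indonesian example strings are all assumptions made for illustration, and the authors' actual preprocessing and scoring setup may differ.

```python
from typing import List, Sequence


def to_char_units(sentence: str, space_marker: str = "_") -> List[str]:
    """Split a sentence into character-level units.

    Spaces are replaced by a marker so word boundaries survive.
    The marker choice is an assumption, not the paper's convention.
    """
    return [space_marker if ch == " " else ch for ch in sentence.strip()]


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Invented example: "tdk" is an informal spelling of "tidak".
    print(to_char_units("gk tau"))
    print(word_error_rate("saya tidak tahu", "saya tdk tahu"))
```

Lower WER is better: a hypothesis identical to the reference scores 0, and each unnormalized word left in the output adds one substitution to the numerator.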

Original language: English
Title of host publication: 2020 International Conference on Asian Language Processing, IALP 2020
Editors: Yanfeng Lu, Minghui Dong, Lay-Ki Soon, Keng Hoon Gan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 288-293
Number of pages: 6
ISBN (Electronic): 9781728176895
Publication status: Published - 4 Dec 2020
Event: 2020 International Conference on Asian Language Processing, IALP 2020 - Kuala Lumpur, Malaysia
Duration: 4 Dec 2020 to 6 Dec 2020

Publication series

Name: 2020 International Conference on Asian Language Processing, IALP 2020

Conference

Conference: 2020 International Conference on Asian Language Processing, IALP 2020
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 4/12/20 to 6/12/20

Keywords

  • Indonesian
  • Lexical normalization
  • machine translation
  • social media
