Text Normalization on Code-Mixed Twitter Text using Language Detection

Rafi Dwi Rizqullah, Indra Budi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

The presence of code-mixed language is a challenge for NLP research focused on Twitter text normalization. One such challenge is normalizing text that contains words from more than one language. Recent text normalization methods still have language-related problems, either in identifying the language of a word or in normalizing it. This paper presents an approach to overcome those problems: a language detection module used alongside transformer models. A BERT-based tagger performs language detection, and two ByT5 models perform normalization. The results show that the proposed method has an ERR score 1.01 percent lower than the baseline.
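
The pipeline described in the abstract (detect the language first, then normalize with a language-specific model) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the BERT tagger and the two ByT5 models are replaced by toy lookup tables, routing is assumed to happen per token, and the language tags ("en", "id"), the example words, and all function names are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List

# Assumption: two languages tagged "en" and "id". The abstract does not
# name the languages involved; these tags are placeholders.
ENGLISH_SLANG = {"u", "thx"}


def detect_languages(tokens: List[str]) -> List[str]:
    """Stand-in for the BERT tagger: assign one language tag per token.

    A real tagger would predict tags from context; this toy version
    only checks membership in a small word list.
    """
    return ["en" if t.lower() in ENGLISH_SLANG else "id" for t in tokens]


def make_lookup_normalizer(table: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for one ByT5 normalizer: a dictionary lookup.

    Tokens missing from the table are returned unchanged.
    """
    def normalize(token: str) -> str:
        return table.get(token.lower(), token)
    return normalize


# One normalizer per language, mirroring the "two models" idea.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "en": make_lookup_normalizer({"u": "you", "thx": "thanks"}),
    "id": make_lookup_normalizer({"yg": "yang", "dgn": "dengan"}),
}


def normalize_tweet(text: str) -> str:
    """Tag each token with a language, then route it to that language's normalizer."""
    tokens = text.split()
    tags = detect_languages(tokens)
    return " ".join(NORMALIZERS[tag](tok) for tok, tag in zip(tokens, tags))


if __name__ == "__main__":
    print(normalize_tweet("thx yg dgn u"))
```

The point of the sketch is the routing step: the language tag decides which normalizer handles each token, so a detection error propagates directly into the normalization output.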

Original language: English
Title of host publication: 2022 7th International Conference on Informatics and Computing, ICIC 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350345711
Publication status: Published - 2022
Event: 7th International Conference on Informatics and Computing, ICIC 2022 - Virtual, Online, Indonesia
Duration: 8 Dec 2022 to 9 Dec 2022

Publication series

Name: 2022 7th International Conference on Informatics and Computing, ICIC 2022

Conference

Conference: 7th International Conference on Informatics and Computing, ICIC 2022
Country/Territory: Indonesia
City: Virtual, Online
Period: 8/12/22 to 9/12/22

Keywords

  • code-mixed language
  • language detection
  • multilingual
  • text normalization
  • twitter
