CANONICAL SEGMENTATION FOR JAVANESE-INDONESIAN NEURAL MACHINE TRANSLATION

Sri Hartati Wijono, Kurniawati Azizah, Wisnu Jatmiko

Research output: Contribution to journal › Article › peer-review

Abstract

Corpus-based Neural Machine Translation (NMT) has achieved remarkable results for many high-resource language pairs and has become widely used in recent years. However, it produces many out-of-vocabulary (OOV) words when trained on a low-resource parallel corpus, especially for agglutinative language pairs such as Javanese-Indonesian. This paper proposes canonical word segmentation and a linguistic feature tag to be incorporated into a Transformer-based NMT system for translating Javanese into Indonesian. The word segmentation increases the vocabulary frequency of rarely occurring affixed words, while the feature tag supports the learning process and the generation of translation output. The research is conducted in two stages. First, we explore several Javanese segmentation approaches using a Transformer-based encoder-decoder to find the best segmentation model. For Indonesian, we use MorphInd to segment the corpus. Second, we conduct NMT experiments in which the canonical segmentation and feature tags resulting from the first stage are used as inputs to the encoder and decoder. Our experiments show that the best canonical segmentation model is the one that uses character-level inputs concatenated with feature tags that include affixes and root words. It achieves an accuracy of 84.20% on all occurrences and 56.09% on canonical segmentation. It also reaches an F1 score of 92.78% for all words and 96.35% for canonical segmentation. In the NMT experiments, the results show that applying the proposed canonical segmentation and the affix/root-word feature tag to the NMT model improves translation performance. Our best model increases the BLEU score by 3.55 points compared to the baseline model that uses words as inputs, and by 2.57 BLEU points compared to the baseline model that uses BPE segmentation.
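
To make the idea of character-level inputs combined with affix/root feature tags more concrete, the following Python sketch shows one possible way such training pairs could be laid out. It is an illustration only, not the format used in the paper: the tag names (`PFX`, `ROOT`), the `|` joiner, the `+` boundary symbol, and the function names are all assumptions, since the abstract does not specify them. The example word is the Javanese verb "nulis" (to write), whose canonical form restores the nasal prefix N- and the root "tulis".

```python
from typing import List, Tuple

# Assumed morpheme-boundary symbol; the paper's actual symbol is not given.
BOUNDARY = "+"


def char_inputs(word: str) -> List[str]:
    """Split a surface word into characters for a character-level encoder."""
    return list(word)


def tagged_targets(morphemes: List[Tuple[str, str]]) -> List[str]:
    """Flatten (morpheme, tag) pairs into character tokens joined with their tag.

    The tag inventory and the "char|TAG" layout are hypothetical.
    """
    tokens: List[str] = []
    for i, (morpheme, tag) in enumerate(morphemes):
        if i > 0:
            tokens.append(BOUNDARY)  # mark the boundary between morphemes
        tokens.extend(f"{ch}|{tag}" for ch in morpheme)
    return tokens


# Surface form "nulis" and its canonical segmentation N- + tulis.
source = char_inputs("nulis")
target = tagged_targets([("N", "PFX"), ("tulis", "ROOT")])
print(source)
print(target)
```

Note that the canonical target is not a substring split of the surface word: the root "tulis" recovers the initial "t" that the nasal prefix replaces in "nulis", which is what distinguishes canonical from surface segmentation.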

Original language: English
Pages (from-to): 62-68
Number of pages: 7
Journal: Journal of Engineering Science and Technology
Volume: 18
Publication status: Published - Aug 2023

Keywords

  • Canonical segmentation
  • Javanese-Indonesian NMT
  • Linguistic feature tag
  • Neural machine translation
