TY - JOUR
T1 - Canonical Segmentation for Javanese-Indonesian Neural Machine Translation
AU - Wijono, Sri Hartati
AU - Azizah, Kurniawati
AU - Jatmiko, Wisnu
N1 - Publisher Copyright: © School of Engineering, Taylor’s University.
PY - 2023/8
Y1 - 2023/8
AB - Corpus-based Neural Machine Translation (NMT) has achieved remarkable results on many high-resource language pairs and has become widely used in recent years. However, it produces many out-of-vocabulary (OOV) words when trained on a low-resource parallel corpus, especially for agglutinative language pairs such as Javanese-Indonesian. This paper proposes incorporating canonical word segmentation and a linguistic feature tag into a Transformer-based NMT model for translating Javanese into Indonesian. The word segmentation increases the vocabulary frequency of rarely occurring affixed words, while the feature tag is intended to aid the learning process and the generation of translation output. The research is conducted in two stages. First, we explore several Javanese segmentation approaches using a Transformer-based encoder-decoder to find the best segmentation model. For Indonesian, we use MorphInd to segment the corpus. Second, we conduct NMT experiments using the canonical segmentation and feature tags obtained in the first stage as input to the encoder and decoder. Our experiments show that the best canonical segmentation model uses character-level inputs concatenated with feature tags that mark affixes and root words. It achieves an accuracy of 84.20% over all occurrences and 56.09% on canonical segmentation. It also reaches F1 scores of 92.78% and 96.35% for all words and canonical segmentation, respectively. In the NMT experiments, the proposed canonical segmentation and affix/root-word feature tags improve translation performance. Our best model improves the BLEU score by 3.55 points over a baseline model that uses words as inputs, and by 2.57 BLEU points over a baseline model that uses BPE segmentation.
KW - Canonical segmentation
KW - Javanese-Indonesian NMT
KW - Linguistic feature tag
KW - Neural machine translation
UR - http://www.scopus.com/inward/record.url?scp=85174687180&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85174687180
SN - 1823-4690
VL - 18
SP - 62
EP - 68
JO - Journal of Engineering Science and Technology
JF - Journal of Engineering Science and Technology
ER -