Canonical Segmentation Using Affix Characters as a Unit on Transformer for Javanese Language

Sri Hartati Wijono, Machmud R. Alhamidi, Muhammad Hafizhuddin Hilman, Wisnu Jatmiko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Morphological segmentation for agglutinative languages is the process of splitting a word into its stem and affixes. It is a necessary step in various NLP applications such as machine translation, question answering, and speech recognition. Several neural morphological segmentation studies have used a sequence of characters as input to an encoder-decoder. However, a character sequence alone cannot provide linguistic information. We propose treating affix characters as a unit to provide affix features to a Transformer encoder-decoder. We use a Javanese word corpus that consists of affixed, canonical affixed, and non-affixed words. For affixed words, our proposed method obtains 11.2 times higher point of accuracy than the Sequence of Characters baseline. For canonical affixed words, we get 21.9 times higher point of accuracy than the baseline method. The results also show that using a different affix symbol ('%%', '##', or '@@') for each type of affix improves accuracy in affix recognition.
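
The contrast between the two input representations can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which symbol belongs to which affix type, where the symbol is attached, or how the model vocabulary is built. The mapping in `AFFIX_MARKERS`, the function names, and the example word are all assumptions made for illustration.

```python
from typing import List, Tuple

# ASSUMPTION: the abstract names the symbols '%%', '##', '@@' but not which
# affix type each one marks; this mapping is hypothetical.
AFFIX_MARKERS = {"prefix": "%%", "infix": "##", "suffix": "@@"}


def char_sequence(word: str) -> List[str]:
    """Baseline representation: every character is its own token."""
    return list(word)


def affix_unit_sequence(morphemes: List[Tuple[str, str]]) -> List[str]:
    """Affix-as-unit representation (hypothetical format).

    Stem characters stay as single-character tokens, while each affix
    becomes one token carrying the marker for its affix type.
    """
    tokens: List[str] = []
    for kind, text in morphemes:
        if kind == "stem":
            tokens.extend(text)  # one token per stem character
        else:
            # ASSUMPTION: the marker is placed in front of the affix string.
            tokens.append(AFFIX_MARKERS[kind] + text)
    return tokens


# Illustrative segmentation: di- (prefix) + tulis (stem) + -ake (suffix)
example = [("prefix", "di"), ("stem", "tulis"), ("suffix", "ake")]
print(char_sequence("ditulisake"))
print(affix_unit_sequence(example))
```

In this sketch the character baseline spends ten tokens on the word, while the affix-unit form uses seven and makes the affix boundaries and types explicit in the token sequence.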

Original language: English
Title of host publication: Proceedings - IWBIS 2021
Subtitle of host publication: 6th International Workshop on Big Data and Information Security
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 67-72
Number of pages: 6
ISBN (Electronic): 9781665424516
DOIs
Publication status: Published - 2021
Event: 6th International Workshop on Big Data and Information Security, IWBIS 2021 - Virtual, Online, Indonesia
Duration: 23 Oct 2021 - 26 Oct 2021

Publication series

Name: Proceedings - IWBIS 2021: 6th International Workshop on Big Data and Information Security

Conference

Conference: 6th International Workshop on Big Data and Information Security, IWBIS 2021
Country/Territory: Indonesia
City: Virtual, Online
Period: 23/10/21 - 26/10/21

Keywords

  • affix characters as a unit
  • canonical segmentation
  • Javanese language
  • Transformer
