TY - GEN
T1 - Canonical Segmentation Using Affix Characters as a Unit on Transformer for Javanese Language
AU - Wijono, Sri Hartati
AU - Alhamidi, Machmud R.
AU - Hilman, Muhammad Hafizhuddin
AU - Jatmiko, Wisnu
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Morphological segmentation for agglutinative languages is the process of splitting words into stems and affixes. It is a necessary step in various NLP applications such as machine translation, question answering, and speech recognition. Several neural morphological segmentation studies have used a sequence of characters as input to an encoder-decoder. However, this input cannot provide linguistic information. We propose affix characters as a unit to provide affix features to a Transformer encoder-decoder. We use a Javanese word corpus that consists of affixed, canonical affixed, and non-affixed words. For affixed words, our proposed method obtains 11.2 times higher point of accuracy than the Sequence of Characters. For canonical affixed words, we get 21.9 times higher point of accuracy than the baseline method. The results also show that using a different affix symbol ('%%', '##', or '@@') for each type of affix improves accuracy in affix recognition.
KW - affix characters as a unit
KW - canonical segmentation
KW - Javanese language
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85124355458&partnerID=8YFLogxK
U2 - 10.1109/IWBIS53353.2021.9631839
DO - 10.1109/IWBIS53353.2021.9631839
M3 - Conference contribution
AN - SCOPUS:85124355458
T3 - Proceedings - IWBIS 2021: 6th International Workshop on Big Data and Information Security
SP - 67
EP - 72
BT - Proceedings - IWBIS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Workshop on Big Data and Information Security, IWBIS 2021
Y2 - 23 October 2021 through 26 October 2021
ER -