TY - GEN
T1 - Building Morphological Analyzer for Informal Text in Indonesian
AU - Krisna Dwitama, I. Made
AU - Al Farisi, Muhammad Salman
AU - Alfina, Ika
AU - Dinakaramani, Arawinda
N1 - Funding Information:
This work was supported by the Faculty of Computer Science, Universitas Indonesia. We thank Badan Pengembangan dan
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Informal text is heavily used by Indonesians on social media. However, NLP tools that can process such text are still very limited. In this work, we built a morphological analyzer for informal Indonesian text by adding new rules for informal words to an existing Indonesian morphological analyzer named Aksara. We also enriched the Aksara lexicon with informal words. The tool performs tokenization, lemmatization, and part-of-speech (POS) tagging. Aksara is rule-based and uses a finite-state transducer compiled with Foma. To evaluate the tool, we created a gold standard of 102 sentences with 1434 tokens, of which around 30% are informal. The test results show that our tool achieves a tokenization accuracy of 97.21%, a case-insensitive lemmatization accuracy of 90.37%, and a POS tagging F1-score of 80%.
AB - Informal text is heavily used by Indonesians on social media. However, NLP tools that can process such text are still very limited. In this work, we built a morphological analyzer for informal Indonesian text by adding new rules for informal words to an existing Indonesian morphological analyzer named Aksara. We also enriched the Aksara lexicon with informal words. The tool performs tokenization, lemmatization, and part-of-speech (POS) tagging. Aksara is rule-based and uses a finite-state transducer compiled with Foma. To evaluate the tool, we created a gold standard of 102 sentences with 1434 tokens, of which around 30% are informal. The test results show that our tool achieves a tokenization accuracy of 97.21%, a case-insensitive lemmatization accuracy of 90.37%, and a POS tagging F1-score of 80%.
KW - finite-state transducer
KW - informal text
KW - lemmatization
KW - morphological analyzer
KW - POS tagging
KW - tokenization
UR - http://www.scopus.com/inward/record.url?scp=85142100684&partnerID=8YFLogxK
U2 - 10.1109/ICACSIS56558.2022.9923494
DO - 10.1109/ICACSIS56558.2022.9923494
M3 - Conference contribution
AN - SCOPUS:85142100684
T3 - Proceedings - ICACSIS 2022: 14th International Conference on Advanced Computer Science and Information Systems
SP - 199
EP - 204
BT - Proceedings - ICACSIS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2022
Y2 - 1 October 2022 through 3 October 2022
ER -