TY - GEN
T1 - InaNLP
T2 - 4th IGNITE Conference and 2016 International Conference on Advanced Informatics: Concepts, Theory and Application, ICAICTA 2016
AU - Purwarianti, Ayu
AU - Andhika, Alvin
AU - Wicaksono, Alfan Farizki
AU - Afif, Irfan
AU - Ferdian, Filman
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/12/30
Y1 - 2016/12/30
N2 - This paper describes the development of InaNLP, a natural language processing (NLP) toolkit for Indonesian formal text and social media text. Several NLP modules were integrated into InaNLP to make it easier to build NLP systems for the Indonesian language. The toolkit contains modules such as a sentence splitter, tokenizer, part-of-speech (POS) tagger, phrase chunker, named entity (NE) tagger, syntactic parser, semantic analyzer, and word normalizer. Some modules were built using a rule-based approach, whereas others use a statistical approach. The accuracy of several modules (the POS tagger, NE tagger, syntactic parser, and semantic analyzer) is reported. The NE tagger uses a five-word window with POS, orthography, and word-list features. In the NE tagger feature-evaluation experiment, using the SMO algorithm and 1,500 sentences with 15 NE classes, a token classification accuracy of 93.419% was achieved, which outperforms related work. The POS tagger, trained on 12,000 tokens and tested on 3,000 tokens, achieved an accuracy of 96.50%. The syntactic parser, using the CYK algorithm with 100 training sentences and 36 test sentences, achieved an accuracy of 47.22%. The semantic analyzer, using 200 sentences as training data, achieved an accuracy of 62.50%. The paper also presents an example of building an Indonesian NLP system with InaNLP for complaint tweet classification. In that experiment, using 7,440 data instances, the system achieved an average F-measure of 0.892.
AB - This paper describes the development of InaNLP, a natural language processing (NLP) toolkit for Indonesian formal text and social media text. Several NLP modules were integrated into InaNLP to make it easier to build NLP systems for the Indonesian language. The toolkit contains modules such as a sentence splitter, tokenizer, part-of-speech (POS) tagger, phrase chunker, named entity (NE) tagger, syntactic parser, semantic analyzer, and word normalizer. Some modules were built using a rule-based approach, whereas others use a statistical approach. The accuracy of several modules (the POS tagger, NE tagger, syntactic parser, and semantic analyzer) is reported. The NE tagger uses a five-word window with POS, orthography, and word-list features. In the NE tagger feature-evaluation experiment, using the SMO algorithm and 1,500 sentences with 15 NE classes, a token classification accuracy of 93.419% was achieved, which outperforms related work. The POS tagger, trained on 12,000 tokens and tested on 3,000 tokens, achieved an accuracy of 96.50%. The syntactic parser, using the CYK algorithm with 100 training sentences and 36 test sentences, achieved an accuracy of 47.22%. The semantic analyzer, using 200 sentences as training data, achieved an accuracy of 62.50%. The paper also presents an example of building an Indonesian NLP system with InaNLP for complaint tweet classification. In that experiment, using 7,440 data instances, the system achieved an average F-measure of 0.892.
KW - InaNLP
KW - Indonesia language
KW - natural language processing toolkit
UR - http://www.scopus.com/inward/record.url?scp=85011304724&partnerID=8YFLogxK
U2 - 10.1109/ICAICTA.2016.7803103
DO - 10.1109/ICAICTA.2016.7803103
M3 - Conference contribution
AN - SCOPUS:85011304724
T3 - 4th IGNITE Conference and 2016 International Conference on Advanced Informatics: Concepts, Theory and Application, ICAICTA 2016
BT - 4th IGNITE Conference and 2016 International Conference on Advanced Informatics
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 August 2016 through 19 August 2016
ER -