InaNLP: Indonesia natural language processing toolkit, case study: Complaint tweet classification

Ayu Purwarianti, Alvin Andhika, Alfan Farizki Wicaksono, Irfan Afif, Filman Ferdian

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

35 Citations (Scopus)

Abstract

This research describes the development of InaNLP, a natural language processing (NLP) toolkit for Indonesian formal text and social media text. InaNLP integrates several NLP modules to make it easier to build NLP systems for the Indonesian language. The toolkit contains modules such as a sentence splitter, tokenizer, part-of-speech (POS) tagger, phrase chunker, named entity (NE) tagger, syntactic parser, semantic analyzer, and word normalizer. Some modules were built using a rule-based approach, whereas others use a statistical approach. The accuracy of the POS tagger, NE tagger, syntactic parser, and semantic analyzer is reported. The NE tagger uses a five-word window with POS, orthography, and word list features. In the NE tagger feature evaluation experiment, using the SMO algorithm and 1,500 sentences with 15 NE classes, a token classification accuracy of 93.419% was achieved, which outperforms related work. The POS tagger achieved an accuracy of 96.50% using 12,000 tokens as training data and 3,000 tokens as testing data. The syntactic parser, using the CYK algorithm with 100 sentences as training data and 36 sentences as testing data, achieved an accuracy of 47.22%. The semantic analyzer, using 200 sentences as training data, achieved an accuracy of 62.50%. This research also presents an example of building an Indonesian NLP system with InaNLP for complaint tweet classification. In the complaint classification experiment, using 7,440 data items, an average F-measure of 0.892 was achieved.
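
The five-word window used by the NE tagger can be illustrated with a minimal Python sketch. This is not InaNLP code: the abstract does not specify how the features are encoded, so the function names (`orthography`, `window_features`), the orthographic categories, the padding symbol, and the feature keys below are all assumptions made for illustration. The sketch only shows the general idea of collecting POS, orthography, and word list features for a token and its two neighbours on each side, which could then be passed to a classifier such as SMO.

```python
from typing import Dict, List, Set

PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def orthography(token: str) -> str:
    """Return a coarse orthographic class (hypothetical categories)."""
    if token == PAD:
        return "pad"
    if token.isdigit():
        return "digit"
    if token.isupper():
        return "all_caps"
    if token[:1].isupper():
        return "init_cap"
    if token.islower():
        return "lower"
    return "other"


def window_features(
    tokens: List[str],
    pos_tags: List[str],
    index: int,
    word_list: Set[str],
    size: int = 5,
) -> Dict[str, str]:
    """Collect POS, orthography, and word list features for the token at
    `index` and its neighbours inside a window of `size` words."""
    half = size // 2
    feats: Dict[str, str] = {}
    for offset in range(-half, half + 1):
        j = index + offset
        inside = 0 <= j < len(tokens)
        tok = tokens[j] if inside else PAD
        pos = pos_tags[j] if inside else PAD
        feats[f"pos[{offset}]"] = pos
        feats[f"orth[{offset}]"] = orthography(tok)
        # Word list lookup is case-insensitive in this sketch (an assumption).
        feats[f"in_list[{offset}]"] = str(tok.lower() in word_list)
    return feats


if __name__ == "__main__":
    # Illustrative sentence, tags, and word list; not taken from the paper.
    tokens = ["Presiden", "Joko", "Widodo", "tiba", "di", "Bandung"]
    pos_tags = ["NN", "NNP", "NNP", "VB", "IN", "NNP"]
    word_list = {"presiden", "bandung"}
    for key, value in window_features(tokens, pos_tags, 1, word_list).items():
        print(key, value)
```

Each token yields 15 features in this sketch (three feature types over five positions); positions that fall outside the sentence are filled with the padding symbol so that every token has a feature set of the same shape.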

Original language: English
Title of host publication: 4th IGNITE Conference and 2016 International Conference on Advanced Informatics
Subtitle of host publication: Concepts, Theory and Application, ICAICTA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509016365
Publication status: Published - 30 Dec 2016
Event: 4th IGNITE Conference and 2016 International Conference on Advanced Informatics: Concepts, Theory and Application, ICAICTA 2016 - Penang, Malaysia
Duration: 16 Aug 2016 to 19 Aug 2016

Publication series

Name: 4th IGNITE Conference and 2016 International Conference on Advanced Informatics: Concepts, Theory and Application, ICAICTA 2016

Conference

Conference: 4th IGNITE Conference and 2016 International Conference on Advanced Informatics: Concepts, Theory and Application, ICAICTA 2016
Country/Territory: Malaysia
City: Penang
Period: 16/08/16 to 19/08/16

Keywords

  • InaNLP
  • Indonesia language
  • natural language processing toolkit
