Automatic open domain information extraction from Indonesian text

Yohanes Gultom, Wahyu Catur Wibowo

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

The volume of available digital documents has surpassed human processing capability, which calls for a method that automatically extracts information from any text document regardless of its domain. Unfortunately, open domain information extraction (open IE) systems are language-specific, and no system has been published for the Indonesian language. This paper introduces a system that extracts entity relations from Indonesian text in triple format using an NLP pipeline, a rule-based candidate generator, a rule-based token expander, and a machine-learning-based triple selector. We cross-validate four candidate classifiers (logistic regression, SVM, MLP, and Random Forest) on our dataset and find that Random Forest is the best classifier for the triple selector, achieving an F1 score of 0.60 (0.62 precision and 0.58 recall). The low score is largely due to the simplistic candidate generation rules and the limited coverage of the dataset.
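As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall. The sketch below is illustrative only and is not taken from the paper: the function name `f1_from_pr` is hypothetical, and the only inputs are the precision and recall values quoted in the abstract.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the F1 score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the Random Forest selector.
f1 = f1_from_pr(0.62, 0.58)
print(f"{f1:.2f}")  # prints 0.60
```

The result rounds to 0.60, which agrees with the F1 score stated in the abstract.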

Original language: English
Title of host publication: Proceedings - WBIS 2017
Subtitle of host publication: 2017 International Workshop on Big Data and Information Security
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 23-30
Number of pages: 8
ISBN (Electronic): 9781538620380
Publication status: Published - 29 Jan 2018
Event: 2017 International Workshop on Big Data and Information Security, WBIS 2017 - Jakarta, Indonesia
Duration: 23 Sep 2017 - 24 Sep 2017

Publication series

Name: Proceedings - WBIS 2017: 2017 International Workshop on Big Data and Information Security
Volume: 2018-January

Conference

Conference: 2017 International Workshop on Big Data and Information Security, WBIS 2017
Country/Territory: Indonesia
City: Jakarta
Period: 23/09/17 - 24/09/17
