Converting an Indonesian Constituency Treebank to the Penn Treebank Format

Jessica Naraiswari Arwidarasti, Ika Alfina, Adila Alfa Krisnadhi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Citations (Scopus)

Abstract

A constituency treebank is a key component for deep syntactic parsing of natural language sentences. For Indonesian, this task is hindered by the fact that the only publicly available constituency treebank is rather small, with just over 1000 sentences, and uses a format incompatible with readily available constituency treebank processing tools. In this work, we present a conversion of the existing Indonesian constituency treebank to the widely accepted Penn Treebank format. Specifically, the conversion adjusts the bracketing format for compound words, as well as the POS tagset, to follow the Penn Treebank format. In addition, we revised the word segmentation and POS tagging of a number of tokens. Finally, we evaluated the quality of the treebank by training a parser model with the Shift-Reduce parser from Stanford CoreNLP. A 10-fold cross-validation experiment on the parser model yields an F1-score of 70.90%.
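To make the reported metric concrete, the following is a minimal sketch of how a labeled-bracketing F1 score (PARSEVAL-style) can be computed from two trees written in Penn Treebank bracket notation. It is not the authors' evaluation code and not the Stanford CoreNLP implementation; the function names `parse_ptb`, `labeled_spans`, and `bracket_f1` are hypothetical, and the three-word example sentence and its tags are made up for illustration. Real evaluators may apply further conventions (for example, ignoring punctuation) that this sketch omits.

```python
from collections import Counter


def parse_ptb(text):
    """Parse one Penn Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def labeled_spans(tree):
    """Collect (label, start, end) for every phrase-level node.

    POS-level nodes (those whose children are all words) are skipped,
    which is an assumption of this sketch.
    """
    spans = Counter()

    def walk(node, start):
        label, children = node
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        if not all(isinstance(child, str) for child in children):
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(gold, predicted):
    """Harmonic mean of labeled-bracket precision and recall."""
    g, p = labeled_spans(gold), labeled_spans(predicted)
    matched = sum((g & p).values())  # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the predicted tree attaches the object NP too high.
gold = parse_ptb("(S (NP (NNP Budi)) (VP (VBD membeli) (NP (NN buku))))")
pred = parse_ptb("(S (NP (NNP Budi)) (VP (VBD membeli)) (NP (NN buku)))")
score = bracket_f1(gold, pred)
```

In this toy case three of the four brackets in each tree match, so precision, recall, and F1 all come out at 0.75. In a cross-validation setting, a score of this kind would be computed on each held-out fold and then aggregated.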

Original language: English
Title of host publication: Proceedings of the 2019 International Conference on Asian Language Processing, IALP 2019
Editors: Man Lan, Yuanbin Wu, Minghui Dong, Yanfeng Lu, Yan Yang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 331-336
Number of pages: 6
ISBN (Electronic): 9781728150147
Publication status: Published - Nov 2019
Event: 23rd International Conference on Asian Language Processing, IALP 2019 - Shanghai, China
Duration: 15 Nov 2019 to 17 Nov 2019

Publication series

Name: Proceedings of the 2019 International Conference on Asian Language Processing, IALP 2019

Conference

Conference: 23rd International Conference on Asian Language Processing, IALP 2019
Country/Territory: China
City: Shanghai
Period: 15/11/19 to 17/11/19

Keywords

  • constituency parsing
  • Indonesian
  • Penn Treebank
  • Stanford parser
  • treebank format
