TY - GEN
T1 - Converting an Indonesian Constituency Treebank to the Penn Treebank Format
AU - Arwidarasti, Jessica Naraiswari
AU - Alfina, Ika
AU - Krisnadhi, Adila Alfa
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - A constituency treebank is a key component for deep syntactic parsing of natural language sentences. For Indonesian, this task is hindered because the only publicly available constituency treebank is rather small, with just over 1000 sentences, and employs a format incompatible with readily available constituency treebank processing tools. In this work, we present a conversion of the existing Indonesian constituency treebank to the widely accepted Penn Treebank format. Specifically, the conversion adjusts the bracketing format for compound words as well as the POS tagset according to the Penn Treebank format. In addition, we revised the word segmentation and POS tagging of a number of tokens. Finally, we evaluated the quality of the treebank by employing the Shift-Reduce parser from Stanford CoreNLP to create a parser model. A 10-fold cross-validation experiment on the parser model yields an F1-score of 70.90%.
AB - A constituency treebank is a key component for deep syntactic parsing of natural language sentences. For Indonesian, this task is hindered because the only publicly available constituency treebank is rather small, with just over 1000 sentences, and employs a format incompatible with readily available constituency treebank processing tools. In this work, we present a conversion of the existing Indonesian constituency treebank to the widely accepted Penn Treebank format. Specifically, the conversion adjusts the bracketing format for compound words as well as the POS tagset according to the Penn Treebank format. In addition, we revised the word segmentation and POS tagging of a number of tokens. Finally, we evaluated the quality of the treebank by employing the Shift-Reduce parser from Stanford CoreNLP to create a parser model. A 10-fold cross-validation experiment on the parser model yields an F1-score of 70.90%.
KW - constituency parsing
KW - Indonesian
KW - Penn Treebank
KW - Stanford parser
KW - treebank format
UR - http://www.scopus.com/inward/record.url?scp=85083275738&partnerID=8YFLogxK
U2 - 10.1109/IALP48816.2019.9037723
DO - 10.1109/IALP48816.2019.9037723
M3 - Conference contribution
AN - SCOPUS:85083275738
T3 - Proceedings of the 2019 International Conference on Asian Language Processing, IALP 2019
SP - 331
EP - 336
BT - Proceedings of the 2019 International Conference on Asian Language Processing, IALP 2019
A2 - Lan, Man
A2 - Wu, Yuanbin
A2 - Dong, Minghui
A2 - Lu, Yanfeng
A2 - Yang, Yan
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 23rd International Conference on Asian Language Processing, IALP 2019
Y2 - 15 November 2019 through 17 November 2019
ER -