TY - JOUR
T1 - A Gold Standard Dataset for Javanese Tokenization, POS Tagging, Morphological Feature Tagging, and Dependency Parsing
AU - Alfina, Ika
AU - Yuliawati, Arlisa
AU - Tanaya, Dipta
AU - Dinakaramani, Arawinda
AU - Zeman, Daniel
N1 - Publisher Copyright: © 2024 by the author(s).
PY - 2024/11
Y1 - 2024/11
AB - Javanese, a regional language of Indonesia with more than 68 million speakers, is a low-resource language in Natural Language Processing (NLP) because it lacks sufficient language resources, both datasets and NLP tools. In this work, we developed a gold standard dataset of 1,000 sentences and 14,323 words for Javanese covering four NLP tasks: tokenization, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing. The dataset is in the CoNLL-U format and conforms to the Universal Dependencies (UD) annotation guidelines. Native Javanese speakers served as the annotators. The sentences were taken from grammar books, Wikipedia, and online newspapers. To evaluate the dataset's quality, we built models for tokenization, POS tagging, morphological feature tagging, and dependency parsing using UDPipe and evaluated them with 10-fold cross-validation. For tokenization, our model achieves F1 scores of 99.53%, 72.01%, 97.11%, and 95.90% for segmenting tokens, multiword tokens (MWT), syntactic words, and sentences, respectively. For tagging from gold tokenization, the model achieves F1 scores of 87.22% for POS tagging and 86.66% for morphological feature tagging. Finally, dependency parsing from gold tokenization with gold tags yields an Unlabeled Attachment Score (UAS) of 77.08% and a Labeled Attachment Score (LAS) of 71.21%.
KW - Annotation Guidelines
KW - Dependency Parsing
KW - Low-Resource Language
KW - Morphological Feature Tagging
KW - POS Tagging
KW - Tokenization
KW - Universal Dependencies
UR - http://www.scopus.com/inward/record.url?scp=85209761502&partnerID=8YFLogxK
U2 - 10.30564/fls.v6i5.6957
DO - 10.30564/fls.v6i5.6957
M3 - Article
AN - SCOPUS:85209761502
SN - 2705-0610
VL - 6
SP - 131
EP - 148
JO - Forum for Linguistic Studies
JF - Forum for Linguistic Studies
IS - 5
ER -