A Gold Standard Dataset for Javanese Tokenization, POS Tagging, Morphological Feature Tagging, and Dependency Parsing

Ika Alfina, Arlisa Yuliawati, Dipta Tanaya, Arawinda Dinakaramani, Daniel Zeman

Research output: Contribution to journal › Article › peer-review

Abstract

Javanese, a regional language of Indonesia with more than 68 million speakers, is a low-resource language in Natural Language Processing (NLP) because it lacks language resources, both datasets and NLP tools. In this work, we developed a gold standard dataset of 1,000 sentences and 14,323 words for four Javanese NLP tasks: tokenization, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing. The dataset is in the CoNLL-U format and conforms to the Universal Dependencies (UD) annotation guidelines. Native Javanese speakers served as annotators. The sentences were taken from grammar books, Wikipedia, and online newspapers. To evaluate the dataset's quality, we built models for all four tasks using UDPipe and evaluated them with 10-fold cross-validation. For tokenization, the model achieved F1 scores of 99.53%, 72.01%, 97.11%, and 95.90% for segmenting tokens, multiword tokens (MWT), syntactic words, and sentences, respectively. For tagging from gold tokenization, the model achieved F1 scores of 87.22% for POS tagging and 86.66% for morphological feature tagging. Finally, for dependency parsing from gold tokenization with gold tags, the model achieved an Unlabeled Attachment Score (UAS) of 77.08% and a Labeled Attachment Score (LAS) of 71.21%.
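As an illustration of the two parsing metrics named in the abstract, the following minimal Python sketch computes UAS and LAS from a gold and a predicted CoNLL-U file that share the same (gold) tokenization. It is not the evaluation script used by the authors. The function names `read_conllu` and `attachment_scores` and the toy English sentence are hypothetical, and the sketch assumes that full dependency labels are compared, including any subtype.

```python
from typing import List, Tuple

Sentence = List[Tuple[int, str]]  # one (HEAD, DEPREL) pair per syntactic word


def read_conllu(text: str) -> List[Sentence]:
    """Collect (HEAD, DEPREL) pairs per sentence from CoNLL-U text."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1");
        # only syntactic words carry HEAD and DEPREL.
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold: List[Sentence], pred: List[Sentence]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions; assumes identical tokenization."""
    total = correct_head = correct_both = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1          # counts toward UAS
                if g_rel == p_rel:
                    correct_both += 1      # counts toward LAS
    return correct_head / total, correct_both / total


def row(idx: str, form: str, head: str, deprel: str) -> str:
    """Build a 10-column CoNLL-U line with unused columns set to '_'."""
    return "\t".join([idx, form, "_", "_", "_", "_", head, deprel, "_", "_"])


# Hypothetical toy sentence, not taken from the dataset.
GOLD = "\n".join([
    "# text = She reads books .",
    row("1", "She", "2", "nsubj"),
    row("2", "reads", "0", "root"),
    row("3", "books", "2", "obj"),
    row("4", ".", "2", "punct"),
    "",
])
PRED = "\n".join([
    row("1", "She", "2", "nsubj"),
    row("2", "reads", "0", "root"),
    row("3", "books", "2", "nmod"),   # right head, wrong label
    row("4", ".", "3", "punct"),      # wrong head
    "",
])

uas, las = attachment_scores(read_conllu(GOLD), read_conllu(PRED))
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

In this toy case three of four words receive the correct head and two of those also receive the correct label, so LAS can never exceed UAS, which is consistent with the ordering of the scores reported above.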

Original language: English
Pages (from-to): 131-148
Number of pages: 18
Journal: Forum for Linguistic Studies
Volume: 6
Issue number: 5
Publication status: Published - Nov 2024

Keywords

  • Annotation Guidelines
  • Dependency Parsing
  • Low-Resource Language
  • Morphological Feature Tagging
  • POS Tagging
  • Tokenization
  • Universal Dependencies
