StatMetaQA: A dataset for closed domain question answering in Indonesian statistical metadata

Nur Rachmawati, Evi Yulianti

Research output: Contribution to journal, Article, peer-review

Abstract

A closed-domain question answering (QA) dataset for statistical metadata is important for building an effective QA system about statistics. Such a dataset can be used to train or fine-tune QA models in the statistical domain, and to evaluate the effectiveness of QA methods in that domain. In this research, we build a new Indonesian-language dataset, called StatMetaQA (Statistical Metadata Question Answering), consisting of statistical metadata documents and question-answer pair annotations of these documents. The collection of statistical metadata documents serves as the knowledge base of a QA system, while the collection of question-answer pair annotations is used to train or fine-tune QA models in the statistical domain. The document collection, consisting of 861 statistical activity metadata documents and 1,231 statistical indicator metadata documents, was obtained from a website managed by Statistics Indonesia (http://sirusa.bps.go.id). The collection of question-answer pairs about statistical metadata, consisting of 28,863 question-answer pairs from 1,000 statistical metadata documents, was obtained using two strategies: human and automatic annotation. Of these, 7,353 question-answer pairs were manually annotated by humans, and 21,510 were automatically generated using our predefined templates, which were applied to some document fields of the statistical metadata.
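The automatic annotation strategy described above, applying predefined templates to document fields, can be illustrated with a minimal sketch. The abstract does not list the actual templates or field names, so everything here is an assumption made for illustration: the field names (`title`, `definition`, `unit`, `organizer`), the English question patterns (the real dataset is in Indonesian), and the function name `generate_qa_pairs` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical templates mapping a metadata field to a question pattern.
# The real StatMetaQA templates and field names are not given in the abstract.
TEMPLATES: Dict[str, str] = {
    "definition": "What is the definition of {title}?",
    "unit": "What is the unit of measurement of {title}?",
    "organizer": "Who organizes {title}?",
}


def generate_qa_pairs(doc: Dict[str, str]) -> List[Tuple[str, str]]:
    """Apply each template to the matching non-empty field of one document."""
    pairs: List[Tuple[str, str]] = []
    for field, pattern in TEMPLATES.items():
        answer = doc.get(field, "").strip()
        if answer:  # skip fields that the document leaves empty or omits
            question = pattern.format(title=doc["title"])
            pairs.append((question, answer))
    return pairs
```

Under these assumptions, a document with only some fields filled in yields one question-answer pair per filled field, which is one plausible reason a template-based approach can produce many more pairs than manual annotation.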

Original language: English
Article number: 110816
Journal: Data in Brief
Volume: 57
Publication status: Published - Dec 2024

Keywords

  • Dataset
  • Indonesia
  • Question answering
  • Statistical metadata
