TY - JOUR
T1 - StatMetaQA
T2 - A dataset for closed domain question answering in Indonesian statistical metadata
AU - Rachmawati, Nur
AU - Yulianti, Evi
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/12
Y1 - 2024/12
AB - A closed-domain question answering (QA) dataset on statistical metadata is important for building an effective QA system about statistics. Such a dataset can be used to train or fine-tune QA models in the statistical domain and to evaluate the effectiveness of QA methods in that domain. In this research, we build a new Indonesian-language dataset, called StatMetaQA (Statistical Metadata Question Answering), consisting of statistical metadata documents and question-answer pair annotations of these documents. The collection of statistical metadata documents serves as the knowledge base of a QA system, while the collection of question-answer pairs is used to train or fine-tune QA models. The document collection, consisting of 861 statistical activity metadata documents and 1,231 statistical indicator metadata documents, was obtained from a website managed by Statistics Indonesia (http://sirusa.bps.go.id). The collection of question-answer pairs, consisting of 28,863 pairs from 1,000 statistical metadata documents, was obtained using two strategies: human and automatic annotation. Of these, 7,353 question-answer pairs were manually annotated by humans, and 21,510 were automatically generated using our predefined templates, which were applied to some document fields of the statistical metadata.
KW - Dataset
KW - Indonesia
KW - Question answering
KW - Statistical metadata
UR - http://www.scopus.com/inward/record.url?scp=85203518207&partnerID=8YFLogxK
U2 - 10.1016/j.dib.2024.110816
DO - 10.1016/j.dib.2024.110816
M3 - Article
AN - SCOPUS:85203518207
SN - 2352-3409
VL - 57
JO - Data in Brief
JF - Data in Brief
M1 - 110816
ER -