TY - JOUR
T1 - Twitter dataset on public sentiments towards biodiversity policy in Indonesia
AU - Uliniansyah, Mohammad Teduh
AU - Budi, Indra
AU - Nurfadhilah, Elvira
AU - Afra, Dian Isnaeni Nurul
AU - Santosa, Agung
AU - Latief, Andi Djalal
AU - Jarin, Asril
AU - Gunarso,
AU - Jiwanggi, Meganingrum Arista
AU - Hidayati, Nuraisa Novia
AU - Fajri, Radhiyatul
AU - Suryono, Ryan Randy
AU - Pebiana, Siska
AU - Shaleha, Siti
AU - Ramdhani, Tosan Wiar
AU - Sampurno, Tri
N1 - Publisher Copyright:
© 2023 The Author(s)
PY - 2024/2
Y1 - 2024/2
N2 - In recent years, biodiversity has emerged as a prominent and pressing topic due to the urgent need to address biodiversity loss and the recognition of its connections to climate change and sustainable development. Additionally, increased public awareness and the consideration of economic factors have further underscored the significance of biodiversity conservation. To investigate the sentiment of the Indonesian people towards biodiversity, we conducted a comprehensive data collection on Twitter, focusing on keywords we have set. We amassed a substantial dataset of 500,000 Indonesian tweets from January 2020 to March 2023. These tweets encompassed a wide range of discussions on biodiversity, including its subdomains such as food security, health, and environmental management. Three annotators labeled each tweet with a sentiment class (positive, negative, neutral), or label none for unrelated tweet. The final label was determined using the majority voting method. The tweets with the final label none and those with undecided sentiment class were considered invalid and excluded in the subsequent process. Before labeling, a team of 18 experts jointly developed a labeling guide. This document served as a reference in labeling. After going through a series of processes, including cleaning (removing duplications, irrelevant tweets, and tweets written other than in Indonesian) and preprocessing, we prepared a dataset containing 13,435 tweets. We measured the inter-annotator agreement level, made several models using different algorithms and the K-Fold cross-validation method, and evaluated the models. The Fleiss' Kappa value of the dataset was 0.62187 as the value of the inter-annotator agreement level, and the F1-score value with the best model using the pre-trained IndoBERT model was 0.7959. The Fleiss' Kappa and F1-score values suggest that the annotators have a substantial comprehension and agreement of how to label a tweet, thus ensuring consistency and reliability of our dataset, and the reusability of our dataset is quite suitable for further research on sentiment analysis on biodiversity, respectively. This dataset will benefit various research, including topic modeling, sentiment analysis, public opinion analysis on Twitter, etc., especially biodiversity-related policies.
AB - In recent years, biodiversity has emerged as a prominent and pressing topic due to the urgent need to address biodiversity loss and the recognition of its connections to climate change and sustainable development. Additionally, increased public awareness and the consideration of economic factors have further underscored the significance of biodiversity conservation. To investigate the sentiment of the Indonesian people towards biodiversity, we conducted a comprehensive data collection on Twitter, focusing on keywords we have set. We amassed a substantial dataset of 500,000 Indonesian tweets from January 2020 to March 2023. These tweets encompassed a wide range of discussions on biodiversity, including its subdomains such as food security, health, and environmental management. Three annotators labeled each tweet with a sentiment class (positive, negative, neutral), or label none for unrelated tweet. The final label was determined using the majority voting method. The tweets with the final label none and those with undecided sentiment class were considered invalid and excluded in the subsequent process. Before labeling, a team of 18 experts jointly developed a labeling guide. This document served as a reference in labeling. After going through a series of processes, including cleaning (removing duplications, irrelevant tweets, and tweets written other than in Indonesian) and preprocessing, we prepared a dataset containing 13,435 tweets. We measured the inter-annotator agreement level, made several models using different algorithms and the K-Fold cross-validation method, and evaluated the models. The Fleiss' Kappa value of the dataset was 0.62187 as the value of the inter-annotator agreement level, and the F1-score value with the best model using the pre-trained IndoBERT model was 0.7959. The Fleiss' Kappa and F1-score values suggest that the annotators have a substantial comprehension and agreement of how to label a tweet, thus ensuring consistency and reliability of our dataset, and the reusability of our dataset is quite suitable for further research on sentiment analysis on biodiversity, respectively. This dataset will benefit various research, including topic modeling, sentiment analysis, public opinion analysis on Twitter, etc., especially biodiversity-related policies.
KW - Environmental management
KW - Food security
KW - Health
KW - Indonesian
KW - Natural language processing
KW - Sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85178998149&partnerID=8YFLogxK
U2 - 10.1016/j.dib.2023.109890
DO - 10.1016/j.dib.2023.109890
M3 - Article
AN - SCOPUS:85178998149
SN - 2352-3409
VL - 52
JO - Data in Brief
JF - Data in Brief
M1 - 109890
ER -