TY - JOUR
T1 - Named entity recognition on Indonesian legal documents
T2 - a dataset and study using transformer-based models
AU - Yulianti, Evi
AU - Bhary, Naradhipa
AU - Abdurrohman, Jafar
AU - Dwitilas, Fariz Wahyuzan
AU - Nuranti, Eka Qadri
AU - Husin, Husna Sarirah
N1 - Publisher Copyright: © 2024 Institute of Advanced Engineering and Science. All rights reserved.
PY - 2024/10
Y1 - 2024/10
N2 - The large volume of court decision documents in Indonesia makes it difficult for researchers to help legal practitioners extract useful information from them. This information can also benefit the general public by improving legal transparency, law enforcement, and people's understanding of how the law is implemented in Indonesia. Named entity recognition (NER) is a natural language processing task that extracts important information from a document. In this study, the NER task is applied to the legal domain, where it is referred to as the legal entity recognition (LER) task. In this task, important legal entities, such as judges, prosecutors, and advocates, are extracted from the decision documents. A new Indonesian LER dataset, called IndoLER, is built, consisting of approximately 1K decision documents with 20 types of fine-grained legal entities. Transformer-based models, including multilingual bidirectional encoder representations from transformers (M-BERT), Indonesian BERT (IndoBERT), Indonesian robustly optimized BERT pretraining approach (IndoRoBERTa), and cross-lingual language model RoBERTa (XLM-RoBERTa or XLM-R), are then proposed to solve the Indonesian LER task using this dataset. Our experimental results show that the RoBERTa-based models, such as XLM-R and IndoRoBERTa, outperform the state-of-the-art deep-learning baselines using bidirectional long short-term memory (BiLSTM) and BiLSTM-conditional random field (BiLSTM-CRF) approaches by 7.2% to 7.9% and 2.1% to 2.6%, respectively. XLM-RoBERTa is the best-performing model, achieving an F1-score of 0.9295.
KW - documents transformer
KW - IndoBERT
KW - M-BERT
KW - Named entity recognition legal
KW - RoBERTa
KW - XLM-RoBERTa
UR - http://www.scopus.com/inward/record.url?scp=85201099200&partnerID=8YFLogxK
U2 - 10.11591/ijece.v14i5.pp5489-5501
DO - 10.11591/ijece.v14i5.pp5489-5501
M3 - Article
AN - SCOPUS:85201099200
SN - 2088-8708
VL - 14
SP - 5489
EP - 5501
JO - International Journal of Electrical and Computer Engineering
JF - International Journal of Electrical and Computer Engineering
IS - 5
ER -