TY - GEN
T1 - Extreme Multilabel Text Classification on Indonesian Tax Court Ruling using Single Channel CNN and IndoBERT Embedding
AU - Khasanah, Isnaini Nurul
AU - Krisnadhi, Adila Alfa
N1 - Funding Information:
We sincerely thank The Indonesia Endowment Fund for Education (LPDP), Ministry of Finance, Republic of Indonesia, and the Tokopedia-UI AI Center of Excellence for their support.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Manually searching for a legal basis, such as paragraphs, articles, and laws, when preparing for a tax court hearing is time-consuming. In this paper, we use an extreme multilabel text classification approach to predict the paragraphs, articles, and laws relevant to an appeal in Indonesian Tax Court Ruling documents. Traditional machine learning methods, such as random forest, can perform well on an extreme multilabel text classification problem but require training a huge number of separate classifiers. Meanwhile, deep learning methods such as convolutional neural networks (CNN) can effectively solve the extreme multilabel text classification problem. Furthermore, the use of IndoBERT embedding to represent Indonesian text in multilabel classification problems has not been explored much. This research proposes a single channel CNN model with IndoBERT embedding to solve extreme multilabel text classification problems on Indonesian Tax Court Ruling documents. We use three labeling scenarios: paragraph-level, article-level, and law-level. Our experiments demonstrate that our proposed model (CNN+IndoBERT) outperforms the single channel CNN with Word2Vec embedding and the single channel CNN with fastText embedding in all three labeling scenarios. In addition, our model outperforms the multiple channel CNN with IndoBERT embedding in both the paragraph-level and article-level label scenarios.
KW - cnn
KW - extreme multilabel
KW - fasttext
KW - indobert
KW - mlp
KW - multilabel
KW - neural network
KW - random forest
KW - text classification
KW - word embedding
UR - http://www.scopus.com/inward/record.url?scp=85124340349&partnerID=8YFLogxK
U2 - 10.1109/IWBIS53353.2021.9631855
DO - 10.1109/IWBIS53353.2021.9631855
M3 - Conference contribution
AN - SCOPUS:85124340349
T3 - Proceedings - IWBIS 2021: 6th International Workshop on Big Data and Information Security
SP - 59
EP - 66
BT - Proceedings - IWBIS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Workshop on Big Data and Information Security, IWBIS 2021
Y2 - 23 October 2021 through 26 October 2021
ER -