TY - JOUR
T1 - Automatic Extraction of Indonesian Stopwords
AU - Achsan, Harry Tursulistyono Yani
AU - Suhartanto, Heru
AU - Wibowo, Wahyu Catur
AU - Dewi, Deshinta A.
AU - Ismed, Khairul
N1 - Funding Information:
Excellent Research Grants of Higher Education (PT-UPT), Directorate General of Higher Education, Research, and Technology, Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia (Ditjen Dikti-Kemdikbud) titled "Representation of Multi Talents of Covid-19 Expert Based on indexed publication data, 2019 & 2020".
Publisher Copyright:
© 2023, International Journal of Advanced Computer Science and Applications. All Rights Reserved.
PY - 2023
Y1 - 2023
N2 - The rapid growth of Indonesian-language content on the Internet has drawn researchers’ attention. Using natural language processing, they can extract high-value information from such content and documents. However, processing large numbers of large documents is time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing common words, or stopwords. This research aims to extract stopwords automatically from a large Indonesian corpus of about seven million words downloaded from the web. The problem is that Indonesian is a low-resource language, which makes it challenging to develop an automatic stopword extractor. The method is based on Term Frequency-Inverse Document Frequency (TF-IDF): this paper presents a methodology for ranking stopwords using TFs and IDFs that is applicable even to a small corpus (as small as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required.
KW - attributes reduction
KW - Indonesian stopwords
KW - large corpus
KW - NLP
KW - Stopwords extraction
KW - TF-IDF
UR - http://www.scopus.com/inward/record.url?scp=85149688462&partnerID=8YFLogxK
U2 - 10.14569/IJACSA.2023.0140221
DO - 10.14569/IJACSA.2023.0140221
M3 - Article
AN - SCOPUS:85149688462
SN - 2158-107X
VL - 14
SP - 166
EP - 171
JO - International Journal of Advanced Computer Science and Applications
JF - International Journal of Advanced Computer Science and Applications
IS - 2
ER -