Automatic Extraction of Indonesian Stopwords

Harry Tursulistyono Yani Achsan, Heru Suhartanto, Wahyu Catur Wibowo, Deshinta A. Dewi, Khairul Ismed

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

The rapid growth of Indonesian-language content on the Internet has drawn researchers’ attention. Natural language processing lets them extract high-value information from this content, but processing many large documents is time-consuming and computationally expensive. One way to reduce the cost is attribute reduction: removing common words, or stopwords. This research aims to extract stopwords automatically from a large Indonesian corpus of about seven million words downloaded from the web. Because Indonesian is a low-resource language, developing an automatic stopword extractor is challenging. The method is based on Term Frequency-Inverse Document Frequency (TF-IDF): it ranks stopword candidates using TFs and IDFs and is applicable even to a small corpus (as small as one document). It is automatic and can be applied to many different languages with no prior linguistic knowledge required.
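The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the score used here (relative term frequency divided by a smoothed IDF) is an assumption for illustration, not the authors' method. The function name `rank_stopword_candidates`, the regex tokenizer, and the sample sentences are likewise hypothetical.

```python
import math
import re
from collections import Counter


def rank_stopword_candidates(documents, top_k=10):
    """Rank words so that frequent, widely spread words come first.

    Assumed score (not taken from the paper): tf / idf, where
    tf is the word's share of all tokens in the corpus and
    idf = ln((1 + N) / (1 + df)) + 1 is a smoothed IDF that is always >= 1.
    """
    # Naive tokenizer (assumption): lowercase runs of ASCII letters.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)

    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the word
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    total_tokens = sum(term_freq.values())
    scores = {}
    for word, count in term_freq.items():
        tf = count / total_tokens
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        scores[word] = tf / idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    docs = [
        "saya dan kamu pergi ke pasar",
        "ibu dan ayah pergi ke kantor",
        "adik dan kakak bermain di taman",
    ]
    print(rank_stopword_candidates(docs, top_k=3))  # ['dan', 'ke', 'pergi']
```

With a single document the smoothed IDF equals 1 for every word, so this score reduces to plain term frequency. That is one simple way a TF/IDF ranking can remain usable on a one-document corpus, as the abstract describes.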

Original language: English
Pages (from-to): 166-171
Number of pages: 6
Journal: International Journal of Advanced Computer Science and Applications
Volume: 14
Issue number: 2
Publication status: Published - 2023

Keywords

  • attributes reduction
  • Indonesian stopwords
  • large corpus
  • NLP
  • Stopwords extraction
  • TF-IDF
