Stemming Indonesian: A confix-stripping approach

Mirna Adriani, Jelita Asian, Bobby Achirul Awal Nazief, S. M.M. Tahaghoghi, Hugh E. Williams

Research output: Contribution to journal › Article › peer-review

95 Citations (Scopus)

Abstract

Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarization, and text classification. For example, English stemming reduces the words "computer," "computing," "computation," and "computability" to their common morphological root, "comput-." In text search, this permits a search for "computers" to find documents containing all words with the stem "comput-." In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. This work surveys existing techniques for stemming Indonesian words to their morphological roots, presents our novel and highly accurate confix-stripping (CS) algorithm, and explores the effectiveness of stemming in the context of general-purpose text information retrieval through ad hoc queries.

Original language: English
Article number: 13
Journal: ACM Transactions on Asian Language Information Processing
Volume: 6
Issue number: 4
Publication status: Published - 1 Dec 2007

Keywords

  • Indonesian
  • Information retrieval
  • Stemming
