Dictionary-based Word Segmentation for Javanese

Dipta Tanaya, Mirna Adriani

Research output: Contribution to journal › Conference article › peer-review

6 Citations (Scopus)

Abstract

Word segmentation is the first step in processing a language that is written in non-Latin letters, such as Javanese script. In this study, we report our work on word segmentation using a dictionary-based approach. In the first phase, we generate all possible segmented word series using a word dictionary. The correct word is then selected based on the last character in a word, the last two characters in a word, the difference between two consecutive words, and the frequency of the word in an additional corpus. The experimental results show that identifying words using their frequency in the additional corpus yields the best accuracy, 91.08%.
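The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `all_segmentations` and `pick_by_frequency`, the Latin-letter toy dictionary, the frequency counts, and the choice to score a candidate by the sum of its word frequencies are all assumptions made for illustration, since the abstract does not specify these details.

```python
from functools import lru_cache


def all_segmentations(text, dictionary):
    """Return every split of `text` whose pieces all appear in `dictionary`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def pick_by_frequency(candidates, freq):
    """Pick the candidate with the highest total corpus frequency.

    Summing frequencies is an assumed scoring rule, not taken from the paper.
    Returns None when there are no candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda words: sum(freq.get(w, 0) for w in words))


# Toy data (hypothetical): a tiny dictionary and made-up corpus counts.
dictionary = {"the", "them", "men", "en"}
freq = {"the": 100, "them": 30, "men": 20, "en": 1}

candidates = all_segmentations("themen", dictionary)
best = pick_by_frequency(candidates, freq)
print(candidates)  # [['the', 'men'], ['them', 'en']]
print(best)        # ['the', 'men']
```

Enumerating every segmentation can grow exponentially with input length, so a practical system would likely score candidates during the search rather than materializing all of them first.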

Original language: English
Pages (from-to): 208-213
Number of pages: 6
Journal: Procedia Computer Science
Volume: 81
Publication status: Published - 1 Jan 2016
Event: 5th Workshop on Spoken Language Technologies for Under-resourced languages, SLTU 2016 - Yogyakarta, Indonesia
Duration: 9 May 2016 to 12 May 2016

Keywords

  • javanese character
  • word segmentation
