Abstract
Word segmentation is the first step in processing languages written in non-Latin scripts such as Javanese script. In this study, we report our work on word segmentation using a dictionary-based approach. In the first phase, we generate all possible segmented word sequences using a word dictionary. The correct word is then selected based on four criteria: the last character in a word, the last two characters in a word, the difference between two consecutive words, and the frequency of the word in an additional corpus. The experimental results show that identifying words using their frequency in the additional corpus yields the best accuracy, 91.08%.
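The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the romanized toy dictionary, the frequency counts, and the scoring rule (highest total corpus frequency wins) are all assumptions made for illustration, since the abstract does not specify how frequencies are combined.

```python
from functools import lru_cache


def all_segmentations(text, dictionary):
    """Phase 1: return every split of `text` into words found in `dictionary`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one valid (empty) continuation.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def pick_by_frequency(candidates, freq):
    """Phase 2 (assumed rule): keep the candidate with the highest total frequency."""
    return max(candidates, key=lambda words: sum(freq.get(w, 0) for w in words))


# Hypothetical romanized toy data, standing in for Javanese-script input.
dictionary = {"a", "aku", "kumang", "an", "mangan"}
freq = {"aku": 50, "mangan": 30, "a": 5, "kumang": 1, "an": 3}

candidates = all_segmentations("akumangan", dictionary)
best = pick_by_frequency(candidates, freq)
```

Here the unsegmented string has two dictionary-consistent readings, and the frequency table is what separates them. The other selection criteria mentioned in the abstract (last character, last two characters, difference between consecutive words) would replace only the `pick_by_frequency` step.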
Original language | English |
---|---|
Pages (from-to) | 208-213 |
Number of pages | 6 |
Journal | Procedia Computer Science |
Volume | 81 |
Publication status | Published - 2016 |
Event | 5th Workshop on Spoken Language Technologies for Under-resourced Languages, SLTU 2016 - Yogyakarta, Indonesia. Duration: 9 May 2016 → 12 May 2016 |
Keywords
- Javanese character
- word segmentation