A Comparison between SymSpell and a Combination of Damerau-Levenshtein Distance with the Trie Data Structure

Hanif Arkan Audah, Arlisa Yuliawati, Ika Alfina

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A non-word error is a spelling error that produces a string not found in the dictionary. This study compares two non-word error correction methods for Indonesian: SymSpell and a combination of Damerau-Levenshtein distance with the trie data structure (DLTrie). We evaluated the performance of both methods for isolated-word and context-dependent cases. For SymSpell, we implemented two variants: weighted and unweighted. We also enriched the KBBI V dictionary with additional words from Wiktionary to form an Indonesian dictionary of 91,557 words. To evaluate both methods, we built a synthetic dataset containing 58,532 misspellings. The evaluation measures best-match accuracy, candidate accuracy, and run time. The experiment shows that for isolated-word cases, SymSpell outperformed DLTrie, obtaining a higher best-match accuracy and a lower run time. The best-performing SymSpell implementation is the weighted variant, which achieved a best-match accuracy of 66.79%, a candidate accuracy of 99.33%, and a run time of 0.39 ms per word. For context-dependent cases, on the other hand, SymSpell obtained a slightly lower best-match accuracy (89.58%) than DLTrie (89.93%), but it was faster by several orders of magnitude.
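
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the commonly described symmetric-delete idea behind SymSpell (index every dictionary word under its deletion variants, then look up the deletion variants of the misspelling) and verifies candidates with the restricted Damerau-Levenshtein (optimal string alignment) distance. The function names, the toy word list, and the maximum distance of 2 are illustrative assumptions; the weighted variant and the trie-based DLTrie search are not shown.

```python
from collections import defaultdict
from itertools import combinations


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def deletes(word, max_dist):
    """All strings obtained from `word` by deleting up to `max_dist` characters."""
    out = {word}
    for k in range(1, min(max_dist, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out


def build_index(dictionary, max_dist=2):
    """Map every deletion variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_dist):
            index[variant].add(word)
    return index


def lookup(term, index, max_dist=2):
    """Return (distance, word) pairs within `max_dist`, closest first."""
    candidates = set()
    for variant in deletes(term, max_dist):
        candidates |= index.get(variant, set())
    scored = ((osa_distance(term, word), word) for word in candidates)
    return sorted(pair for pair in scored if pair[0] <= max_dist)


if __name__ == "__main__":
    # Toy word list for illustration only (not the paper's 91,557-word dictionary).
    index = build_index(["makan", "makna", "malam", "minum"])
    print(lookup("mkaan", index))
```

In this sketch the first element of the returned list plays the role of the best match, and the full list plays the role of the candidate set. Sharing a deletion variant only nominates a word; the distance check is what filters out candidates that are actually farther than the allowed maximum.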

Original language: English
Title of host publication: 2023 10th International Conference on Advanced Informatics
Subtitle of host publication: Concept, Theory and Application, ICAICTA 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350329919
Publication status: Published - 2023
Event: 10th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2023 - Lombok, Indonesia
Duration: 7 Oct 2023 → 9 Oct 2023

Publication series

Name: 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2023

Conference

Conference: 10th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2023
Country/Territory: Indonesia
City: Lombok
Period: 7/10/23 → 9/10/23

Keywords

  • damerau-levenshtein
  • edit distance
  • isolated-word error correction
  • non-word error
  • spell checker
  • symspell
