TY - GEN
T1 - A Comparison between SymSpell and a Combination of Damerau-Levenshtein Distance with the Trie Data Structure
AU - Audah, Hanif Arkan
AU - Yuliawati, Arlisa
AU - Alfina, Ika
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - A non-word error is a spelling error that produces a string that is not a known word in the dictionary. This study compares two non-word error correction methods for Indonesian: SymSpell and a combination of Damerau-Levenshtein distance with the trie data structure (DLTrie). We evaluated both methods on isolated-word and context-dependent cases. For SymSpell, we implemented two variants: weighted and unweighted. We also enriched the KBBI V dictionary with additional words from Wiktionary to form an Indonesian dictionary of 91,557 words. To evaluate both methods, we built a synthetic dataset containing 58,532 misspellings. The evaluation measures best-match accuracy, candidate accuracy, and run time. The experiments show that for isolated-word cases, SymSpell outperformed DLTrie, obtaining a higher best-match accuracy and a lower run time. The best-performing implementation was weighted SymSpell, with a best-match accuracy of 66.79%, a candidate accuracy of 99.33%, and a run time of 0.39 ms per word. For context-dependent cases, SymSpell obtained a slightly lower best-match accuracy than DLTrie (89.58% versus 89.93%), but it was faster by several orders of magnitude.
KW - damerau-levenshtein
KW - edit distance
KW - isolated-word error correction
KW - non-word error
KW - spell checker
KW - symspell
UR - http://www.scopus.com/inward/record.url?scp=85184662724&partnerID=8YFLogxK
U2 - 10.1109/ICAICTA59291.2023.10390399
DO - 10.1109/ICAICTA59291.2023.10390399
M3 - Conference contribution
AN - SCOPUS:85184662724
T3 - 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2023
BT - 2023 10th International Conference on Advanced Informatics
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th International Conference on Advanced Informatics: Concept, Theory and Application, ICAICTA 2023
Y2 - 7 October 2023 through 9 October 2023
ER -