TY - JOUR
T1 - Character recognition system for pegon typed manuscript
AU - Ruldeviyani, Yova
AU - Suhartanto, Heru
AU - Sotardodo, Beltsazar Anugrah
AU - Fahreza, Muhammad Hanif
AU - Septiano, Andre
AU - Rachmadi, Muhammad Febrian
N1 - Publisher Copyright: © 2024
PY - 2024/8/30
Y1 - 2024/8/30
AB - The Pegon script is an Arabic-based writing system used for the Javanese, Sundanese, Madurese, and Indonesian languages. For various reasons, the script is now found mainly among collectors and in private Islamic boarding schools (pesantren), creating a need for its preservation. One preservation method is digitization through transcription into machine-encoded text, known as optical character recognition (OCR). No published literature exists on OCR systems for this specific script. This research explores OCR of typed Pegon manuscripts and introduces novel synthesized and real annotated datasets for the task. These datasets are used to evaluate the proposed OCR methods, especially those adapted from existing Arabic OCR systems. Results show that deep learning techniques outperform conventional ones, which fail to detect Pegon text. The proposed system uses YOLOv5 for line segmentation and a CTC-CRNN architecture for line text recognition, achieving an F1-score of 0.94 for segmentation and a character error rate (CER) of 0.03 for recognition.
KW - Arabic
KW - Character recognition
KW - Deep learning
KW - Pegon
KW - Segmentation
UR - http://www.scopus.com/inward/record.url?scp=85200973530&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2024.e35959
DO - 10.1016/j.heliyon.2024.e35959
M3 - Article
AN - SCOPUS:85200973530
SN - 2405-8440
VL - 10
JO - Heliyon
JF - Heliyon
IS - 16
M1 - e35959
ER -