TY - JOUR
T1 - Optical Character Recognition Engines Performance Comparison in Information Extraction
AU - Ramdhani, Tosan Wiar
AU - Budi, Indra
AU - Purwandari, Betty
N1 - Publisher Copyright: © 2021. International Journal of Advanced Computer Science and Applications. All Rights Reserved.
PY - 2021
Y1 - 2021
N2 - Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for text produced by Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examined which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. Several essential entities, such as name, employee ID, document number, document publishing date, employee rank, and family member's name, were targeted for automatic extraction from the documents. The NER processes were implemented in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its benefits and drawbacks; for example, Tesseract has better NER extraction and conversion time with better accuracy, but falls short in the number of entities acquired.
AB - Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for text produced by Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examined which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. Several essential entities, such as name, employee ID, document number, document publishing date, employee rank, and family member's name, were targeted for automatic extraction from the documents. The NER processes were implemented in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its benefits and drawbacks; for example, Tesseract has better NER extraction and conversion time with better accuracy, but falls short in the number of entities acquired.
KW - government human resources documents
KW - information extraction
KW - named entity recognition
KW - optical character recognition
UR - http://www.scopus.com/inward/record.url?scp=85118974484&partnerID=8YFLogxK
U2 - 10.14569/IJACSA.2021.0120814
DO - 10.14569/IJACSA.2021.0120814
M3 - Article
AN - SCOPUS:85118974484
SN - 2158-107X
VL - 12
SP - 120
EP - 127
JO - International Journal of Advanced Computer Science and Applications
JF - International Journal of Advanced Computer Science and Applications
IS - 8
ER -