Optical Character Recognition Engines Performance Comparison in Information Extraction

Tosan Wiar Ramdhani, Indra Budi, Betty Purwandari

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for documents produced through Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examines which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. The study attempted to automatically extract several essential entities from the documents: name, employee ID, document number, document publishing date, employee rank, and family member's name. The NER processes were implemented in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its benefits and drawbacks; for example, Tesseract has better NER extraction and conversion time with better accuracy, but falls short in the number of entities acquired.
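The comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which NER technique, entity formats, or scoring formulas the authors used, so everything here is an assumption: the regular expressions, the engine names `engine_a` and `engine_b`, the sample text, and the `extract_entities` and `score` helpers are all hypothetical. A simple rule-based matcher stands in for the NER step, and the score reports two of the kinds of measurement the abstract mentions, the number of entities acquired and their accuracy.

```python
import re

# Hypothetical entity patterns; the real ID and document-number formats
# are not given in the abstract.
PATTERNS = {
    "employee_id": re.compile(r"\b\d{18}\b"),
    "document_number": re.compile(r"\b\d+/[A-Z]+/\d{4}\b"),
    "publishing_date": re.compile(r"\b\d{2}-\d{2}-\d{4}\b"),
}


def extract_entities(text):
    """Return the first match of each entity pattern found in the OCR text."""
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[label] = match.group(0)
    return found


def score(extracted, truth):
    """Return (entities acquired, share of acquired entities equal to the truth)."""
    acquired = len(extracted)
    correct = sum(1 for label, value in extracted.items() if truth.get(label) == value)
    accuracy = correct / acquired if acquired else 0.0
    return acquired, accuracy


# Made-up ground truth and OCR outputs for one document (illustration only).
truth = {
    "employee_id": "198501012010011001",
    "document_number": "123/HR/2020",
    "publishing_date": "01-02-2020",
}
ocr_outputs = {
    "engine_a": "ID 198501012010011001 No 123/HR/2020 dated 01-02-2020",
    # engine_b misreads a zero as the letter O in the document number.
    "engine_b": "ID 198501012010011001 No 123/HR/2O20 dated 01-02-2020",
}

results = {
    engine: score(extract_entities(text), truth)
    for engine, text in ocr_outputs.items()
}
```

In this toy data a single misread character makes the document-number pattern fail for `engine_b`, so that engine acquires fewer entities even though every entity it does return is correct. This shows how accuracy and the number of entities acquired can diverge for the same engine.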

Original language: English
Pages (from-to): 120-127
Number of pages: 8
Journal: International Journal of Advanced Computer Science and Applications
Volume: 12
Issue number: 8
Publication status: Published - 2021

Keywords

  • government human resources documents
  • information extraction
  • named entity recognition
  • optical character recognition
