A study using machine learning with Ngram model in harmonized system classification

Prihastuti Harsani, Adang Suhendra, Lily Wulandari, Wahyu Catur Wibowo

Research output: Contribution to journal › Article › peer-review

Abstract

The Harmonized System (HS) is a systematic classification of goods intended to facilitate taxation, trade transactions, transportation and statistics, and it improves on the previous classification system. In international trade (import and export), each item to be traded must be assigned an HS code based on the description that accompanies the goods. The text description of imported goods is mapped to the classification of imported goods regulated in the 2017 Indonesian Customs Tariff Book (BTKI). The BTKI contains the goods classification system applicable in Indonesia, including the Provisions for Interpretation (KUMHS), Notes, and the Goods Classification Structure, compiled on the basis of the ASEAN Harmonized Tariff Nomenclature (AHTN). Classifying goods by HS code faces several challenges, including the complexity of the HS, gaps in HS terminology, and the limited amount of text in goods descriptions. This study reports an experiment that applies machine learning to the classification of imported goods. The research treats the task as short text categorization: the documents are short texts, in keeping with the characteristics of goods descriptions, and the descriptions that accompany the transaction data average 7 words. Three methods were tested: LibShortText, text categorization (Text) and topic modeling. The features are based on the N-gram model, and the feature extraction methods are Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA). Classification is performed on 8-digit HS codes. The evaluation shows that LibShortText achieves the highest accuracy and F-score, followed by text categorization and topic modeling. SVM and KNN give different classification results. Based on the experimental results, it cannot yet be concluded whether increasing N in the N-gram model yields a better F-score on short texts.
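
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of word N-gram features weighted by TF-IDF and fed to a 1-nearest-neighbour classifier with cosine similarity. It is not the authors' implementation and does not use LibShortText or an SVM. The function names, the smoothed IDF formula, the toy descriptions and the labels CODE_A and CODE_B are all assumptions made for illustration, and the labels are placeholders, not real HS codes.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a short description."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_idf(docs, n_max=2):
    """Compute a smoothed inverse document frequency for every n-gram seen."""
    df = Counter()
    for doc in docs:
        df.update(set(word_ngrams(doc, n_max)))
    n_docs = len(docs)
    # Smoothing choice is an assumption, not taken from the paper.
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, idf, n_max=2):
    """Build an L2-normalised sparse TF-IDF vector; unseen n-grams are dropped."""
    tf = Counter(g for g in word_ngrams(text, n_max) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def predict_1nn(text, train_vecs, labels, idf, n_max=2):
    """Return the label of the most similar training description."""
    query = tfidf_vector(text, idf, n_max)
    scores = [cosine(query, v) for v in train_vecs]
    return labels[scores.index(max(scores))]


# Hypothetical toy data: placeholder descriptions and labels, not real HS codes.
train_docs = [
    "frozen boneless beef meat",
    "fresh chilled beef cuts",
    "cotton woven fabric dyed",
    "woven fabric of cotton printed",
]
train_labels = ["CODE_A", "CODE_A", "CODE_B", "CODE_B"]

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(d, idf) for d in train_docs]
prediction = predict_1nn("dyed cotton fabric", train_vecs, train_labels, idf)
print(prediction)
```

Raising n_max adds longer N-grams to the feature set. With descriptions of about seven words, most long N-grams occur in only one document, which is one plausible reason the effect of a larger N on the F-score remains inconclusive.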

Original language: English
Pages (from-to): 145-153
Number of pages: 9
Journal: Journal of Advanced Research in Dynamical and Control Systems
Volume: 12
Issue number: 6 Special Issue
Publication status: Published - 2020

Keywords

  • HS
  • KNN
  • LDA
  • Ngram
  • SVM
  • TF-IDF
