Visual question answering for monas tourism object using deep learning

Ahmad Hasan Siregar, Dina Chahyati

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Visual Question Answering (VQA) is a machine learning task: given an image and a natural-language question about that image, the task is to answer the question. No public VQA dataset is known to be currently available in Bahasa Indonesia. To address this gap, this research compiles a Monas VQA dataset that uses Bahasa Indonesia for the questions and Monas, an Indonesian memorial monument, as the image-specific context. This research also proposes methods for solving VQA that use a CNN for image embedding; techniques from the NLP field for sentence embedding, e.g. Bag-of-Words, fastText, BERT, and BiLSTM; and multimodal machine learning to let the two embeddings interact with each other. The best-performing model achieves 68.39% accuracy, and an analysis of the impact of the architecture is presented.
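
As a rough illustration of the pipeline the abstract describes, the sketch below builds a Bag-of-Words question embedding and fuses it with an image embedding by concatenation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the example questions, and the two-element image vector (a stand-in for CNN output) are all hypothetical, and concatenation is only one simple way to let the two embeddings interact.

```python
from collections import Counter


def build_vocab(questions):
    """Map each distinct lowercase token to an index, in first-seen order."""
    vocab = {}
    for question in questions:
        for token in question.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_embed(question, vocab):
    """Bag-of-Words vector: per-token counts; tokens outside the vocabulary are ignored."""
    counts = Counter(question.lower().split())
    vector = [0.0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = float(count)
    return vector


def fuse_concat(image_vec, question_vec):
    """Simplest multimodal fusion: concatenate the image and question embeddings."""
    return list(image_vec) + list(question_vec)


def accuracy(predicted, gold):
    """Fraction of predicted answers that exactly match the gold answers."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Hypothetical Bahasa Indonesia questions, written without punctuation
    # because this sketch tokenizes on whitespace only.
    train_questions = ["apa warna monas", "di mana monas berada"]
    vocab = build_vocab(train_questions)

    # Assumed stand-in for a CNN image embedding; a real one would be much longer.
    image_vec = [0.2, 0.7]

    joint = fuse_concat(image_vec, bow_embed("apa warna monas", vocab))
    print(joint)
```

In a full model, the fused vector would feed a classifier over candidate answers, and `accuracy` would compare its predictions with the gold answers on a held-out split.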

Original language: English
Title of host publication: 2020 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 381-386
Number of pages: 6
ISBN (Electronic): 9781728192796
DOIs
Publication status: Published - 17 Oct 2020
Event: 12th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2020 - Virtual, Depok, Indonesia
Duration: 17 Oct 2020 to 18 Oct 2020

Publication series

Name: 2020 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2020

Conference

Conference: 12th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2020
Country/Territory: Indonesia
City: Virtual, Depok
Period: 17/10/20 to 18/10/20

Keywords

  • Bahasa Indonesia
  • Dataset building
  • Embedding
  • Multimodal machine learning
  • Visual Question Answering (VQA)
