IndoKEPLER, IndoWiki, and IndoLAMA: A Knowledge-enhanced Language Model, Dataset, and Benchmark for the Indonesian Language

Inigo Ramli, Adila Alfa Krisnadhi, Radityo Eko Prasojo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pretrained language models possess the ability to learn the structural representation of a natural language by processing unstructured textual data. However, current language model designs lack the ability to learn factual knowledge from knowledge graphs. Several attempts have been made to address this issue, such as the development of KEPLER. KEPLER combines the BERT language model and the TransE knowledge embedding method to produce a language model that can incorporate knowledge graphs as training data. Unfortunately, no such knowledge-enhanced language model is yet available for the Indonesian language. In this experiment, we propose IndoKEPLER: a language model trained using Wikipedia Bahasa Indonesia and Wikidata. We also create a new knowledge probing benchmark named IndoLAMA to test the ability of a language model to recall factual knowledge. The benchmark is based on LAMA, which is designed to test the suitability of a language model for use as a knowledge base. IndoLAMA tests a language model by posing cloze-style questions and comparing the model's predictions to the factually correct answers. This experiment shows that IndoKEPLER increases the ability of a normal DistilBERT model to recall factual knowledge by 0.8%. Moreover, the most significant increase occurs for many-to-one relationships, where IndoKEPLER outperforms its original text encoder model by 3%.
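The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a TransE-style score (negative L1 distance between head + relation and tail) and a LAMA-style top-1 accuracy over cloze probes. The names transe_score, precision_at_1, PROBES, and toy_predict are hypothetical, and toy_predict is a canned stand-in for a real masked language model.

```python
from typing import Callable, Sequence, Tuple


def transe_score(head: Sequence[float],
                 relation: Sequence[float],
                 tail: Sequence[float]) -> float:
    """TransE-style plausibility: negative L1 distance of (head + relation) from tail.

    A score closer to zero means the triple is considered more plausible.
    """
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def precision_at_1(probes: Sequence[Tuple[str, str]],
                   predict: Callable[[str], str]) -> float:
    """Fraction of cloze probes whose top-1 prediction equals the gold answer."""
    if not probes:
        return 0.0
    hits = sum(
        1
        for prompt, gold in probes
        if predict(prompt).strip().lower() == gold.strip().lower()
    )
    return hits / len(probes)


# Illustrative Indonesian cloze probes (assumed examples, not taken from IndoLAMA).
PROBES = [
    ("Ibu kota Indonesia adalah [MASK].", "Jakarta"),
    ("Ibu kota Jepang adalah [MASK].", "Tokyo"),
]


def toy_predict(prompt: str) -> str:
    """Hypothetical stand-in for a masked language model's top-1 prediction."""
    canned = {"Ibu kota Indonesia adalah [MASK].": "Jakarta"}
    return canned.get(prompt, "Bandung")


if __name__ == "__main__":
    print("P@1:", precision_at_1(PROBES, toy_predict))
    print("TransE score:", transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
```

In practice toy_predict would be replaced by a call to a masked language model, and the probes would come from the benchmark itself. The scoring function is the only part that depends on the knowledge-embedding choice.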

Original language: English
Title of host publication: IWBIS 2022 - 7th International Workshop on Big Data and Information Security, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 19-26
Number of pages: 8
ISBN (Electronic): 9781665489508
DOIs
Publication status: Published - 2022
Event: 7th International Workshop on Big Data and Information Security, IWBIS 2022 - Depok, Indonesia
Duration: 1 Oct 2022 to 3 Oct 2022

Publication series

Name: IWBIS 2022 - 7th International Workshop on Big Data and Information Security, Proceedings

Conference

Conference: 7th International Workshop on Big Data and Information Security, IWBIS 2022
Country/Territory: Indonesia
City: Depok
Period: 1/10/22 to 3/10/22

Keywords

  • Indonesian language
  • knowledge embedding
  • knowledge graph
  • Language model
  • natural language processing
