TY - GEN
T1 - IndoKEPLER, IndoWiki, and IndoLAMA
T2 - 7th International Workshop on Big Data and Information Security, IWBIS 2022
AU - Ramli, Inigo
AU - Krisnadhi, Adila Alfa
AU - Prasojo, Radityo Eko
N1 - Funding Information:
ACKNOWLEDGEMENTS The authors acknowledge the support of computing resources in conducting this work from Tokopedia-UI AI Center of Excellence, Universitas Indonesia.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Pretrained language models possess the ability to learn structural representations of a natural language by processing unstructured textual data. However, current language model designs lack the ability to learn factual knowledge from knowledge graphs. Several attempts have been made to address this issue, such as the development of KEPLER. KEPLER combines the BERT language model and the TransE knowledge embedding method to produce a language model that can incorporate knowledge graphs as training data. Unfortunately, such a knowledge-enhanced language model is not yet available for the Indonesian language. In this experiment, we propose IndoKEPLER: a language model trained using Wikipedia Bahasa Indonesia and Wikidata. We also create a new knowledge probing benchmark named IndoLAMA to test the ability of a language model to recall factual knowledge. The benchmark is based on LAMA, which is designed to test the suitability of our language model to be used as a knowledge base. IndoLAMA tests a language model by posing cloze-style questions and comparing the model's predictions to the factually correct answers. This experiment shows that IndoKEPLER increases the ability of a plain DistilBERT model to recall factual knowledge by 0.8%. Moreover, the most significant increase occurs when dealing with many-to-one relationships, where IndoKEPLER outperforms its original text encoder model by 3%.
AB - Pretrained language models possess the ability to learn structural representations of a natural language by processing unstructured textual data. However, current language model designs lack the ability to learn factual knowledge from knowledge graphs. Several attempts have been made to address this issue, such as the development of KEPLER. KEPLER combines the BERT language model and the TransE knowledge embedding method to produce a language model that can incorporate knowledge graphs as training data. Unfortunately, such a knowledge-enhanced language model is not yet available for the Indonesian language. In this experiment, we propose IndoKEPLER: a language model trained using Wikipedia Bahasa Indonesia and Wikidata. We also create a new knowledge probing benchmark named IndoLAMA to test the ability of a language model to recall factual knowledge. The benchmark is based on LAMA, which is designed to test the suitability of our language model to be used as a knowledge base. IndoLAMA tests a language model by posing cloze-style questions and comparing the model's predictions to the factually correct answers. This experiment shows that IndoKEPLER increases the ability of a plain DistilBERT model to recall factual knowledge by 0.8%. Moreover, the most significant increase occurs when dealing with many-to-one relationships, where IndoKEPLER outperforms its original text encoder model by 3%.
KW - Indonesian language
KW - knowledge embedding
KW - knowledge graph
KW - Language model
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85141842381&partnerID=8YFLogxK
U2 - 10.1109/IWBIS56557.2022.9924844
DO - 10.1109/IWBIS56557.2022.9924844
M3 - Conference contribution
AN - SCOPUS:85141842381
T3 - IWBIS 2022 - 7th International Workshop on Big Data and Information Security, Proceedings
SP - 19
EP - 26
BT - IWBIS 2022 - 7th International Workshop on Big Data and Information Security, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 October 2022 through 3 October 2022
ER -