TY - GEN
T1 - Developing a Singlish Neural Language Model using ELECTRA
AU - Gotera, Galangkangin
AU - Prasojo, Radityo Eko
AU - Isal, Yugo Kartono
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - We develop and benchmark a Singlish pretrained neural language model. To this end, we build a novel 3 GB Singlish free-text dataset collected from various Singaporean websites. We then leverage ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) to train a transformer-based Singlish language model. ELECTRA is chosen for its resource efficiency, which helps ensure reproducibility. We further build two Singlish text classification datasets: sentiment analysis and language identification. We use the two datasets to fine-tune our ELECTRA model and benchmark the results against other available pretrained models in English and Singlish. Our experiments show that our Singlish ELECTRA model is competitive with the best open-source models we found despite being pretrained in significantly less time. We publicly release the benchmarking dataset.
AB - We develop and benchmark a Singlish pretrained neural language model. To this end, we build a novel 3 GB Singlish free-text dataset collected from various Singaporean websites. We then leverage ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) to train a transformer-based Singlish language model. ELECTRA is chosen for its resource efficiency, which helps ensure reproducibility. We further build two Singlish text classification datasets: sentiment analysis and language identification. We use the two datasets to fine-tune our ELECTRA model and benchmark the results against other available pretrained models in English and Singlish. Our experiments show that our Singlish ELECTRA model is competitive with the best open-source models we found despite being pretrained in significantly less time. We publicly release the benchmarking dataset.
KW - benchmarking dataset
KW - ELECTRA
KW - language model pretraining
KW - Singlish
UR - http://www.scopus.com/inward/record.url?scp=85142026572&partnerID=8YFLogxK
U2 - 10.1109/ICACSIS56558.2022.9923521
DO - 10.1109/ICACSIS56558.2022.9923521
M3 - Conference contribution
AN - SCOPUS:85142026572
T3 - Proceedings - ICACSIS 2022: 14th International Conference on Advanced Computer Science and Information Systems
SP - 235
EP - 240
BT - Proceedings - ICACSIS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2022
Y2 - 1 October 2022 through 3 October 2022
ER -