Creating Indonesian-Javanese parallel corpora using Wikipedia articles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages, for which parallel corpora have usually been collected manually. Meanwhile, Wikipedia, a free online encyclopedia, supports more languages each year, including ethnic languages of Indonesia, and has become one of the largest multilingual sites on the World Wide Web providing freely distributed articles. In this paper, we explore several sentence alignment methods that have previously been used in other domains, to check whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. We used two sentence alignment approaches, treating Wikipedia as parallel corpora and as comparable corpora. In the parallel corpora case, we used sentence-length-based and word-correspondence methods. In the comparable corpora case, we used the characteristics of Wikipedia's hypertext links. The experiments show that Wikipedia is sufficiently useful for this purpose: both approaches gave positive results.
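To make the sentence-length-based idea concrete, the following is a minimal sketch of length-based alignment by dynamic programming. It is not the authors' implementation and not a faithful reproduction of any published algorithm: the function names (`length_cost`, `align_by_length`), the allowed groupings in `BEADS`, and the penalty values are all illustrative assumptions. The only idea it carries over is that sentences which translate each other tend to have similar character lengths.

```python
from typing import List, Tuple

# Allowed groupings ("beads"): (source sentences used, target sentences used),
# each with a hand-picked penalty. These values are assumptions for illustration.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch in [0, 1]; 0 means equal lengths."""
    longest = max(src_len, tgt_len)
    return 0.0 if longest == 0 else abs(src_len - tgt_len) / longest


def align_by_length(
    src: List[str], tgt: List[str]
) -> List[Tuple[List[int], List[int]]]:
    """Return the cheapest monotone alignment as (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Backward pass: follow back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Called on two sentence lists, `align_by_length` returns groups of sentence indices, so a source sentence that was split into two target sentences appears as a single group with one source index and two target indices. A practical system would replace the hand-picked penalties with a probabilistic length model and combine the length signal with lexical evidence such as the word-correspondence method mentioned in the abstract.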

Original language: English
Title of host publication: Proceedings - ICACSIS 2014
Subtitle of host publication: 2014 International Conference on Advanced Computer Science and Information Systems
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 239-245
Number of pages: 7
ISBN (Electronic): 9781479980758
Publication status: Published - 23 Mar 2014
Event: 2014 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2014 - Jakarta, Indonesia
Duration: 18 Oct 2014 to 19 Oct 2014

Publication series

Name: Proceedings - ICACSIS 2014: 2014 International Conference on Advanced Computer Science and Information Systems

Conference

Conference: 2014 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2014
Country/Territory: Indonesia
City: Jakarta
Period: 18/10/14 to 19/10/14
