TY - GEN
T1 - Creating Indonesian-Javanese parallel corpora using Wikipedia articles
AU - Trisedya, Bayu Distiawan
AU - Inastra, Dyah
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/3/23
Y1 - 2014/3/23
N2 - Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages. Parallel corpora of ethnic languages have usually been collected manually. Meanwhile, Wikipedia, a free online encyclopedia, supports more languages each year, including ethnic languages of Indonesia. It has become one of the largest multilingual sites on the World Wide Web that provides freely distributed articles. In this paper, we explore a few sentence alignment methods that have previously been used in another domain. We examine whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. We used two sentence alignment approaches, treating Wikipedia as both a parallel corpus and a comparable corpus. In the parallel corpus case, we used sentence-length-based and word-correspondence methods. In the comparable corpus case, we used the characteristics of Wikipedia's hypertext links. The experiments show that Wikipedia is useful enough for this purpose, as both approaches gave positive results.
AB - Parallel corpora are necessary for multilingual research, especially in information retrieval (IR) and natural language processing (NLP). However, such corpora are hard to find, particularly for low-resource languages such as ethnic languages. Parallel corpora of ethnic languages have usually been collected manually. Meanwhile, Wikipedia, a free online encyclopedia, supports more languages each year, including ethnic languages of Indonesia. It has become one of the largest multilingual sites on the World Wide Web that provides freely distributed articles. In this paper, we explore a few sentence alignment methods that have previously been used in another domain. We examine whether Wikipedia can serve as a resource for collecting parallel corpora of Indonesian and Javanese, an ethnic language of Indonesia. We used two sentence alignment approaches, treating Wikipedia as both a parallel corpus and a comparable corpus. In the parallel corpus case, we used sentence-length-based and word-correspondence methods. In the comparable corpus case, we used the characteristics of Wikipedia's hypertext links. The experiments show that Wikipedia is useful enough for this purpose, as both approaches gave positive results.
UR - http://www.scopus.com/inward/record.url?scp=84946685322&partnerID=8YFLogxK
U2 - 10.1109/ICACSIS.2014.7065828
DO - 10.1109/ICACSIS.2014.7065828
M3 - Conference contribution
AN - SCOPUS:84946685322
T3 - Proceedings - ICACSIS 2014: 2014 International Conference on Advanced Computer Science and Information Systems
SP - 239
EP - 245
BT - Proceedings - ICACSIS 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2014
Y2 - 18 October 2014 through 19 October 2014
ER -