Machine Natural Language Translation Using Wikipedia As A Parallel Corpus: A Focus On Swahili


  • Department: Management
  • Project ID: MGT0079
  • Access Fee: ₦5,000
  • Pages: 114 Pages
  • Reference: YES
  • Format: Microsoft Word
  • Views: 360

The government of Kenya has undertaken an ambitious project to equip children with laptops and tablets to facilitate electronic learning. This initiative can bear fruit only if content relevant to the studies being undertaken is available. Many Kenyans learn English as a second language; their mother tongue is Swahili or another African language. Content in Swahili would therefore support a better and deeper understanding of the subject matter. Much of the academic content already exists, albeit in English, so translating it is the most practical way of making it available in Swahili. This is especially so because the content is not new; it only needs to be rendered in another language.

Machine translation engines such as Microsoft Translator and Google Translate already exist and aim to make this task easier. However, African languages are generally under-represented in these engines. The translations they produce into African languages are comparatively inaccurate, and they are even less accurate for academic content. This can largely be attributed to the data used to train the engines. Many machine translation engines rely on corpora made up of phrases found in everyday speech, in which academic terms are not adequately represented.

Wikipedia, an online crowdsourced encyclopedia, is a very good source of data for translation work. This study has shown that Wikipedia can serve as a viable corpus for academic translation, particularly for African languages.

Therefore, this project modeled an English-to-Swahili translation engine that uses Wikipedia as its source of translation corpus data. To be clear, this study did not set out to create yet another translation engine, but to improve on, and complement, one small aspect of existing engines. The approach was to compare corresponding articles across languages in Wikipedia and build a parallel corpus, which is then used to create a translation database. It is worth noting that Wikipedia on its own cannot provide a comprehensive data set for any machine translation engine. As a proof of concept, the model performs English-to-Swahili translation, and preliminary results are presented. Further work is required to align the output more accurately and to combine it in a way that ensures fluency and accuracy.
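
The corpus-building idea described above can be illustrated with a minimal sketch. It is not the study's implementation: it assumes that English and Swahili articles are paired by title through interlanguage links, which the abstract does not specify, and it uses a small hand-written sample in place of real Wikipedia data. The names `SAMPLE_LANGLINKS`, `build_parallel_corpus`, `build_translation_table`, and `translate_term` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative sample only (an assumption, not data from the study):
# English article title -> Swahili article title, or None when no
# Swahili counterpart is available in this sample.
SAMPLE_LANGLINKS: Dict[str, Optional[str]] = {
    "Water": "Maji",
    "Tree": "Mti",
    "Book": "Kitabu",
    "Photosynthesis": None,
}


def build_parallel_corpus(
    langlinks: Dict[str, Optional[str]]
) -> List[Tuple[str, str]]:
    """Keep only the titles that exist in both languages."""
    return [(en, sw) for en, sw in langlinks.items() if sw]


def build_translation_table(corpus: List[Tuple[str, str]]) -> Dict[str, str]:
    """Turn the aligned pairs into a case-insensitive lookup table."""
    return {en.lower(): sw.lower() for en, sw in corpus}


def translate_term(term: str, table: Dict[str, str]) -> Optional[str]:
    """Look up one English term; return None when it is not covered."""
    return table.get(term.strip().lower())


corpus = build_parallel_corpus(SAMPLE_LANGLINKS)
table = build_translation_table(corpus)
```

With this sample, `translate_term("Water", table)` returns `"maji"`, while `translate_term("Photosynthesis", table)` returns `None`. The missing entry reflects the point made above that Wikipedia alone cannot provide a comprehensive data set.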

This study was further motivated by the directive of the Communications Authority of Kenya that aims to have at least 60% of media content be local. Such content therefore needs to be translated into local languages for presentation. The study proposes a solution that can be scaled to learn and translate other local languages.

Finally, it is worth noting that Kenya, like many other developing countries, imports numerous products from foreign countries. Many of these products have labels and instructions written in foreign languages, most often English. This poses a potential threat to consumers who do not understand these languages, for example in the case of medical drugs.
