Natural Language Processing for Ethiopian Languages

Local Chapter Addis Ababa, Ethiopia Chapter

Coordinated by: Ethiopia

Status: Completed

Project Duration: 23 Mar 2023 - 30 Apr 2023

Open Source resources available from this project

Project background.

Ethiopia, the oldest independent country in Africa and the only one on the continent with its own alphabet, has a population of almost 120 million people. It is a land of enormous diversity, with more than 80 languages and over 200 dialects. Amharic, or Amharigna, is one of the working languages of the country, along with Oromigna and Tigrigna.

The rest of the world is rapidly adopting Machine Learning and AI to take advantage of available language data. Countries with low-resource languages, such as Ethiopia, have remained behind. It is time for them to catch up. Using current language technologies effectively can bring benefits in a variety of ways, such as increasing literacy, preserving legacy languages, enabling large-scale analysis, and improving efficiency. More data is available on the internet today than ever before, yet leveraging it to build useful projects remains a challenge.

The problem.

The current problem with Amharic language processing is that too few tools and resources are available for public use. Most research projects remain on university shelves. This project, the first in a series of NLP-related projects on local languages, aims to build and consolidate capacity in Amharic language processing by leveraging the latest available data.

Project goals.

The goal of the project is to build an end-to-end NLP project. Specifically, we will start by collecting and organizing data, continue by building tools for preprocessing, and conclude by building a classification model for Amharic news.
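To make the preprocessing step more concrete, here is a minimal sketch of what an Amharic text normalizer and tokenizer might look like. It is not the project's actual tooling: the function names (normalize, tokenize, bag_of_words) are hypothetical, and the homophone folding shown here (mapping the ሐ and ኀ series to ሀ, ሠ to ሰ, ዐ to አ, and ፀ to ጸ) is one commonly used convention that we assume for illustration. It uses only the Python standard library.

```python
import re
import string
from collections import Counter

# Assumed homophone folding: (source series start, target series start).
# Each series covers the seven basic orders of one Ethiopic consonant.
_HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]
_FOLD = {src + i: dst + i for src, dst in _HOMOPHONE_SERIES for i in range(7)}

# Ethiopic punctuation (U+1361 to U+1368) plus ASCII punctuation.
_PUNCT = re.compile("[\u1361-\u1368" + re.escape(string.punctuation) + "]")


def normalize(text: str) -> str:
    """Fold homophone characters, replace punctuation with spaces, collapse whitespace."""
    folded = text.translate(_FOLD)
    without_punct = _PUNCT.sub(" ", folded)
    return " ".join(without_punct.split())


def tokenize(text: str) -> list[str]:
    """Split normalized text into whitespace-separated tokens."""
    return normalize(text).split()


def bag_of_words(text: str) -> Counter:
    """Count token occurrences; a simple feature representation for a classifier."""
    return Counter(tokenize(text))
```

Token counts like these could serve as input features for a news classifier in the final stage; the choice of model is left open here.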

Project plan.

  • Week 1

    Project initiation, platform setup, and team formation

  • Week 2

    Data collection stage

  • Week 3

    Data collection and processing

  • Week 4


Learning outcomes.

  • Corpus preparation

  • End-to-end NLP project with a low-resource language (Amharic)

  • Working on a project
