Amharic is the official language of Ethiopia, spoken by over 30 million people. However, building chatbots or information retrieval systems for it is challenging because few datasets of Amharic text are available. This scarcity makes it difficult to use the large amount of Amharic content found on websites, in books, and in other media for applications such as chatbots and search engines.
As a result, Amharic is considered relatively low-resource in the natural language processing (NLP) community, hindering the development of NLP applications for the language.
This project aims to enhance the performance of Amharic question answering by expanding the current AmQA dataset and fine-tuning a more robust multilingual model on it. This will help improve the ability of NLP models to extract relevant information from Amharic text, enabling the development of accurate and efficient chatbots and search engines for the language.
Data Collection: The AmQA dataset currently contains 2,628 question-answer pairs drawn from 378 Wikipedia articles. It will be expanded to 8,000 or more question-answer pairs by collecting additional pairs from other sources, such as news articles, social media platforms, and other online resources.
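As a rough illustration of how newly collected pairs might be checked before they are merged into the dataset, the sketch below assumes a SQuAD-style record (context, question, answer text, character offset of the answer). The class `QAExample` and the helpers `is_valid` and `merge_examples` are hypothetical names introduced here, not part of AmQA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """One extractive QA record (assumed SQuAD-style layout)."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context


def is_valid(ex: QAExample) -> bool:
    """True if the answer is non-empty and sits at the stated offset."""
    if not ex.answer_text or ex.answer_start < 0:
        return False
    end = ex.answer_start + len(ex.answer_text)
    return ex.context[ex.answer_start:end] == ex.answer_text


def merge_examples(existing, new):
    """Append valid new examples, skipping repeated (context, question) pairs."""
    seen = {(ex.context, ex.question) for ex in existing}
    merged = list(existing)
    for ex in new:
        key = (ex.context, ex.question)
        if is_valid(ex) and key not in seen:
            seen.add(key)
            merged.append(ex)
    return merged
```

Checking offsets at collection time is a cheap way to catch annotation mistakes before they reach training; stricter deduplication (for example, on normalised text) may be needed for noisy sources such as social media.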
Modelling (Fine-tuning a Multilingual Model): A robust multilingual model will be fine-tuned on the expanded AmQA dataset using transfer learning, in which the pre-trained model’s parameters are updated to learn the specific features of the AmQA dataset.
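Fine-tuning an extractive QA model typically requires converting each answer's character span into start and end token positions, which serve as the training labels. The sketch below shows that step under simplifying assumptions: `whitespace_offsets` is a hypothetical stand-in for the offset mapping a subword tokenizer would supply, and `char_span_to_token_span` is an illustrative helper, not the project's actual preprocessing.

```python
import re


def whitespace_offsets(text):
    """(start, end) character offsets of whitespace-separated tokens.

    Stand-in for a real tokenizer's offset mapping (assumption).
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map a character span [answer_start, answer_end) to token indices.

    Returns (first_token, last_token), both inclusive, or None if no
    token overlaps the span (e.g. the answer was truncated away).
    """
    first = last = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_end <= answer_start or tok_start >= answer_end:
            continue  # token lies entirely outside the answer span
        if first is None:
            first = i
        last = i
    if first is None:
        return None
    return first, last
```

With a real multilingual tokenizer the offsets would refer to subword pieces rather than whole words, but the overlap logic stays the same.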
Evaluation and Deployment: The performance of the pipeline will be evaluated on the AmQA dataset using standard evaluation metrics such as F1 score, precision, and recall, among others. The results will be compared with state-of-the-art models to determine the effectiveness of the approach. To demonstrate the system, a simple web interface will be built.
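For extractive question answering, these metrics are commonly computed at the token level between the predicted and reference answers. The following is a minimal sketch under the assumption of plain whitespace tokenisation with no further normalisation; the function name `token_scores` is hypothetical, and a real evaluation would likely also normalise Amharic punctuation and spelling variants.

```python
from collections import Counter


def token_scores(prediction: str, reference: str) -> dict:
    """Token-level precision, recall, F1 and exact match for one answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    exact = float(pred_tokens == ref_tokens)

    # If either side is empty, scores are 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return {"precision": exact, "recall": exact, "f1": exact,
                "exact_match": exact}

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0,
                "exact_match": 0.0}

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1,
            "exact_match": exact}
```

Dataset-level scores would then be the average of these per-question values; when a question has several reference answers, taking the maximum over references is the usual convention.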