Project Duration: 06 Sep 2023 - 09 Oct 2023
Amharic is the official language of Ethiopia, spoken by over 30 million people. However, building chatbots or information retrieval systems for the language is challenging because few Amharic text datasets are available. This scarcity makes it difficult to use the large amount of Amharic content on websites, in books, and in other media for applications such as chatbots and search engines.
As a result, Amharic is considered relatively low-resource in the natural language processing (NLP) community, hindering the development of NLP applications for the language.
Data Collection: The AmQA dataset currently contains 2,628 question-answer pairs over 378 Wikipedia articles. It will be expanded to 8,000 or more data points by collecting additional question-answer pairs from sources such as news articles, social media platforms, and other online resources.
Data Preprocessing: The collected data will be cleaned and preprocessed to make it suitable for ML tasks.
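The cleaning step could look roughly like the sketch below. It is a minimal illustration only: the project description does not specify the actual preprocessing steps, so the function names `clean_text` and `preprocess_pairs`, and the choice of steps (HTML entity and tag removal, Unicode NFC normalization, whitespace collapsing, dropping empty and duplicate pairs), are assumptions.

```python
import html
import re
import unicodedata

# Hypothetical helpers; the steps chosen here are assumptions, not the project's documented pipeline.
TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher for scraped text
WS_RE = re.compile(r"\s+")        # any run of Unicode whitespace


def clean_text(text: str) -> str:
    """Normalize one scraped string (question, answer, or context)."""
    text = html.unescape(text)                 # decode entities such as &nbsp; and &amp;
    text = TAG_RE.sub(" ", text)               # drop leftover HTML tags
    text = unicodedata.normalize("NFC", text)  # canonical Unicode composition
    text = WS_RE.sub(" ", text)                # collapse whitespace runs to one space
    return text.strip()


def preprocess_pairs(pairs):
    """Clean (question, answer) pairs, dropping empty and duplicate entries."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        q, a = clean_text(question), clean_text(answer)
        if not q or not a:
            continue  # skip pairs that are empty after cleaning
        if (q, a) in seen:
            continue  # skip exact duplicates
        seen.add((q, a))
        cleaned.append((q, a))
    return cleaned
```

For example, `clean_text("<p>ሰላም   ዓለም</p>&nbsp;")` yields `"ሰላም ዓለም"`. Data from social media would likely need further source-specific filtering, which is not shown here.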
(Modelling) Fine-tuning a Multilingual Model: A pre-trained multilingual model will be fine-tuned on the expanded AmQA dataset. Fine-tuning will use transfer learning, in which the pre-trained model’s parameters are updated to learn the specific features of the AmQA dataset.
Evaluation and Deployment: The performance of the pipeline will be evaluated on the AmQA dataset using standard metrics such as F1 score, precision, and recall. The results will be compared to state-of-the-art models to determine the effectiveness of the approach. To demonstrate the performance, a simple web interface will be built.
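For question answering, precision, recall, and F1 are commonly computed over the tokens shared between a predicted answer and a reference answer. The sketch below assumes that token-level, whitespace-tokenized variant; the project description does not say which variant is used, and the function name `token_prf` is hypothetical.

```python
from collections import Counter


def token_prf(prediction: str, reference: str):
    """Return (precision, recall, f1) over whitespace tokens.

    Assumption: token-overlap scoring, as commonly used for extractive QA.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        match = float(pred_tokens == ref_tokens)
        return match, match, match
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With the prediction "አዲስ አበባ ከተማ" and the reference "አዲስ አበባ", two of three predicted tokens match, giving precision 2/3, recall 1.0, and F1 0.8. Dataset-level scores would then be averages of these per-example values.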
NLP, model fine-tuning, data collection and cleaning, collaboration, leadership