Using AI to Translate Data Science Content into Arabic

Local Chapter: Giza, Egypt Chapter

Status: Completed

Project Duration: 08 Nov 2021 - 14 Dec 2021

Open Source resources available from this project

Project background.

Data science has made tremendous progress in the last few years, which makes it hard for translation efforts to keep up. While learning in a second language is possible, it isn’t as effective as learning in your native language.

The problem.

Use deep learning to help humans translate more data science content.

The results should help begin an effort to translate more data science content into Arabic, helping students understand complex topics faster and more easily. The translation model can be continuously improved, gradually decreasing the effort humans need to edit the machine-translated articles and allowing more content to be available in Arabic.

Project goals.

- Collect data about available Arabic resources explaining data science. Decide on one or a few under-represented topics in data science to work on translating into Arabic.
- Apply Neural Machine Translation to translate data science blogs, articles, and lecture notes in the chosen under-represented topics from English to Arabic.
- Collect parallel corpora consisting of text content in the chosen field that has been translated from English to Arabic by an expert human. Use these corpora to further improve the model's performance (e.g. by fine-tuning a pre-trained model).
- Create a website to host the translated articles.

(Ideally, the website would have Wikipedia-like features that let users improve machine-translated articles; the improved articles could then be used as input for re-training the Neural Machine Translation model.)
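As an illustration of the parallel-corpus goal above, the sketch below shows one possible way to clean collected English-Arabic sentence pairs before using them for fine-tuning. The function name `clean_pairs`, the `max_ratio` threshold, and the sample sentences are assumptions made for this example; they are not taken from the project itself.

```python
def clean_pairs(pairs, max_ratio=3.0):
    """Filter (english, arabic) sentence pairs before fine-tuning.

    Drops pairs with an empty side, exact duplicates, and pairs whose
    word counts differ by more than max_ratio (a common sign of a
    misaligned pair). The threshold is a hypothetical default.
    """
    seen = set()
    kept = []
    for en, ar in pairs:
        en, ar = en.strip(), ar.strip()
        if not en or not ar:
            continue  # one side is missing
        n_en, n_ar = len(en.split()), len(ar.split())
        if max(n_en, n_ar) / min(n_en, n_ar) > max_ratio:
            continue  # lengths too different to be a plausible translation
        if (en, ar) in seen:
            continue  # exact duplicate
        seen.add((en, ar))
        kept.append((en, ar))
    return kept


# Hypothetical sample data for illustration only.
raw_pairs = [
    ("Deep learning", "التعلم العميق"),
    ("Deep learning", "التعلم العميق"),
    ("", "نص"),
    ("A gradient is a vector of partial derivatives", "تدرج"),
]
cleaned = clean_pairs(raw_pairs)
```

A word-count ratio is only a rough alignment check; English and Arabic differ in how many words a sentence needs, so the threshold would have to be tuned on real data.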

Project plan.

  • Week 1

    – Collecting data about Arabic data science content.
    – Researching Neural Machine Translation and selecting a model architecture.
    – Exploring keyword extraction for technical terms.

  • Week 2

    – Choosing a field in data science that’s underrepresented in Arabic.
    – Applying the chosen Neural Machine Translation model to the chosen field.
    – Collecting parallel corpora in English and Arabic for the chosen field.

  • Week 3

    – Fine-tuning the model using collected parallel corpora.
    – Trying keyword extraction for improved translation.
    – Researching different options for hosting and editing the translated articles.

  • Week 4

    – Comparing model performance to alternatives and publishing the results.
    – Integrating and documenting the system.
    – Deploying the articles on a web app.
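The Week 4 comparison step could be sketched as follows. The source does not say which metric the project used, so this example assumes a simplified BLEU-style score (clipped n-gram precision up to bigrams, with a brevity penalty) computed against a human reference translation. The function `bleu_like`, the system names, and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(reference, hypothesis, max_n=2):
    """Simplified sentence-level BLEU-style score in [0, 1].

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    multiplied by a brevity penalty. Returns 0.0 if any precision is zero.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram's count by how often it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Penalize hypotheses that are shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical reference and system outputs for illustration only.
reference = "التعلم العميق فرع من تعلم الآلة"
candidates = {
    "system_a": "التعلم العميق فرع من تعلم الآلة",
    "system_b": "التعلم العميق جزء من تعلم الآلة",
}
scores = {name: bleu_like(reference, text) for name, text in candidates.items()}
best_system = max(scores, key=scores.get)
```

For a published comparison, an established implementation and a full test set would be preferable; a single-sentence score like this one is only meant to show the idea of ranking systems against a reference.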

Learning outcomes.

1. Quantifying the state of Arabic content in data science and the fields that are still lacking content.

2. Learning, using, and improving Neural Machine Translation models for domain-specific data.

3. Learning how to deploy the results to a website for everyone to benefit.
