Building “Yarub” Library for Arabic NLP Purposes
August 12, 2021
Data is the secret ingredient that can make or break the recipe when it comes to machine learning models. In our project, ‘Building Open Source NLP Libraries & Tools for the Arabic Language,’ that ingredient was not exactly served on a golden platter, so we will take you on the journey of how we collected our data.
Because of the nature of the Arabic language, the complexity of its structure, and the presence of many dialects, this was not an easy task. For the first phase of the project we decided to develop our models using only Modern Standard Arabic (MSA), and here the first obstacle to data collection appeared: the lack of readily available pure MSA data.
Our data collection journey started without a clear picture of what exactly we were trying to reach. Still, thanks to Omdena’s successful bottom-up strategy, the puzzle pieces gradually revealed the whole picture. In the following lines, we concentrate on the experience we gained while collecting the data.
Collecting Modern Standard Arabic data
Training data is the data used to fit an algorithm or machine learning model so that it can predict the outcomes our model is designed for.
Test data is held out and used to measure the performance, such as accuracy or efficiency, of the trained model.
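As a minimal illustration of this split (not part of the Yarub code itself, and using hypothetical placeholder sentences and labels), scikit-learn’s train_test_split can hold out a test portion of a labeled dataset:

from sklearn.model_selection import train_test_split

# Hypothetical labeled sentences (text, sentiment) used only to illustrate the split
texts = ["جملة عربية فصيحة", "جملة أخرى", "نص ثالث", "نص رابع"]
labels = ["pos", "neg", "pos", "neg"]

# Hold out 25% of the data as test data; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
print(len(X_train), "training samples,", len(X_test), "test samples")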
We aimed to collect MSA datasets specified for the various models of our Arabic NLP toolkit, which are:
- Sentiment analysis
- Morphological modeling
- Named Entity Recognition (NER)
- Dialect identification
- Word embeddings
- Lemmatization
- Part-of-speech (POS) tagging
Our approach to building an Arabic NLP library
- Search for available suitable datasets.
- Scrape MSA text from various sources.
- Prepare the scraped data to be suitable for various models.
Using open-source NLP datasets
You can find the Yarub training datasets here.
Pros
Existing open-source datasets are easy to use and can be extended with our newly collected data.
Challenges
The existing datasets use different labeling schemes and mix Classical Arabic with Modern Standard Arabic, so we needed to separate the two and apply pre-processing tasks with validation.
Web scraping and data acquisition
Before going in depth into our data scraping work, we need to point out that a crucial part of web scraping is doing it ethically: we should never scrape a site whose owner does not permit crawling. You can easily check this by appending a slash and robots.txt to the URL of the website you want to crawl; if crawling is allowed, you will get something like this:
User-agent: *
Allow: /
For more details about that, you can review Google Search Central about robots.txt files documentation.
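As a minimal sketch (not part of our pipeline), Python’s standard urllib.robotparser module can perform the same check programmatically; the URLs below are only an example:

from urllib.robotparser import RobotFileParser

# Check whether a generic crawler ("*") may fetch a given page
rp = RobotFileParser("https://www.noor-book.com/robots.txt")
rp.read()

allowed = rp.can_fetch("*", "https://www.noor-book.com/book-quotes")
print("Scraping allowed:", allowed)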
Using this method has pros and cons:
Pros
As we aimed only to use MSA, scraping data will give us some control of the content of our datasets by choosing the sources.
Challenges
Data annotation and labeling are extremely time-consuming and require a lot of collaborators to be achieved.
Scraping news from newspaper websites
We used a Python package designed for scraping news articles called newspaper; for Python 3 it is published as newspaper3k, which you can install using the following command:
pip3 install newspaper3k
Then follow these directions as provided by the documentation page:
import newspaper

news_paper = newspaper.build('Here the newspaper url')  # ~15 seconds

for article in news_paper.articles:
    print(article.url)  # filters to only valid news urls

print(news_paper.size())  # number of articles
print(news_paper.category_urls())
print(news_paper.feed_urls())
# ^ categories and feeds are cached for a day (adjustable)
# ^ searches entire newspaper sitemap to find the feeds, not just homepage

# build articles, then download, parse, and perform NLP
for article in news_paper.articles[:5]:
    article.download()  # takes a while if you're downloading 1K+ articles

print(news_paper.articles[0].html)

# parse an article for its text, authors, etc.
first_article = news_paper.articles[0]
first_article.parse()
print(first_article.text)
print(first_article.top_img)
print(first_article.authors)
print(first_article.title)
Scraping Arabic books quotation website
We used noor-book.com, which has a section that allows readers to write quotes from books they read. The site contains almost 80,000 quotes.
We used the Selenium and BeautifulSoup libraries to scrape this site. The main issue we faced is that the site uses infinite scrolling, meaning more quotes only appear when you scroll the page down.
Scraping a page with infinite scrolling can be very challenging; after exploring many approaches, we explain in the following lines the code we used to scrape the site:
First, we import the required libraries:
- os: the miscellaneous operating system interface; we use os.environ, a mapping object representing the string environment, to point to the WebDriver binary.
- webdriver from the Selenium library: before using it, you need to add the folder containing WebDriver’s binaries to your system’s path, with the help of the Selenium documentation here.
- time: used to apply a sleep function, giving the server the time it needs to serve the requests without being overloaded.
- BeautifulSoup: used to parse the page; using it requires some knowledge of the structure of a web page and of HTML tags, since we need to define where the parts we want to scrape are located.
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd

# Path to the ChromeDriver binary (adjust to your own installation)
chromedriver = "/home/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

driver = webdriver.Chrome(chromedriver)
driver.get("https://www.noor-book.com/book-quotes")

# Get the screen height of the web page
screen_height = driver.execute_script("return window.screen.height;")

# Set to the number of scrolls needed to load all the content wanted from the site
ScrollNumber = 100

for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, 100000)")
    time.sleep(0.5)

quotes = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for a in soup.find_all('div', attrs={'class': 'quote-content-child'}):
    quote = a.find('span', attrs={'class': 'more'})
    if quote is not None:
        quotes.append(quote.text)

driver.close()
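Since pandas is already imported above, the collected quotes can then be written to a file for later cleaning and labeling; this is just a sketch, and the output filename is our own choice:

# Save the scraped quotes to a CSV file for later cleaning and labeling
quotes_df = pd.DataFrame({"quote": quotes})
quotes_df.to_csv("noor_book_quotes.csv", index=False, encoding="utf-8-sig")
print(f"Saved {len(quotes_df)} quotes")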
Scraping tweets from Twitter
The idea of using scraped tweets came from targeting accounts that mainly use MSA in their tweets, such as:
- Official authority’s accounts.
- Politicians.
- Newspaper accounts.
Here we used Tweepy to query Twitter’s API. To use it, you must have a Twitter developer account, and for that you have to:
First, have a Twitter account.
Second, follow the steps provided here to apply for one; you will be guided through the process and asked to describe in your own words what you are building.
After you get your account, you can use the following code:
import tweepy
from tweepy import OAuthHandler

# These are hidden to comply with Twitter's API terms and conditions
consumer_key = '-----------'
consumer_secret = '---------'
access_token = '----------'
access_secret = '--------'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)
You can fetch the tweets of whatever account you want, but only up to 3,200 of its latest tweets, and you cannot scrape more than 18,000 tweets per 15-minute window.
We first manually reviewed the selected accounts to make sure they use MSA in their tweets exclusively, or at least mostly, since it is impossible to be 100 percent sure. After determining which account you want to get tweets from, you can follow the Tweepy documentation to start scraping.
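As a hedged sketch of that step (the account name and output handling are our own illustrative choices, not taken from the project code), Tweepy’s Cursor can page through a user’s timeline up to the 3,200-tweet limit:

# Page through the latest tweets of one account (up to Twitter's ~3,200-tweet limit).
# "TwitterDev" is only a placeholder account name for illustration.
msa_tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name="TwitterDev",
                            tweet_mode="extended").items(3200):
    msa_tweets.append(status.full_text)

print(f"Collected {len(msa_tweets)} tweets")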
Data cleaning and processing
Most of the scraped data contained features undesirable for training the various NLP models, so some data cleaning became necessary, such as removing emojis, slashes, dashes, digits, and, in our case, Latin letters. For that, we used:
re, Python’s regular expression operations module, as shown in its documentation.
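A minimal cleaning sketch using re (the exact patterns below are illustrative assumptions, not the project’s final cleaning rules):

import re

def clean_text(text):
    # Keep only Arabic letters and whitespace: drops emojis, Latin letters,
    # digits, slashes, dashes, and other symbols
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    # Collapse the repeated whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("مرحبا 123 hello 😀 /نص-تجريبي/"))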
Using Doccano for data labeling
We tried using the Doccano tool for labeling the web-scraped datasets, but it was not accurate enough for our needs, as we faced problems with labeling consistency.
After this eventful process, we successfully obtained a scraped MSA dataset, labeled according to the requirements.
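For context, Doccano exports sequence-labeling annotations as JSONL; a hedged sketch of turning such an export into the spaCy-style training format used for the NER dataset might look like this (the field names follow common Doccano exports and may differ between versions):

import json

def doccano_to_spacy(jsonl_path):
    """Convert a Doccano JSONL export into spaCy-style (text, {"entities": [...]}) tuples."""
    training_data = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Older Doccano versions export "labels", newer ones "label"
            spans = record.get("labels") or record.get("label") or []
            entities = [(start, end, tag) for start, end, tag in spans]
            training_data.append((record["text"], {"entities": entities}))
    return training_data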
PyPI Yarub Library
We developed a Python module consisting of several functions for the Yarub library we implemented. Now we can download, extract, and load the training datasets using the Yarub library.
import io
import os
import struct
import zipfile
import requests


def load_sentiment():
    if not os.path.exists("Sentiment_Analysis/"):
        os.mkdir("Sentiment_Analysis/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/Omdena-seniment-analysis-Datasets.zip"
    local_filename = "Sentiment_Analysis/Omdena-seniment-analysis-Datasets.zip"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Extracting")
    z = zipfile.ZipFile(local_filename)
    z.extractall("Sentiment_Analysis/")
    print("[INFO] Done")
    # os.remove("Sentiment_Analysis/Omdena-seniment-analysis-Datasets.zip")


def load_ner():
    if not os.path.exists("Entity_Recognition/"):
        os.mkdir("Entity_Recognition/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/NER_data_spacy.json"
    local_filename = "Entity_Recognition/NER_data_spacy.json"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Done")


def load_dialect():
    if not os.path.exists("dialect/"):
        os.mkdir("dialect/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/Final_Dialect_Dataset.zip"
    local_filename = r"dialect/Final_Dialect_Dataset.zip"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Extracting")
    z = zipfile.ZipFile(local_filename)
    z.extractall("dialect/")
    # os.remove("dialect/Final_Dialect_Dataset.zip")
    print("[INFO] Done")


def load_word_embedding():
    if not os.path.exists("Word_Embedding/"):
        os.mkdir("Word_Embedding/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/Word%20Embedding.zip"
    local_filename = r"Word_Embedding/Word Embedding.zip"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Extracting")
    z = zipfile.ZipFile(local_filename)
    z.extractall("Word_Embedding/")
    # os.remove("Word_Embedding/Word Embedding.zip")
    print("[INFO] Done")


def load_pos():
    if not os.path.exists("Parts_of_speech/"):
        os.mkdir("Parts_of_speech/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/Final_Pos.zip"
    local_filename = r"Parts_of_speech/Final_Pos.zip"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Extracting")
    z = zipfile.ZipFile(local_filename)
    z.extractall("Parts_of_speech/")
    # os.remove("Parts_of_speech/pos_data.zip")
    print("[INFO] Done")


def load_morphology():
    if not os.path.exists("Morphology/"):
        os.mkdir("Morphology/")
    print("[INFO] Downloading")
    url = r"https://github.com/messi313/Omdena-Dataset/raw/main/final_morpho_data.zip"
    local_filename = r"Morphology/final_morpho_data.zip"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    print("[INFO] Extracting")
    z = zipfile.ZipFile(local_filename)
    z.extractall("Morphology/")
    # os.remove("Morphology/final_morpho_data.zip")
    print("[INFO] Done")
Python packages are most often hosted at the Python Package Index (PyPI), historically known as the Cheese Shop. At PyPI, you can find everything from Hello World to advanced deep learning libraries.
Here you can find out about our PyPI Yarub Library.
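A quick usage sketch, assuming the package is installed from PyPI as yarub and exposes the loader functions shown above at the top level (check the package page for the exact import path):

# pip install yarub
import yarub

# Download and extract the sentiment analysis training dataset into ./Sentiment_Analysis/
yarub.load_sentiment()

# The other datasets are fetched the same way, e.g.:
yarub.load_ner()
yarub.load_dialect()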
Conclusion
We want to point out that the most important step was to learn the specifications of the data required by each task in the project before applying the techniques mentioned above. That was only possible through communication and careful listening to the members of the other tasks, and by repeatedly going back to them to make sure we were on the right track.
Also, the success of our part of the project is not the end of the road, as we are going to develop further functionality related to our training datasets. We will add an Arabic image training dataset for computer vision challenges and research topics.
In the end, enjoy this video that will take you on a short journey through our project.