
NLP Data Preparation: From Regex to Word Cloud Packages and Data Visualization

March 11, 2021



“Regex” and “word clouds” for Natural Language Processing (NLP) data preparation? Yes! Regex, short for “regular expression”, is by no means an outdated technique for finding and extracting text data. It is still one of the basic techniques used in scraping and even in hot AI topics like NLP. Word clouds, too, are still used in the initial stage of almost all NLP projects.

Author: Rinki Nag

Regular expressions are extremely useful in extracting information from any text by searching for one or more matches of a specific pattern.

Fields of application range from validation to parsing and replacing strings, translating data to other formats, NLP, and web scraping.

Visualization is also one of the most important steps in any Machine Learning project, and we will cover it, together with some analyses, in this tutorial.

Here is a list of the topics covered and an interesting project, which we will build step by step.

Curious?

Let's start with the concepts and build the final project in the six steps listed below:

  • Scraping news articles on “AI” from Google news
  • Cleaning the data using regex and generating a word cloud on the clean text
  • Visualizing unigrams, bigrams, and trigrams on the text data
  • Parts of speech tagging
  • Named entity recognition
  • Sentiment analysis

1. Scraping news articles on “AI” from Google news

If we want to scrape articles from Google news, there are a few parameters that we can use to build a search query.

All Google search URLs start with https://www.google.com/search?

# Build a query
topic = "AI"
numResults = 1000
url = "https://www.google.com/search?q=" + topic + "&tbm=nws&hl=en&num=" + str(numResults)

q — the query topic, e.g., q=AI if you’re searching for Artificial Intelligence news

hl — the interface language, e.g., hl=en for English

tbm — to be matched; here we need tbm=nws to search for news items.

There’s a whole lot of other things one can match.

For instance, app for applications, blg for blogs, bks for books, isch for images, plcs for places, vid for videos, shop for shopping, and rcp for recipes.

num — controls the number of results shown. If you only want 10 results shown, num=10
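
With the values above, the assembled URL is simply the concatenation from the snippet earlier:

print(url)
# https://www.google.com/search?q=AI&tbm=nws&hl=en&num=1000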

Our goal is to scrape 1,000 articles, clean them, and visualize them using a word cloud.

Start scraping the articles using the URL we built above.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
Web scraping resources – Source: octaparse

2. Cleaning the data using regex & generating a word cloud

Now we have all the articles, but they are in HTML format, which we have to clean and transform into a format that we can analyze and visualize through a word cloud.

Before that, let's revise a few regex concepts. We know that regular expressions are useful to replace or remove characters.

1. Remove text inside brackets using regex

import re
re.sub(r'[\(\[].*?[\)\]]', ' ', text)

2. Remove extra spaces

re.sub(r"\s+", " ", doc)

3. Remove punctuation

re.sub(r"[^0-9A-Za-z ]", "", text)

4. Convert text to lower or upper case

text.lower()
text.upper()

5. Remove special characters using regex

pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
re.sub(pattern, '', text)

6. Remove stopwords: these include words such as I, he, she, and, but, was, were, being, have, etc., which do not add meaning to the data. After tokenizing the text, these words should be removed, which helps to reduce the number of features in our data.

import nltk
from nltk.corpus import stopwords

stopword_list = stopwords.words('english')
tokens = nltk.word_tokenize(text)
tokens = [token.strip() for token in tokens]
' '.join([token for token in tokens if token not in stopword_list])

Let’s start cleaning our downloaded data!

# we are finding all data under division tags with class ZINbbc
results = soup.find_all("div", attrs={"class": "ZINbbc"})
# we will take out all data from the description under class s3v9rd
descriptions = []
for result in results:
    try:
        description = result.find("div", attrs={"class": "s3v9rd"}).get_text()
        if description != "":
            descriptions.append(description)
    except:
        continue
# Join all text into one string to clean it together for the word cloud
text = "".join(descriptions)

When we print the variable text, it looks like this:

Google scraping output for the word “AI” – Source: Omdena, tutorials
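
Before building the word cloud, we need a cleaned version of this text. Below is a minimal sketch that combines the regex steps revised above into a single hypothetical helper called clean_text, producing the text_clean variable used in the next snippet:

import re
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
stopword_list = stopwords.words('english')

def clean_text(raw):
    # hypothetical helper combining the regex steps revised above
    out = raw.lower()
    out = re.sub(r'[\(\[].*?[\)\]]', ' ', out)   # drop bracketed text
    out = re.sub(r'[^a-z0-9\s]', ' ', out)       # drop punctuation / special characters
    out = re.sub(r'\s+', ' ', out).strip()       # collapse extra whitespace
    tokens = nltk.word_tokenize(out)
    return ' '.join(t for t in tokens if t not in stopword_list)

text_clean = clean_text(text)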

Now let's generate the word cloud and see the most frequent words in the AI articles.

# To generate the word cloud we can use this code
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text_clean)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This is a word cloud generated from the AI articles collected – Source: Omdena

Let’s try a different cleaning method to generate a word cloud on the same text!

Here we will keep only the nouns and convert the text to lowercase, as done before.

import nltk
nltk.download('averaged_perceptron_tagger')
# function to test if a POS tag is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# tokenize, POS-tag, and keep only the nouns in lowercase
tokenized = nltk.word_tokenize(text)
nouns = [word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
# Join them all
text_noun = " ".join(nouns)
# Then plot the word cloud
wordcloud = WordCloud(stopwords=STOPWORDS).generate(text_noun)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This is a word cloud generated from the AI articles collected using only nouns – Source: Omdena

But what if we add some more stop words and make this word cloud more meaningful?

# By adding some more stop words to the list
wordcloud = WordCloud(stopwords=set(list(STOPWORDS) + ['day', 'ai', 'ago', 'hour', 'hours', 'days'])).generate(text_noun)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
This is a word cloud generated from the AI articles collected and after custom filtering words – Source: Omdena

The word cloud is more meaningful now.

3. Visualizing the unigram, bigram, and trigram on the text data

Next, let's visualize the n-grams. We will extract N-gram features and look at their distribution. An N-gram describes the number of words treated as one observation point: a unigram is a single word, a bigram is a two-word phrase, and a trigram is a three-word phrase. This helps us understand which words occur together most often, clean the text further, and get a feel for the text distribution.

# Visualizing unigrams
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df2 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')

As we saw earlier in the word cloud, words such as “day” and “ai” occur most often. The most frequent words are the ones shown in bold (larger) in the word cloud.

Now we will try visualizing bigrams, i.e., the most frequent two-word combinations.

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df4 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')

The same applies to trigrams.

# Visualizing trigrams
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df4 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')

This analysis helps us understand which phrases occur most often, such as “days ago”, “ai”, and “12 hours ago”. These won't help us understand the data better or build a better model, so it's better to drop them; one way to do that is sketched below.
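
As a minimal sketch (the extra terms in the list are assumptions based on the noisy phrases seen above), we can drop them at the vectorizer level by extending scikit-learn's built-in English stop word list:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# extend the built-in English stop words with the noisy terms observed above
custom_stop_words = list(ENGLISH_STOP_WORDS) + ['ai', 'day', 'days', 'hour', 'hours', 'ago']
vec = CountVectorizer(ngram_range=(2, 2), stop_words=custom_stop_words)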

4. Applying parts of speech tagging and checking the tag distribution

Let us do parts of speech tagging and see which parts of speech occur most often in the text corpus.

from textblob import TextBlob

blob = TextBlob(str(descriptions))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_df = pos_df.pos.value_counts()[:20]
pos_df.plot(kind='bar')
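
For intuition, here is a small illustrative sketch (on a made-up sentence, not from our corpus) of what blob.tags returns: a list of (word, POS tag) pairs, which the code above turns into a DataFrame. The exact tags may vary with the tagger TextBlob uses.

sample = TextBlob("AI startups are raising new funding")
print(sample.tags)
# roughly: [('AI', 'NNP'), ('startups', 'NNS'), ('are', 'VBP'),
#           ('raising', 'VBG'), ('new', 'JJ'), ('funding', 'NN')]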

5. Using Named Entity Recognition

Named Entity Recognition (NER) is a standard NLP problem that involves spotting named entities (people, places, organizations, etc.) from a chunk of text, and classifying them into a predefined set of categories.

Name entity from splunk.com

When we use NER, our code can recognize which words refer to a person, an organization, and so on.

from nltk import pos_tag, word_tokenize

nltk.download('words')
nltk.download('maxent_ne_chunker')
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text_clean)))
print(ne_tree)

This outputs a tree that helps us understand the named-entity structure of our corpus.

Mostly, the output will be in the form of IOB tags (a short conversion sketch follows after the tag descriptions below).

What do these IOB tags mean?

B-NP: the beginning of a noun phrase

I-NP: the word is inside the current noun phrase.

O: the word is outside of any chunk (it is not part of any phrase or named entity).

B-VP and I-VP: the beginning and inside of a verb phrase.
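
If you prefer the flat IOB representation over the tree, here is a minimal sketch using NLTK's tree2conlltags helper on the ne_tree built above. Note that for ne_chunk the chunk labels are entity types such as PERSON or ORGANIZATION rather than NP/VP:

from nltk.chunk import tree2conlltags

# convert the NE tree into (word, POS tag, IOB tag) triples
iob_tags = tree2conlltags(ne_tree)
print(iob_tags[:10])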

6. Applying Sentiment Analysis

Sentiment analysis from translatemedia.com

Let us try to use TextBlob to get the sentiment from the descriptions of the AI articles.

from textblob import TextBlob

# compute sentiment scores (polarity) and labels
sentiment_scores_tb = [round(TextBlob(article).sentiment.polarity, 3) for article in descriptions]
sentiment_category_tb = ['positive' if score > 0
                         else 'negative' if score < 0
                         else 'neutral'
                         for score in sentiment_scores_tb]

# We don't have any labels to compare the performance of our model, but we can definitely check it ourselves.
# Let us save all of this in a data frame and check the model output

sentiment = pd.DataFrame()
sentiment['article'] = descriptions
sentiment['sentiments'] = sentiment_category_tb

We can see that our model performs fairly well; for example, the description at index 4 uses phrasing like “AI rising”, which reads as a threat-like sentiment.

We can also try lexicon-based models like AFINN (a minimal sketch follows below); such tools are useful when you don't have any training data but want to quickly try out a sentiment model.
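
Here is a minimal sketch, assuming the afinn package is installed (pip install afinn); its lexicon-based scores are independent of the TextBlob polarities above:

from afinn import Afinn

afinn = Afinn()
# AFINN returns an additive lexicon score: > 0 positive, < 0 negative, 0 neutral
afinn_scores = [afinn.score(article) for article in descriptions]
afinn_labels = ['positive' if s > 0 else 'negative' if s < 0 else 'neutral'
                for s in afinn_scores]
sentiment['afinn_sentiments'] = afinn_labels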

The code is available as a Jupyter Notebook here.
