NLP Data Preparation: From Regex to Word Cloud Packages and Data Visualization
March 11, 2021
"Regex" and "Word Clouds" for Natural Language Processing (NLP) data preparation? Yes! Regex, short for "regular expression", is an old technique for finding and extracting text data, but it is still one of the basic tools used in scraping and even in hot AI topics like NLP. Word clouds, too, are still used in the initial stage of almost all NLP projects.
Author: Rinki Nag
Regular expressions are extremely useful in extracting information from any text by searching for one or more matches of a specific pattern.
Fields of applications range from validation to parsing/replacing strings, to translating data to other formats, NLP, and web scraping.
Visualization is also one of the very important steps of all Machine Learning projects and we will cover this and some analyses in the tutorial.
Here is a list of the topics covered and an interesting project, which we will build step by step.
Curious?
Let's start with the concepts and build the final project in the six steps listed below:
- Scraping news articles on “AI” from Google news
- Cleaning the data using regex and generating a word cloud on the clean text
- Visualization of the unigram, bigram, and trigram on the text data
- Parts of speech tagging
- Named entity recognition
- Sentiment analysis
1. Scraping news articles on “AI” from Google news
If we want to scrape articles from Google news, there are a few parameters that we can use to build a search query.
All Google search URLs start with https://www.google.com/search?
# Build a query
topic = "AI"
numResults = 1000
url = "https://www.google.com/search?q=" + topic + "&tbm=nws&hl=en&num=" + str(numResults)
q — this is the query topic, i.e., q=AI if you’re searching for Artificial Intelligence news
hl — the interface language, i.e., hl=en for English
tbm — to be matched, here we need tbm=nws to search for news items.
There’s a whole lot of other things one can match.
For instance, app for applications, blg for blogs, bks for books, isch for images, plcs for places, vid for videos, shop for shopping, and rcp for recipes.
num — controls the number of results shown. If you only want 10 results shown, num=10
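Putting these together, with topic = "AI" and numResults = 1000 the query built above resolves to the following URL (shown here only for illustration):

print(url)
# https://www.google.com/search?q=AI&tbm=nws&hl=en&num=1000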
Our goal is to scrape 1000 articles, then clean and visualize them using a word cloud.
Start scraping the articles using the URL we built above.
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
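One practical note: Google sometimes returns an empty or blocked page for the default requests user agent, so if the results come back empty you can try sending a browser-like User-Agent header. A minimal sketch (the header value is just an example):

# Optional: send a browser-like User-Agent header (example value, adjust as needed)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")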
2. Cleaning the data using regex & generating a word cloud
Now we have all the articles, but they are in HTML format, which we have to clean and transform into a format that we can analyze and visualize through a word cloud.
Before that, let's revise a few regex concepts. We know that regular expressions are useful for replacing or removing characters.
1. Remove text inside brackets or parentheses
re.sub(r'[\(\[].*?[\)\]]', ' ', text)
2. Remove extra spaces
re.sub(r"\s+", " ", doc)
3. Remove punctuation
re.sub(r"[^0-9A-Za-z ]", "", text)
4. Convert text to lower- or uppercase
text.lower()
text.upper()
5. Remove special characters using regex
pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
re.sub(pattern, '', text)
6. Remove stopwords. Stopwords such as "I", "he", "she", "and", "but", "was", "were", "being", and "have" do not add meaning to the data. So, after tokenizing the text, these words should be removed, which helps to reduce the number of features in our data.
import nltk
from nltk.corpus import stopwords
# nltk.download('punkt'); nltk.download('stopwords')  # run once if the corpora are missing

stopword_list = stopwords.words('english')
tokens = nltk.word_tokenize(text)
tokens = [token.strip() for token in tokens]
' '.join([token for token in tokens if token not in stopword_list])  # rejoin the remaining tokens
Let’s start cleaning our downloaded data!
# we are finding all data under division tags with class ZINbbc
results = soup.find_all("div", attrs={"class": "ZINbbc"})
# we will take out the description text from class s3v9rd
descriptions = []
for result in results:
    try:
        description = result.find("div", attrs={"class": "s3v9rd"}).get_text()
        if description != "":
            descriptions.append(description)
    except:
        continue
# Join all texts into one string so we can clean them together for the word cloud
text = "".join(descriptions)
When we print the variable text, we can see all the scraped descriptions joined into one raw string.
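The word cloud code below expects a cleaned string called text_clean, which is not defined in the snippets above. Here is a minimal sketch that chains the regex steps from step 2 into one helper (the name clean_text is an assumption):

import re

def clean_text(doc):
    # lowercase, drop bracketed text, keep only letters/digits/spaces, collapse whitespace
    doc = doc.lower()
    doc = re.sub(r'[\(\[].*?[\)\]]', ' ', doc)
    doc = re.sub(r'[^a-z0-9 ]', ' ', doc)
    doc = re.sub(r'\s+', ' ', doc).strip()
    return doc

text_clean = clean_text(text)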
Now let's generate the word cloud and see the most prominent words in the AI articles.
# To generate the word cloud we can use this code
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text_clean)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Let’s try a different cleaning method to generate a word cloud on the same text!
Here we will keep only the nouns and lowercase every token, as before.
import nltk
nltk.download('averaged_perceptron_tagger')

# function to test if a POS tag is a noun
is_noun = lambda pos: pos[:2] == 'NN'

# do the nlp stuff: tokenize, tag, and keep only lowercased nouns
tokenized = nltk.word_tokenize(text)
nouns = [word.lower() for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

# Joining them all
text_noun = " ".join(nouns)

# Then plot the word cloud
wordcloud = WordCloud(stopwords=STOPWORDS).generate(text_noun)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
But what if we add some more stop words and make this word cloud more meaningful?
# By adding some more stop words to the list
wordcloud = WordCloud(stopwords=set(list(STOPWORDS) + ['day', 'ai', 'ago', 'hour', 'hours', 'days'])).generate(text_noun)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
The word cloud is more meaningful now.
3. Visualizing the unigram, bigram, and trigram on the text data
Next, let's visualize the n-grams. We will extract n-gram features and look at their distribution. An n-gram describes the number of consecutive words treated as one observation: a unigram is a single word, a bigram is a two-word phrase, and a trigram is a three-word phrase. This helps us understand which words most often occur together, clean the text further, and get a feel for its distribution.
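As a quick illustration of the idea, here is a toy example on a made-up sentence using nltk.ngrams:

from nltk import ngrams

sample = "machine learning models need data".split()
print(list(ngrams(sample, 1)))  # unigrams: single tokens
print(list(ngrams(sample, 2)))  # bigrams: pairs of consecutive tokens
print(list(ngrams(sample, 3)))  # trigrams: triples of consecutive tokens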
# Visualizing unigrams
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df2 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')
As we saw earlier in the word cloud, "day", "ai", and similar terms are the most frequent words; the most frequent words are the ones rendered largest in the word cloud.
Now, we will visualize bigrams, i.e., the pairs of words that occur together most often.
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df4 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')
The same applies to trigrams.
# Visualizing trigrams
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(descriptions, 20)
for word, freq in common_words:
    print(word, freq)

df4 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).plot(kind='bar')
This analysis shows which phrases are repeated most often, such as "days ago", "ai", and "12 hours ago". Since they don't help us understand the data or build a better model, it is better to drop them.
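One simple way to drop them is to extend the stop-word list passed to CountVectorizer before recomputing the n-grams. A sketch, using the noisy terms observed above:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# extend sklearn's built-in English stop words with the noisy terms seen above
custom_stop_words = list(ENGLISH_STOP_WORDS) + ['ai', 'ago', 'hour', 'hours', 'day', 'days']
# the same list can be plugged into get_top_n_words / get_top_n_bigram in place of 'english'
vec = CountVectorizer(ngram_range=(2, 2), stop_words=custom_stop_words).fit(descriptions)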
4. Applying parts of speech tagging
Let us do parts of speech tagging and see which parts of speech occur most often in the text corpus.
from textblob import TextBlob

blob = TextBlob(str(descriptions))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_df = pos_df.pos.value_counts()[:20]
pos_df.plot(kind='bar')
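For reference, blob.tags is simply a list of (word, POS tag) tuples; a toy example (the exact tags depend on the tagger):

TextBlob("AI is transforming healthcare").tags
# e.g. [('AI', 'NNP'), ('is', 'VBZ'), ('transforming', 'VBG'), ('healthcare', 'NN')]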
5. Using Named Entity Recognition
Named Entity Recognition (NER) is a standard NLP problem that involves spotting named entities (people, places, organizations, etc.) from a chunk of text, and classifying them into a predefined set of categories.
With NER, our code can recognize which words refer to a person, an organization, and so on.
from nltk import pos_tag, word_tokenize

nltk.download('words')
nltk.download('maxent_ne_chunker')

ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text_clean)))
print(ne_tree)
This outputs a tree that shows the named-entity structure of our corpus.
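If you only want the entities and their labels rather than the full tree, you can walk its subtrees; a minimal sketch using nltk's Tree API:

# collect (entity text, entity label) pairs from the chunked tree
entities = []
for subtree in ne_tree:
    if hasattr(subtree, 'label'):  # named-entity chunks are subtrees; plain tokens are (word, pos) tuples
        entity = " ".join(token for token, pos in subtree.leaves())
        entities.append((entity, subtree.label()))
print(entities[:10])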
Mostly, the output will be in the form of IOB tags.
What do these IOB tags mean?
B-NP: the beginning of a noun phrase
I-NP: the word is inside the current noun phrase.
O: the word is outside of any phrase (it is not part of a chunk).
B-VP and I-VP: beginning and inside of a verb phrase.
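To see the IOB representation for our own corpus, nltk can flatten the chunk tree into (word, POS, IOB tag) triples; note that for ne_chunk output the tags carry entity labels such as B-PERSON or B-ORGANIZATION rather than B-NP:

from nltk.chunk import tree2conlltags

# convert the NE chunk tree into (word, pos, IOB-tag) triples
iob_tags = tree2conlltags(ne_tree)
print(iob_tags[:10])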
6. Applying Sentiment Analysis
Let us use TextBlob to get the sentiment of the AI article descriptions.
from textblob import TextBlob

# compute sentiment scores (polarity) and labels
sentiment_scores_tb = [round(TextBlob(article).sentiment.polarity, 3) for article in descriptions]
sentiment_category_tb = ['positive' if score > 0 else 'negative' if score < 0 else 'neutral' for score in sentiment_scores_tb]

# We don't have any labels to compare the performance of our model against, but we can definitely check it ourselves.
# Let us save everything in a data frame and inspect the model output
sentiment = pd.DataFrame()
sentiment['article'] = descriptions
sentiment['sentiments'] = sentiment_category_tb
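A quick way to inspect the output is to look at the label distribution and the first few rows:

print(sentiment['sentiments'].value_counts())
print(sentiment.head())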
We can see that our model performs fairly well; for example, the description at index 4 talks about "AI rising" as a kind of threat, and the model picks up the matching sentiment.
We can also try lexicon-based tools like AFINN; these are useful when you don't have any training data but want to quickly try out a sentiment model.
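For completeness, here is a minimal AFINN sketch using the afinn package (an assumption: pip install afinn is required):

# a minimal sketch with the afinn package (not used in the notebook above)
from afinn import Afinn

afinn = Afinn()
afinn_scores = [afinn.score(article) for article in descriptions]
afinn_categories = ['positive' if s > 0 else 'negative' if s < 0 else 'neutral' for s in afinn_scores]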
The code is available as a Jupyter Notebook here.