In the Omdena-ACET challenge, we turned to online information sources, such as social media, newspapers, scientific articles, and websites of institutions involved with infrastructure. Each source provides an abundance of information that’s not possible to analyse manually. To efficiently explore such amounts of textual data, we used Natural Language Processing (NLP) and topic modelling.

Authors: Anna Koroleva, Ijeoma Ndu, code by Abhishek Singh 

 

A recent AI challenge organized by Omdena and ACET (African Center for Economic Transformation) aimed at predicting infrastructure needs in Africa. Infrastructure includes transport, energy, water, etc. There are 54 countries in Africa, each with its own needs.

Infrastructure-related parameters reported by governments cannot reflect all the needs of all the countries. How can we uncover such needs? One of the ways is to explore unstructured data, such as texts.

 

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that empowers computers to read and understand human languages. It aims to extract valuable information from texts to help decision-making. Most NLP techniques apply machine learning (ML) to derive meaning from texts.

NLP addresses many tasks, such as text classification (e.g. spam vs non-spam) or information extraction (e.g. finding dates, names of people, organisations, etc.). For the Omdena-ACET challenge, one of the most useful NLP applications was Topic Modelling — a text-mining technique that uses statistical methods to discover similarities between texts in a dataset.

Topic modelling is an unsupervised ML technique: it takes raw texts as input, without any labels assigned to them, and groups them into clusters according to their similarity. Each cluster corresponds to a “topic”. The model highlights the most prominent words in each cluster.

For example, if your cluster 1 contains words such as “transport”, “road”, “rail”, you can conclude that it corresponds to the topic “transport”.

 

Examples of words in topic modelling – Source: Omdena

 

In this blog, we will go through the main steps of topic modelling, as implemented in the Omdena-ACET challenge.

 

Pipeline implementation

The scheme below shows the main steps of our pipeline and the tools we used.

 

Implementation of a topic modelling pipeline – Source: Omdena

 

 

Data collection and Twitter

The main source of textual data in this challenge was Twitter. To obtain the data, we used Twint (https://github.com/twintproject/twint), a tool that allows you to scrape Twitter without a developer account. For uninterrupted large-scale scraping, we connected to ngrok servers (https://ngrok.com) using the Python colabcode package (https://pypi.org/project/colabcode/).

We scraped tweets mentioning Africa or one of its countries and one of a set of keywords related to infrastructure (e.g. “transport”, “road”), dating from 2011 to 2020.
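The selection rule described above (a place name plus an infrastructure keyword, within a date range) can be sketched with the standard library alone. This is only an illustration of the criterion, not the Twint query we ran: the word lists, the `is_relevant` helper, and the sample tweets are all hypothetical.

```python
import re

# Hypothetical, shortened lists; the project's actual lists are not given in this article
PLACES = ["africa", "nigeria", "kenya", "ghana"]
KEYWORDS = ["transport", "road", "water", "energy"]

def _any_word(words, text):
    """Return True if any of the words occurs as a whole word in the text."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def is_relevant(tweet, year):
    """Keep tweets from 2011-2020 that name a place and an infrastructure keyword."""
    return 2011 <= year <= 2020 and _any_word(PLACES, tweet) and _any_word(KEYWORDS, tweet)

# Made-up (text, year) pairs
samples = [
    ("New road opens in Kenya", 2015),
    ("Road trip across Europe", 2015),
    ("Water shortages in Ghana", 2009),
]
kept = [text for text, year in samples if is_relevant(text, year)]
print(kept)
```

Only the first sample passes: the second names no African place, and the third falls outside the date range.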

You can apply topic modelling to virtually any textual data. However, one thing to keep in mind is that the number of topics cannot be greater than the number of texts.

 

Text in topic modelling hierarchy – Source: Omdena

 

Data cleaning

Data cleaning is a vital pre-processing step in text analysis. The performance of NLP models can be hindered by noise present in the data. It is hence a good idea to look at the texts you are working with to find out how to clean the data.

In our project, we configured Twint to collect English-language tweets, but we still obtained some tweets in other languages, especially tweets that mix English with a local language. We used an additional tool, the langdetect library (https://pypi.org/project/langdetect/), to filter out other languages.

Another issue to consider when working with social media data is how to deal with emojis, mentions (@MisterX), hashtags (#Africa), email addresses, and URLs. The solution depends on your specific task. In our project, we used the Python re module to remove emojis, email addresses, URLs, and the hash symbol, so that, for instance, “#Africa” becomes “Africa”.

An example code for data cleaning is shown below:

from langdetect import detect
import re
emoji_pattern = re.compile("["
   u"\U0001F600-\U0001F64F" # emoticons
   u"\U0001F300-\U0001F5FF" # symbols & pictographs
   u"\U0001F680-\U0001F6FF" # transport & map symbols
   u"\U0001F1E0-\U0001F1FF" # flags (iOS)
   u"\U00002702-\U000027B0"
   u"\U000024C2-\U0001F251"
   u"\U00002500-\U00002BEF" # box drawing, shapes, misc symbols, arrows
   u"\U0001f921-\U0001f937"
   u"\U00010000-\U0010ffff"
   u"\u2640-\u2642"
   u"\u2600-\u2B55"
   u"\u200d"
   u"\u23cf"
   u"\u23e9"
   u"\u231a"
   u"\ufe0f" # variation selector-16 (emoji presentation)
   u"\u3030"
   "]+", flags=re.UNICODE)
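email_pattern = re.compile(r"\S+@\S+\.\S{2,3}")
link_pattern = re.compile(r"https?\S+")

def clean_data(tweet):
  """Return the cleaned tweet, or "" if it is not English or cannot be processed."""
  try:
    lang = detect(tweet)
    if lang == 'en':
      tweet_rep = emoji_pattern.sub('', tweet)
      tweet_rep = email_pattern.sub('', tweet_rep)
      tweet_rep = link_pattern.sub('', tweet_rep)
      tweet_rep = tweet_rep.replace("’", "'")  # normalise curly apostrophes
      tweet_rep = tweet_rep.replace("&amp;", "&")  # decode the HTML-escaped ampersand
      tweet_rep = tweet_rep.replace("#", '')  # "#Africa" -> "Africa"
      tweet_rep = tweet_rep.strip()
      return tweet_rep
    else:
      return ""
  except Exception:
    # langdetect raises an exception when it cannot detect a language (e.g. empty text)
    return ""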

We cleaned our dataset by applying this function to each tweet. In the subsequent code snippets, we assume that the cleaned dataset is saved in the “data” variable.
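As a minimal sketch of that step, the snippet below applies a cleaning function to a list of tweets and keeps the non-empty, non-duplicate results in `data`. To stay runnable with the standard library only, it uses a simplified stand-in (`simple_clean`) rather than the clean_data function above; the mention removal, the de-duplication, and the sample tweets are assumptions made for illustration.

```python
import re

# Simplified stand-in for clean_data: no language detection or emoji removal
LINK = re.compile(r"https?\S+")
MENTION = re.compile(r"@\w+")  # assumption: the article does not say mentions were removed

def simple_clean(tweet):
    text = LINK.sub("", tweet)
    text = MENTION.sub("", text)
    text = text.replace("#", "")
    return " ".join(text.split())  # collapse leftover whitespace

# Made-up tweets
raw_tweets = [
    "Power cuts again in #Lagos https://example.com/x",
    "@MisterX https://example.com/y",
    "Power cuts again in #Lagos https://example.com/z",
]

cleaned = [simple_clean(t) for t in raw_tweets]
# Drop tweets that became empty, then drop exact duplicates while keeping order
data = list(dict.fromkeys(t for t in cleaned if t))
print(data)
```

Dropping empty strings matters because the cleaning function returns "" for tweets it rejects, and empty documents add nothing to a topic model.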

 

 

Topic modelling

For topic modelling, we used a Latent Dirichlet Allocation (LDA) model, as implemented in two different tools that we used as alternatives to each other: gensim’s own LdaModel, and gensim’s LdaMallet wrapper around the Java Mallet package.

The code for building an LDA model with gensim is simple. However, the model does not take raw texts as input: several preprocessing steps are required, such as tokenization, lemmatization, stop word removal, and extraction of bigrams and trigrams. In the following sections, we explain these steps in more detail and provide code for performing them.

The code below specifies the import of Python libraries and packages required to perform topic modelling:

import nltk; nltk.download('stopwords')
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
# (spaCy 3 and later no longer accept the 'en' shortcut: download and load 'en_core_web_sm' instead)
nlp = spacy.load('en', disable=['parser', 'ner'])

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (newer pyLDAvis releases rename it to pyLDAvis.gensim_models)
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

 

Data preprocessing

A useful preprocessing step is removing stop words — uninformative words that do not contribute much to the meaning of the text, such as function words (e.g. articles: “a”, “the”; prepositions: “in”, “on”) and stop words specific for a given task. The example code below collects standard English stop words from NLTK and adds custom words from our dataset.

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

The next step is to divide the text into words (tokenization), which is required for further processing. There are a few tools to do this; we used gensim simple_preprocess:

def sent_to_words(sentences):
  for sentence in sentences:
    # deacc=True strips accent marks; simple_preprocess also lowercases and drops punctuation
    yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

data_words = list(sent_to_words(data))
print(data_words[:1])

You can expect to see an output like this:

 

Next, we build bigram and trigram models. Bigrams are sequences of two consecutive words, trigrams are sequences of three consecutive words. Using bigrams and trigrams allows us to identify relevant phrases, such as “power supply”, “gas pipeline”, etc.
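Before the gensim code, here is a rough, library-free illustration of how frequent word pairs can be told apart from chance neighbours. It counts adjacent pairs and scores them with the formula that, to our understanding, gensim's Phrases uses by default: (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). The toy corpus, the `phrase_score` helper, and the threshold value are made up for this sketch.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration)
docs = [
    ["power", "supply", "cut", "in", "city"],
    ["power", "supply", "restored"],
    ["gas", "pipeline", "and", "power", "supply"],
    ["new", "gas", "pipeline"],
]

# Count single words and adjacent word pairs
unigrams = Counter(w for doc in docs for w in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

def phrase_score(pair, min_count=1):
    """(count(a, b) - min_count) * vocabulary size / (count(a) * count(b))"""
    a, b = pair
    return (bigrams[pair] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])

threshold = 2.0  # hypothetical value; a higher threshold yields fewer phrases
phrases = sorted(p for p in bigrams if phrase_score(p) > threshold)
print(phrases)
```

Pairs that occur only once score zero here, so only “gas pipeline” and “power supply” pass the threshold. The min_count and threshold parameters in the gensim code play the same roles.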

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=3, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=10)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

The output of this code looks like this:

 

Next, we define a few functions that will be used to process our data before it is passed to an LDA model: remove stopwords, get bigrams and trigrams, lemmatize (i.e. obtain the base form of words: e.g. “services” -> “service”, “worked” -> “work”, etc.):

def remove_stopwords(texts):
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
def make_bigrams(texts):
  return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
  return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
  """https://spacy.io/api/annotation"""
  texts_out = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_out

Next, we apply these functions to our data:

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
# Form bigrams and trigrams
data_words_bigrams = make_trigrams(data_words_nostops)
# Perform lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

In the lemmatization step, we only keep words that belong to one of the following parts of speech: noun, adjective, verb, adverb. Hence, you can expect output that looks as follows:

 

 

Topic modelling does not accept words as input: the text needs to be converted into a numerical form. To do this, we first create a dictionary that maps each word to an integer ID:
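To make the numerical form concrete before the gensim code, the sketch below builds a word-to-ID mapping and a bag-of-words representation by hand on two toy documents. It only mimics the idea: the names `word2id`, `to_bow`, `toy_texts`, and `toy_corpus` are hypothetical, and gensim's Dictionary may assign different IDs to the same words.

```python
from collections import Counter

# Two toy documents, already tokenized (made up for illustration)
toy_texts = [["road", "repair", "road"], ["water", "road"]]

# Assign each distinct word an integer ID in order of first appearance
word2id = {}
for text in toy_texts:
    for word in text:
        word2id.setdefault(word, len(word2id))

def to_bow(text):
    """Return sorted (token_id, token_count) pairs for one document."""
    counts = Counter(word2id[w] for w in text)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in toy_texts]
print(word2id)
print(toy_corpus)
```

Each document becomes a short list of (ID, count) pairs; word order is discarded, which is why the format is called a bag of words.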

 

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

Next, we apply the doc2bow function to convert the texts into the bag-of-words (BoW) format, which is a list of (token_id, token_count) tuples:

# Create a corpus from the lemmatized text we want to analyse
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

The output of this step looks like this:

 

Now we have finally got to the point of building an LDA model to detect topics in our cleaned and preprocessed data.

LDAmodel

You can build an LDA model in gensim using the code below:

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
            id2word=id2word,
            num_topics=5,
            random_state=100,
            update_every=1,
            chunksize=50,
            passes=20,
            alpha='auto',
            per_word_topics=True)

Note that you have to specify the number of topics, i.e. the number of clusters into which the texts will be grouped. The output of the model significantly depends on this parameter, and you might need to experiment to find an optimal value for it.

You can now view the topics:

pprint(lda_model.print_topics())

The output would look similar to the following:

 

LDAMallet

Gensim’s LDAMallet is a wrapper to the Java Mallet package and an alternative to gensim’s own LdaModel. You can install it in Google Colab by running the following commands:

!curl http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip --output mallet-2.0.8.zip
!unzip mallet-2.0.8.zip

After that, you can build a model as follows:

# Note: gensim.models.wrappers was removed in gensim 4.0, so this requires gensim 3.x
mallet_path = '/content/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word)

# To show the topics
ldamallet.show_topics()

You will see an output as follows:

 

Choosing an optimal number of topics in LDAMallet using coherence score

You can experiment with the number of topics to see which value gives the most reasonable results. To help you make a justified choice, you can use a measure called coherence, which essentially reflects the level of semantic similarity between the high-scoring words in each topic. Selecting the number of topics with the highest coherence can help obtain better topic modelling results.
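To give an intuition for what a coherence score measures, here is a small standard-library sketch of a UMass-style coherence. Note that this is a different, simpler measure than the c_v coherence we actually used: it only looks at how often a topic's top words appear in the same documents. The function name, the toy documents, and the word lists are made up for illustration.

```python
from itertools import combinations
from math import log

def umass_coherence(top_words, docs):
    """Sum of log((D(earlier, later) + 1) / D(earlier)) over ordered pairs of top words,
    where D counts the documents that contain all the given words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

# Toy tokenized documents
docs = [
    ["road", "rail", "transport"],
    ["road", "transport", "bus"],
    ["water", "pipe", "supply"],
    ["water", "supply", "road"],
]

coherent = umass_coherence(["road", "transport"], docs)  # words that co-occur
mixed = umass_coherence(["road", "pipe"], docs)          # words that never co-occur
print(coherent > mixed)
```

Words that tend to appear together give a higher (less negative) score than words that never share a document, which is the behaviour one expects from a coherent topic.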

We define the function for calculating coherence scores as follows:

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
  """
  Compute c_v coherence for various number of topics

  Parameters:
  ----------
  dictionary : Gensim dictionary
  corpus : Gensim corpus
  texts : List of input texts
  limit : Max num of topics

  Returns:
  -------
  model_list : List of LDA topic models
  coherence_values : Coherence values corresponding to the LDA model with respective number of topics
  """
  coherence_values = []
  model_list = []
  for num_topics in range(start, limit, step):
    # Train one Mallet LDA model per candidate number of topics
    model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
    model_list.append(model)
    # Score the model with the c_v coherence measure
    coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_values.append(coherencemodel.get_coherence())
  return model_list, coherence_values

You can obtain coherence scores for the model using the following code:

model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=80, step=6)

You can plot a graph of the dependency between the number of topics and the coherence score as follows:

 

# Show graph
limit=80; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

An optimal number of topics can then be selected based on the coherence score graph.
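One simple selection rule is to take the number of topics with the highest score, as in the sketch below. The scores here are hypothetical placeholders (in practice they would come from compute_coherence_values), and the range is shortened to keep the example small.

```python
# Candidate numbers of topics, built the same way as the x axis of the graph
start, limit, step = 2, 20, 6
topic_counts = list(range(start, limit, step))  # [2, 8, 14]

# Hypothetical coherence scores, one per candidate
example_scores = [0.35, 0.42, 0.40]

# Pick the candidate with the highest coherence score
best_score, best_k = max(zip(example_scores, topic_counts))
print(f"Best number of topics: {best_k} (coherence {best_score})")
```

When several values score almost equally, it is usually worth inspecting the topics themselves rather than relying on the maximum alone.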

Visualization (word clouds)

Topic modelling returns the N most prominent words in every topic with their weights. It’s often convenient for further analysis to visualize the results using word clouds: pictures made up of words whose size reflects their prominence.
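The idea behind a word cloud can be sketched without any plotting library: count the words and scale a font size by frequency. This is a simplified illustration only. The wordcloud package uses its own sizing and layout logic, and the texts and font limits below are made up.

```python
from collections import Counter

# Made-up texts standing in for the tweets of one topic
toy_tweets = ["road traffic in lagos", "lagos airport road", "airport traffic lagos"]

# Count how often each word occurs
counts = Counter(word for tweet in toy_tweets for word in tweet.split())

# Scale a font size linearly with word frequency (hypothetical limits)
min_font, max_font = 10, 60
max_count = max(counts.values())
sizes = {w: min_font + (max_font - min_font) * c / max_count for w, c in counts.items()}

for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.1f}")
```

The most frequent word gets the largest size, which is the property that makes word clouds quick to read.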

We used the Python wordcloud package to create word clouds for each of the topics returned by our topic modelling. The visualisation code is below.

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
  # Collect one row per document: dominant topic, its contribution, and its keywords
  # (rows are gathered in a list because DataFrame.append was removed in pandas 2.0)
  rows = []
  for row in ldamodel[corpus]:
    # Sort the document's topics by contribution, largest first
    row = sorted(row, key=lambda x: x[1], reverse=True)
    if row:
      # The first entry is the dominant topic
      topic_num, prop_topic = row[0]
      wp = ldamodel.show_topic(topic_num)
      topic_keywords = ", ".join(word for word, prop in wp)
      rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
  sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
  # Add original text to the end of the output
  contents = pd.Series(texts)
  sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
  return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamallet, corpus=corpus, texts=data)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
# Show
df_dominant_topic.head(20)
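The core of this function is choosing, for each document, the topic with the largest share. Stripped of the model and the DataFrame, that step can be sketched as follows. The (topic ID, share) pairs are made up, in the shape that the Mallet model returns for each document.

```python
# Made-up topic distributions for two documents: lists of (topic_id, share) pairs
doc_topics = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.60), (1, 0.30), (2, 0.10)],
]

# For each document, keep the pair with the largest share
dominant = [max(row, key=lambda pair: pair[1]) for row in doc_topics]

for doc_no, (topic_id, share) in enumerate(dominant):
    print(f"document {doc_no}: dominant topic {topic_id} ({share:.0%})")
```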

The code below creates a word cloud for topic 1. To create word clouds for other topics, modify the topic ID in the expression df_dominant_topic['Dominant_Topic']==0.0 (use 1.0 for topic 2, 2.0 for topic 3, etc.).
from wordcloud import WordCloud, STOPWORDS

# List the available topic IDs
print(df_dominant_topic['Dominant_Topic'].unique())

# Select the documents whose dominant topic is topic 1 (ID 0.0)
topic1data = df_dominant_topic[df_dominant_topic['Dominant_Topic']==0.0]

# Collect the unique keywords of this topic
keyw_topic1 = []
for each in topic1data['Keywords']:
  for l in each.strip().split(","):
    keyw_topic1.append(l.strip())
keyw_topic1 = list(set(keyw_topic1))

# Concatenate the lowercased text of all documents in this topic
comment_words = ''
cloud_stopwords = set(STOPWORDS)  # named so as not to shadow the NLTK stopwords import
for val in topic1data.Text:
  # typecast each val to string, split it into tokens and lowercase them
  tokens = [token.lower() for token in str(val).split()]
  comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color ='white',
                      stopwords = cloud_stopwords,
                      min_font_size = 10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

 

Results and Interpretation

Topic modelling and word clouds are great tools for getting insights into your text data. However, the insights you get largely depend on two factors: your original data and your ability to interpret the results.

In the Omdena-ACET challenge, we divided the tweets into several subsets according to the keywords used for scraping. We worked mostly with the subsets related to education, transport, and water. Here we show a few example word clouds and discuss how these can be interpreted.

Looking at the word cloud for topic 1 (left) within the subset of tweets about education, we can see “disability curse”, “stigma”, “disability”, and “tackle stigma” among the prominent words. Looking at the word cloud for topic 2 (right), we see a few phrases with “girls” and “boys”, and the phrase “child marriage”.

These two word clouds show us some education-related problems discussed in the tweets: lack of inclusion for children with disabilities, and child marriage.

 

Topic modelling word clouds – Source: Omdena

For some other subsets, word clouds may not show problems as clearly as in the previous case. For instance, in the subset of tweets about transport in Nigeria, the word clouds for topic 1 (left) and topic 2 (right) do not highlight any explicit problems.

However, road and airport transport in Lagos, a prominent state and city in Nigeria, is highlighted in topic 1, while traffic around the international airport is highlighted in topic 2. These concepts are clearly important in the transport-related tweets, and we can further explore the data (using other NLP techniques that are not the subject of this article) to get a better understanding of the problems and needs related to transport.

 

Topic modelling word clouds – Source: Omdena

 

Conclusion

Topic modelling is a great NLP tool to explore your data and get some insights. We hope this article will help those who want to get started with topic modelling as it explains the process step-by-step.
