
Hot Topic Detection and Tracking on Social Media during AFCON 2021 using Topic Modeling Techniques

June 9, 2022



Introduction

The TotalEnergies 2021 Africa Cup of Nations (CAN TotalEnergies 2021) took the continent by storm for 29 days. For some pundits and commentators, it was the most captivating edition in the past 20 years, especially in terms of tactics, passion, surprises, and the level of engagement in the media and on social media.


One of the trending hashtags, #AFCON2021, generated significant interaction on social media, spreading ideas, topics, and discussions within echo chambers. Echo chambers are online clusters or communities that share and amplify similar opinions.

A heavy flow of mixed narratives, ranging from technical sports analysis to the worldwide surge of the Omicron variant of COVID-19, flooded local and international media.

This edition of AFCON was characterized by a high number of complex and opposing narratives. We ran a challenge to make sense of this intense engagement in the media and on social media by identifying its fundamental underpinnings.

This article outlines the steps that we took in the challenge to prepare several media and social media datasets for training machine learning models to detect hot topics.

Hot Topic Detection on AFCON 2021

Problem statement 

This challenge aimed to identify and characterize the hot topics that polarized activity on media and social media based on their vitality, popularity, and sentiment, using various NLP and graph-based approaches. In the optimal case, these hot topics are not only detected but their appearance is also tracked over time.

Data Collection

We collected the data using pygooglenews, Newspaper3k, snscrape, Insta scrape, Twint, Twarc, and Facebook scraping techniques. Each of these tools or techniques uses a different method to extract microblog information from media and social media (Twitter, Facebook) platforms. In total, we extracted more than 100k posts, covering January 09, 2022, to April 09, 2022 for social media data and March 29, 2021, to April 09, 2022 for media data. The social media analysis therefore dealt with posts published during and after the competition, and the media analysis with articles published before, during, and after it.

Data Cleaning

The cleaning phase consisted of four main steps. First, we used regex to remove irrelevant content: inexpressive emojis, null posts, and “ambulance chasers” (advertisements that exploit the popularity of a trending hashtag to gain visibility). Second, we removed duplicates. Third, we filtered by language: to keep the Natural Language Processing models simple, we worked only with French and English posts, using the langdetect Python package to detect the language of each post.

Install langdetect
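As a rough illustration of the first three cleaning steps (not the exact code used in the challenge), the sketch below filters irrelevant posts with regex, drops duplicates, and keeps only French and English posts. The regex patterns, the function names, and the idea of passing the language detector as an argument are assumptions made for this example; only the use of regex and langdetect comes from the description above.

```python
import re

# Hypothetical patterns: the article does not list the regexes actually used.
URL_RE = re.compile(r"https?://\S+")
AD_RE = re.compile(r"\b(promo code|buy now|bet now|discount)\b", re.IGNORECASE)


def is_relevant(post):
    """Reject null/empty posts and obvious hashtag-hijacking adverts."""
    if post is None:
        return False
    text = URL_RE.sub("", post).strip()
    return bool(text) and not AD_RE.search(text)


def drop_duplicates(posts):
    """Keep the first occurrence of each post, ignoring case and spacing."""
    seen, unique = set(), []
    for post in posts:
        key = " ".join(post.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def default_detector(text):
    """Language code from langdetect, or None if detection fails."""
    from langdetect import detect  # third-party: pip install langdetect
    from langdetect.lang_detect_exception import LangDetectException
    try:
        return detect(text)
    except LangDetectException:
        return None


def clean_posts(posts, detector=default_detector, languages=("en", "fr")):
    """Relevance filter, then de-duplication, then language filter."""
    relevant = [p for p in posts if is_relevant(p)]
    return [p for p in drop_duplicates(relevant) if detector(p) in languages]
```

Calling clean_posts(posts) with the default detector requires langdetect to be installed. Passing the detector in as a parameter is only a convenience for trying the pipeline without the dependency.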

In the final step, we used the cleantext Python package on both French and English posts to remove extra spaces (double spaces between words, leading/trailing spaces), convert text to lowercase, and apply stemming. Stemming reduces inflected forms of a word to a common stem; for example, run, runs, and running all become run.

Install cleantext
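The final cleaning step can be sketched as follows. This is an illustration under assumptions, not the challenge code: instead of reproducing the cleantext call, it normalizes whitespace and case with the standard library and swaps in NLTK's SnowballStemmer (which supports both English and French) for stemming. The function names are hypothetical.

```python
import re


def normalize(text):
    """Collapse repeated whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def make_stemmer(language):
    """Return a word -> stem function for 'english' or 'french'."""
    from nltk.stem.snowball import SnowballStemmer  # third-party: pip install nltk
    return SnowballStemmer(language).stem


def preprocess(text, stem=None):
    """Normalize a post, then stem each whitespace-separated token."""
    tokens = normalize(text).split()
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)
```

For example, preprocess(post, stem=make_stemmer("english")) would be applied to English posts and make_stemmer("french") to French ones, which is why the language filter has to run before this step.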

Modeling

First Approach: Dependency parsing and CorEx SemiSupervised Topic Modeling

In this first approach, we ran dependency parsing using spaCy to explore the relationships between tokens. For example, in the sentence “Buhari passed bad luck to Super Eagles through phone call”, “bad” is a modifier of the noun “luck”. The aim was to identify nouns or other words modified by the adjective “bad”, to find out what the author of the media/social media post had perceived as “bad”. We found 56 unique words modified by “bad”.  

dependency parsing using spaCy
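A minimal sketch of this kind of query is shown below. It assumes the relation of interest is spaCy's "amod" (adjectival modifier) label and that the en_core_web_sm model is installed; the article does not say which model or which dependency labels were used, so both are assumptions, as are the function names.

```python
from collections import Counter


def words_modified_by(doc, adjective="bad"):
    """Count the head words that `adjective` modifies via the 'amod' relation."""
    counts = Counter()
    for token in doc:
        if token.dep_ == "amod" and token.text.lower() == adjective:
            counts[token.head.text.lower()] += 1
    return counts


def parse(texts, model_name="en_core_web_sm"):
    """Yield spaCy Doc objects; assumes the model has already been downloaded."""
    import spacy  # third-party: pip install spacy
    nlp = spacy.load(model_name)
    yield from nlp.pipe(texts)
```

Summing the counters over all parsed posts, for example sum((words_modified_by(d) for d in parse(posts)), Counter()), gives a frequency table of the words described as "bad" across the corpus.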

The most frequent terms mentioned in combination with “bad” were “luck” (-> bad luck), “good”, “feel”, and “decisions”. Since the observation period was only about 6 weeks, we didn’t track these words over time. In the following step of this approach, we ran CorEx, a semi-supervised topic modeling algorithm, to detect topics in our media/social media posts. In its basic form – without providing any anchor terms – CorEx detected the following topics when setting the number of topics to 8.

CorEx

We chose this number based on the total correlation (TC), available through the model's tcs attribute. Each topic explains a specific portion of the TC, so the distribution of tcs shows how many topics are useful. If adding topics contributes little to the overall TC, the existing topics already explain most of the information in the documents, and more topics are unlikely to help. As a rule of thumb, keep adding topics until the overall TC plateaus.
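The plateau rule can be sketched in code. The example below assumes the corextopic package with a binary bag-of-words matrix from scikit-learn; the 5% relative-gain threshold, the vocabulary size, and the function names are illustrative choices, not values from the challenge. The optional anchors argument is how CorEx can be guided toward chosen words.

```python
def choose_n_topics(tc_by_n, min_gain=0.05):
    """Pick the topic count at which overall TC stops growing meaningfully.

    tc_by_n maps a number of topics to the overall TC of a model fitted with
    that many topics. Returns the first count after which the relative TC
    gain drops below min_gain (a hypothetical threshold).
    """
    ns = sorted(tc_by_n)
    for prev, nxt in zip(ns, ns[1:]):
        gain = (tc_by_n[nxt] - tc_by_n[prev]) / max(tc_by_n[prev], 1e-12)
        if gain < min_gain:
            return prev
    return ns[-1]


def fit_corex(docs, n_topics, anchors=None, anchor_strength=3, seed=42):
    """Fit CorEx on binary word counts; anchors is an optional list of word lists."""
    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct  # third-party: pip install corextopic

    vectorizer = CountVectorizer(binary=True, max_features=20000)
    X = vectorizer.fit_transform(docs)
    words = list(vectorizer.get_feature_names_out())
    model = ct.Corex(n_hidden=n_topics, seed=seed)
    model.fit(X, words=words, anchors=anchors, anchor_strength=anchor_strength)
    return model
```

A typical loop would fit models for a range of topic counts, record each model.tc in a dictionary, pass it to choose_n_topics, and then inspect model.get_topics() and model.tcs for the selected model.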

We observed some frequent words in the social media data and defined a list of anchor terms accordingly to guide the topic model in their direction:

Anchor terms

Using these anchor terms, CorEx returned the following topics for social media posts:

CorEx returned

Second Approach: Topic Modeling with LDA

We started this approach with the WordCloud of social media posts to see the most weighted or dominant words. 

WordCloud

The result suggested Covid-related terms as being among the main keywords governing the social media conversations. We then applied LDA Topic Modeling using gensim and got the following topics.

LDA Topic Modeling using gensim

Analysis of these topics showed that most of the conversation centered on Covid test results, mainly positive ones, as seen in topics 3 & 4. This reflected the rumors and doubts circulating about the test results, as no country was happy to see its players test positive for Covid-19. We then identified the dominant topic for each text; a sample of the results is given in the table below:

Analyzing the texts to get the dominant topics for each text
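The dominant-topic step can be sketched with gensim as below. The number of topics, passes, and seed are placeholders (the article does not report its LDA settings), and the function names are hypothetical. Input documents are assumed to be already cleaned and tokenized into lists of words.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


def fit_lda(tokenized_docs, num_topics, passes=10, seed=42):
    """Train an LDA model and return it with the bag-of-words corpus."""
    from gensim import corpora  # third-party: pip install gensim
    from gensim.models import LdaModel

    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=passes, random_state=seed)
    return lda, corpus


def dominant_topics(lda, corpus):
    """Dominant (topic_id, probability) for every document in the corpus."""
    return [
        dominant_topic(lda.get_document_topics(bow, minimum_probability=0.0))
        for bow in corpus
    ]
```

Pairing each result with its original text gives a table of the kind described above, with one row per post showing its dominant topic and that topic's share.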

For a deeper analysis of the conversations, we moved to the third approach.

Third Approach: Zero shot Learning and Topic Modeling with Top2vec

In this third approach, we started with Zero shot Learning, using the zero-shot-classification pipeline from the transformers Python package to classify the sentiment of each post/tweet as Neutral, Positive, or Negative. We then extracted the Positive and Negative posts/tweets with high interactions. Using the Top2vec Python package for Topic Modeling, with distiluse-base-multilingual-cased as the embedding model, we extracted five topics (clusters of words) from the data.

Topic 0

Topic 1

Topic 2

Topic 3

Topic 4
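The two stages that produced these topics, sentiment filtering and Top2vec, can be sketched as follows. Several details are assumptions made for illustration: the multilingual model name passed to the pipeline, the post schema with 'text' and 'interactions' keys, and the interaction threshold. The article only names the zero-shot pipeline, the three sentiment labels, and the Top2vec embedding model.

```python
SENTIMENTS = ["Positive", "Negative", "Neutral"]


def top_label(result):
    """Highest-scoring label from a zero-shot classification result dict."""
    scores = result["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][best]


def build_classifier(model_name="joeddav/xlm-roberta-large-xnli"):
    """Zero-shot classifier; the multilingual model name is an assumption."""
    from transformers import pipeline  # third-party: pip install transformers
    return pipeline("zero-shot-classification", model=model_name)


def polarized_posts(posts, classify, min_interactions=100):
    """Keep high-interaction posts whose sentiment is Positive or Negative.

    posts: dicts with 'text' and 'interactions' keys (hypothetical schema).
    classify: callable taking (text, candidate_labels=...) like the pipeline.
    """
    kept = []
    for post in posts:
        if post["interactions"] < min_interactions:
            continue
        sentiment = top_label(classify(post["text"], candidate_labels=SENTIMENTS))
        if sentiment != "Neutral":
            kept.append({**post, "sentiment": sentiment})
    return kept


def fit_top2vec(documents):
    """Cluster documents into topics with a multilingual sentence encoder."""
    from top2vec import Top2Vec  # third-party: pip install top2vec[sentence_transformers]
    model = Top2Vec(documents, embedding_model="distiluse-base-multilingual-cased")
    topic_words, word_scores, topic_nums = model.get_topics()
    return model, topic_words
```

In use, polarized_posts(posts, build_classifier()) would select the posts, and the 'text' field of the result would be passed to fit_top2vec. Top2vec chooses the number of topics itself, so how the five topics above were obtained is not reproduced here.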

Insights from the word clouds 

Analysis of these topics and their top corresponding documents revealed that conversations revolved around Covid-19 tests, the political crisis in Cameroon, cheating, and content intended to distort online public perception of AFCON in Cameroon. We then labeled each cluster based on its dominant words and corresponding documents. We selected the labels whose posts/tweets had strong polarity (Negative or Positive) and ended up with the labels below, which represent our hot topics.

labels

We used Zero shot Learning to label each post/tweet with the identified hot topics. We then visualized the frequency of these topics with a bar chart (first graph) and how they polarized conversations over time, from January to April, with a streamgraph (second graph).

 

Bar Chart

streamgraph
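The data behind both graphs comes down to counting labeled posts, overall and per time bucket. The sketch below uses only the standard library and assumes each post has been reduced to a (date, topic label) pair; the weekly bucketing and the function names are illustrative choices, since the article does not state the time resolution of its streamgraph.

```python
from collections import Counter, defaultdict
from datetime import date


def topic_frequencies(labeled_posts):
    """Total number of posts per hot topic (bar-chart data)."""
    return Counter(topic for _, topic in labeled_posts)


def weekly_topic_counts(labeled_posts):
    """Posts per topic per ISO week (streamgraph data).

    labeled_posts: iterable of (datetime.date, topic_label) pairs.
    Returns (sorted (year, week) keys, {topic: [count for each week]}).
    """
    per_week = defaultdict(Counter)
    for day, topic in labeled_posts:
        year, week, _ = day.isocalendar()
        per_week[(year, week)][topic] += 1
    weeks = sorted(per_week)
    topics = sorted({t for counts in per_week.values() for t in counts})
    series = {t: [per_week[w][t] for w in weeks] for t in topics}
    return weeks, series
```

With matplotlib, one way to draw the streamgraph from this output is ax.stackplot(range(len(weeks)), *series.values(), labels=list(series), baseline="wiggle"), while topic_frequencies feeds a plain bar chart.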

Insights from the hot topics analysis 

The bar chart and streamgraph results suggest that comments around Covid-19 tests and allegations of corruption significantly polarized online conversations throughout the competition. The simultaneous appearance and disappearance of all hot topics in the streamgraph (second graph) suggests that actors organized in echo chambers, who pushed the theory of cheating through alleged fake Covid-19 tests and allegations of corruption, leveraged the crisis and political instability in Cameroon to sow online distrust and spread disinformation.

Conclusion

This challenge allowed us to understand the highly engaged online behavior of media outlets and Facebook and Twitter users during AFCON, and to identify the key topics using various supervised and unsupervised NLP techniques.

The analysis suffered from our inability to remove the noise added by posts that were semantically irrelevant but contained AFCON-related terms. Further research could examine how similar communities (based on linguistic similarities of their posts/tweets), or echo chambers, evolved.

This article was written by Diana Roccaro, Foutse Yuehgoh, Gaelle Patricia Talotsing, Mrityunjay Samanta, and Christian Ngnie.
