The TotalEnergies 2021 Africa Cup of Nations (CAN TotalEnergies 2021) took the continent by storm for 29 days. For some pundits and commentators, it was the most captivating edition of the past 20 years, especially in terms of tactics, passion, surprises, and level of engagement in the media and on social media.
One of the trending hashtags (#AFCON2021) generated significant interaction on social media, spreading ideas, topics, and discussions within echo chambers. Echo chambers are online clusters or communities that share or amplify similar opinions.
A high flux of mixed narratives, covering topics from technical sports analysis to the worldwide surge in cases of the Omicron variant of COVID-19, flooded local and international media.
AFCON 2021 was thematically characterized by a high number of complex and opposing narratives. We ran a challenge to make sense of this over-engagement in the media and on social media by identifying its fundamental underpinnings.
This article outlines the steps that we took in the challenge to prepare several media and social media datasets for training machine learning models to detect hot topics.
This challenge aimed to identify and characterize the hot topics that polarized activity in the media and on social media, based on their vitality, popularity, and sentiment, using various NLP and graph-based approaches. Ideally, these hot topics are not only detected but also tracked over time.
We collected the data using pygooglenews, Newspaper3k, snscrape, Insta scrape, Twint, Twarc, and Facebook scraping techniques. Each of these tools or techniques uses a different method to extract information from media and social media (Twitter, Facebook) platforms. In total, we extracted more than 100k posts, covering January 09, 2022, to April 09, 2022, for social media data and March 29, 2021, to April 09, 2022, for media data. The social media analysis thus dealt with posts published during and after the competition, and the media analysis with articles published before, during, and after it.
The cleaning phase consisted of four main steps. We started by using regex to remove irrelevant content: inexpressive emojis, null posts, and “ambulance chasers,” meaning any advertisement that leverages the popularity of the hashtags to gain visibility. Then, we removed duplicates. In the next step, we filtered by language. To keep the Natural Language Processing models simple, we worked only with French and English posts, using the langdetect Python package to detect the language of each post.
In the final step, we used the cleantext Python package on both French and English posts to remove extra spaces (double spaces between words, leading/trailing spaces), convert text to lowercase, and apply stemming, the process of reducing the inflected forms of a word to a common stem. For example, stemming run, runs, and running yields run, run, and run.
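The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline we ran: the advertisement keywords and emoji ranges are assumptions, every emoji is stripped rather than only the inexpressive ones, and stemming is left out. Normalization is applied before deduplication so that posts differing only in case or spacing collapse together, which is a choice of this sketch. The `detect_language` parameter is a hypothetical hook that defaults to langdetect's `detect` function.

```python
import re

# Hypothetical ad markers; the patterns actually used are not listed here.
AD_PATTERN = re.compile(r"\b(buy now|promo code|click here)\b", re.IGNORECASE)
# Rough emoji ranges (pictographs, misc symbols, flags); not exhaustive.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)


def default_detect(text):
    """Detect the language of `text` with the langdetect package."""
    from langdetect import detect  # lazy import: only needed for this default
    return detect(text)


def clean_posts(posts, detect_language=default_detect, keep_langs=("en", "fr")):
    """Apply the cleaning steps to an iterable of raw post strings."""
    seen, cleaned = set(), []
    for post in posts:
        # Step 1: drop null posts and advertisement content, strip emojis.
        if not post or AD_PATTERN.search(post):
            continue
        text = EMOJI_PATTERN.sub("", post)
        # Step 4 (partial): collapse extra spaces and lowercase.
        text = re.sub(r"\s+", " ", text).strip().lower()
        if not text:
            continue
        # Step 2: remove duplicates.
        if text in seen:
            continue
        seen.add(text)
        # Step 3: keep only posts in the target languages.
        try:
            lang = detect_language(text)
        except Exception:
            continue  # skip posts whose language cannot be detected
        if lang in keep_langs:
            cleaned.append(text)
    return cleaned
```

Passing the language detector as a parameter keeps the regex and deduplication logic usable without langdetect installed.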
First Approach: Dependency Parsing and CorEx Semi-Supervised Topic Modeling
In this first approach, we ran dependency parsing using spaCy to explore the relationships between tokens. For example, in the sentence “Buhari passed bad luck to Super Eagles through phone call”, “bad” is a modifier of the noun “luck”. The aim was to identify nouns or other words modified by the adjective “bad”, to find out what the author of the media/social media post had perceived as “bad”. We found 56 unique words modified by “bad”.
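As a rough sketch of this step, the function below counts the head words that a given adjective modifies. It assumes the relation of interest is spaCy's `amod` (adjectival modifier) label, which is a simplification, and the model name `en_core_web_sm` is an assumption as well. The helper names `words_modified_by` and `parse_posts` are hypothetical.

```python
from collections import Counter


def words_modified_by(doc, adjective="bad"):
    """Count the head words that `adjective` modifies in a parsed document.

    `doc` is any iterable of tokens exposing spaCy's `lower_`, `dep_`, and
    `head` attributes, such as a `spacy.tokens.Doc`.
    """
    counts = Counter()
    for token in doc:
        # "amod" marks an adjectival modifier in spaCy's English models.
        if token.lower_ == adjective and token.dep_ == "amod":
            counts[token.head.lower_] += 1
    return counts


def parse_posts(posts, model_name="en_core_web_sm"):
    """Parse posts with spaCy and aggregate the counts over all posts."""
    import spacy  # lazy import so words_modified_by stays dependency-free

    nlp = spacy.load(model_name)
    total = Counter()
    for doc in nlp.pipe(posts):
        total.update(words_modified_by(doc))
    return total
```

Calling `parse_posts(posts).most_common(10)` would then list the words most often modified by “bad”, provided spaCy and the model are installed.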
The terms most frequently mentioned in combination with “bad” were “luck” (as in “bad luck”), “good”, “feel”, and “decisions”. Since the observation period was only about 6 weeks, we didn’t track these words over time. In the following step of this approach, we ran CorEx, a semi-supervised topic modeling algorithm, to detect topics in our media/social media posts. In its basic form, without any anchor terms, CorEx detected the following topics when the number of topics was set to 8.
We chose this number based on the total correlation (TC), which the package exposes through the tcs attribute. Each topic explains a specific portion of the TC, so the distribution of tcs indicates how many topics to use. If adding topics contributes little to the overall TC, the existing topics already explain a large portion of the information in the documents, and more topics are likely unnecessary. As a general rule of thumb, keep adding topics until the overall TC plateaus.
We observed some frequent words in the social media data and defined a list of anchor terms accordingly, to guide the topic model toward them:
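One way to turn this rule of thumb into code is to count how many topics each explain a meaningful share of the overall TC. The sketch below is an illustration, not the procedure we followed: the function name `suggest_num_topics` and the 2% cut-off are assumptions, and the example values are made up.

```python
def suggest_num_topics(tcs, min_share=0.02):
    """Return how many topics each explain at least `min_share` of overall TC.

    `tcs` holds the per-topic total correlation values, for example the
    `tcs` attribute of a CorEx model fitted with a generous topic count.
    """
    ordered = sorted(tcs, reverse=True)
    total = sum(ordered)
    if total <= 0:
        return 0
    kept = 0
    for tc in ordered:
        # Stop at the first topic whose contribution is negligible.
        if tc / total < min_share:
            break
        kept += 1
    return kept


# Illustrative values only; in practice pass the tcs of a fitted model.
example_tcs = [4.0, 3.0, 2.0, 0.9, 0.05, 0.05]
print(suggest_num_topics(example_tcs))
```

A plot of the sorted values remains the more informative check; the threshold only makes the plateau explicit.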
Using these anchor terms, CorEx returned the following topics for social media posts:
Second Approach: Topic Modeling with LDA
We started this approach with a word cloud of the social media posts to see the dominant, most heavily weighted words.
Analysis of these topics showed that most of the conversation centered on Covid test results, mainly positive ones, as seen in topics 3 & 4. This reflected the rumors that circulated as doubts spread about the test results, since no country was happy to see its players test positive for Covid-19. We then extracted the dominant topic of each text; a sample of the results is given in the table below:
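Extracting the dominant topic of each text amounts to picking the topic with the highest probability in that text's topic distribution. The sketch below assumes each distribution is a list of (topic_id, probability) pairs, the shape returned by gensim's `LdaModel.get_document_topics`; the LDA library is not named in this article, so that choice is an assumption, and both function names are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])


def dominant_topic_table(texts, distributions):
    """Build one row per text with its dominant topic and probability."""
    rows = []
    for text, doc_topics in zip(texts, distributions):
        best = dominant_topic(doc_topics)
        if best is None:
            continue  # no topic assigned to this text
        topic_id, probability = best
        rows.append({"text": text, "topic": topic_id, "probability": probability})
    return rows
```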
For a deeper analysis of the conversations, we moved to the third approach.
Third Approach: Zero-Shot Learning and Topic Modeling with Top2Vec
In this third approach, we started with zero-shot learning, using the zero-shot classification pipeline of the transformers Python package to classify the sentiment of each post/tweet as Neutral, Positive, or Negative. We then extracted the Positive and Negative posts/tweets with high interactions. Using the Top2Vec Python package for topic modeling, with distiluse-base-multilingual-cased as the embedding model, we extracted five topics (clusters of words) from the data.
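The classification and filtering step can be sketched as below. It covers only the zero-shot part, not Top2Vec. The checkpoint is left at the pipeline default because the model used is not stated here; the `text` and `interactions` keys and the `min_interactions` threshold are hypothetical.

```python
SENTIMENTS = ["Positive", "Negative", "Neutral"]


def build_classifier():
    """Create a zero-shot classifier with the pipeline's default checkpoint."""
    from transformers import pipeline  # lazy import: heavy dependency

    return pipeline("zero-shot-classification")


def classify_sentiment(text, classifier, labels=SENTIMENTS):
    """Return the top-scoring label and its score for one post."""
    result = classifier(text, candidate_labels=labels)
    # The pipeline returns labels and scores sorted from best to worst.
    return result["labels"][0], result["scores"][0]


def polarized_high_engagement(posts, classifier, min_interactions=100):
    """Keep Positive/Negative posts with at least `min_interactions`."""
    selected = []
    for post in posts:
        if post["interactions"] < min_interactions:
            continue
        label, score = classify_sentiment(post["text"], classifier)
        if label in ("Positive", "Negative"):
            selected.append({**post, "sentiment": label, "score": score})
    return selected
```

With transformers installed, `polarized_high_engagement(posts, build_classifier())` returns the subset of posts that would feed the topic model. For French posts, a multilingual checkpoint would likely be needed in place of the default.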
Insights from the word clouds
Analysis of these topics and their top corresponding documents revealed that conversations revolved around Covid-19 tests, the political crisis in Cameroon, cheating, and subjects intended to distort the online public perception of AFCON in Cameroon. We then labeled each cluster based on its dominant words and corresponding documents. We selected the labels whose posts/tweets had strong polarity (Negative or Positive) and ended up with the labels below, which represent our hot topics.
We used zero-shot learning to label each post/tweet with the identified hot topics. We then visualized the frequency of these topics with a bar chart (first graph) and how they polarized conversations over time, from January to April, with a streamgraph (second graph).
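Both charts rest on counts of labeled posts per topic, aggregated over time for the streamgraph. A minimal sketch of that aggregation, assuming each labeled post is reduced to a (date, topic) pair and that weekly bins are used (the bin size is an assumption):

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_topic_counts(labeled_posts):
    """Count posts per hot topic per ISO week.

    `labeled_posts` is an iterable of (date, topic) pairs.
    Returns {topic: {(iso_year, iso_week): count}}.
    """
    counts = defaultdict(Counter)
    for day, topic in labeled_posts:
        iso_year, iso_week, _ = day.isocalendar()
        counts[topic][(iso_year, iso_week)] += 1
    return {topic: dict(weeks) for topic, weeks in counts.items()}
```

Summing each topic's weekly counts gives the bar chart values, while the per-week series can be stacked to draw the streamgraph.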
Insights from the hot topics analysis
The bar chart and streamgraph suggest that comments about Covid-19 tests and allegations of corruption significantly polarized online conversations throughout the competition. The simultaneous appearance and disappearance of all hot terms in the streamgraph (second graph) suggests that actors organized in echo chambers, pushing the theory of cheating through alleged fake Covid-19 tests and allegations of corruption, leveraged the crisis and political instability in Cameroon to sow online distrust and spread disinformation.
This challenge allowed us to understand the highly engaging online behavior of media outlets and Facebook or Twitter users during the AFCON and identify the key topics using various NLP techniques in supervised and unsupervised ways.
The analysis suffered from our inability to remove the noise added by posts that were semantically irrelevant but contained some AFCON-related terms. Further research could examine how similar communities (based on the linguistic similarity of their posts/tweets) or echo chambers evolved.