Understanding Climate Change Domains through Topic Modeling


Applying various topic modeling techniques (Top2Vec, LDA, SBERT, etc.) to extract nature-based solution (NbS) adaptation themes from text corpora.

By Bala Priya C, Nishrin Kachwala, Anju Mercian, Debaditya Shome, Farhad Sadeghlo, Hussein Jawad

 

Although many organizations have investigated the effectiveness of nature-based solutions (NbS) in helping people build thriving urban and rural landscapes, such solutions have not yet been analyzed to their full potential. With this in mind, the World Resources Institute (WRI) partnered with Omdena to understand how regional and global NbS can be leveraged to address and reduce the impact of climate change.

Our objective was to understand how three major coalitions, all of which embrace the key NbS of forest and landscape restoration, use their websites to build networks. We used a systematic approach to identify the climate adaptation measures that these platforms and their partners feature on their websites.

The goal of the African Forest Landscape Restoration Initiative (AFR100) in Africa and Initiative 20×20 in Latin America and the Caribbean is to restore and protect forests, farms, and other landscapes to support the wellbeing of local people. Cities4Forests partners with leading cities to connect with and invest in inner forests (urban parks), nearby forests (watersheds), and faraway forests (like the Amazon).

As a first step, information from the three NbS platforms, their partners, and relevant documents was collected using a scalable data collection pipeline that the team built.

 

Why Topic Modeling?

Collecting all texts, documents, and reports by web scraping the three platforms resulted in hundreds of documents and thousands of chunks of text. Manually analyzing such a large text dataset to gain meaningful insights was infeasible, so we leveraged topic modeling, a powerful NLP technique, to understand the impacts, the NbS approaches involved, and the various initiatives underway.

A topic is a collection of words that represent specific information in a text. In Natural Language Processing, extracting the latent topics that best describe the content of a text is called topic modeling.

 


Source: Image hand-drawn by Nishrin Kachwala

 

Topic Modeling is effective for:

  • Discovering hidden patterns that are present across the collection of documents.
  • Annotating documents according to these topics.
  • Using these annotations to organize, search, and summarize texts.
  • Mining a corpus of text data for recurring patterns of words.

The team experimented with topic modeling approaches that fall under unsupervised learning, semi-supervised learning, deep unsupervised learning, and matrix factorization, and analyzed the effectiveness of the following algorithms in the context of the problem:

  • Top2Vec
  • Topic Modeling using Sentence BERT (S-BERT)
  • Latent Dirichlet Allocation (LDA)
  • Non-negative Matrix Factorization (NMF)
  • Guided LDA
  • Correlation Explanation (CorEx)

Top2Vec

Top2Vec is an unsupervised algorithm for topic modeling and semantic search. It automatically detects topics present in the text and generates jointly embedded topic, document, and word vectors.

 


Source: arXiv:2008.09470v1 [cs.CL] — The topic words are the nearest word vectors to the topic vector

 

The data sources used in this modeling approach were the data obtained from heavy scraping of the Initiative 20×20 and Cities4Forests platforms, data from the light scraping pipeline, and the combined data from all websites. Top2Vec performs well on reasonably large datasets.

Top2Vec takes three key steps; a minimal usage sketch follows the list below.

  • Transform documents into numeric representations
  • Reduce the dimensionality of the embeddings
  • Cluster the documents to find topics
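
The sketch below shows how these steps look in code with the top2vec package, assuming documents is a list of raw text strings from the scraped platforms; the speed and worker settings are illustrative, not the exact project configuration.

from top2vec import Top2Vec

# Embedding, dimensionality reduction, and clustering all happen inside the constructor
model = Top2Vec(documents, speed="learn", workers=4)

print(model.get_num_topics())                              # topics discovered automatically
topic_words, word_scores, topic_nums = model.get_topics(5)
print(topic_words[0][:10])                                 # top words of the first topic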

The topics present in the text are visualized using word clouds. Here is one such word cloud, covering deforestation and the loss of green cover in certain habitats and geographical regions.


Wordcloud

 

The algorithm can also report the platform and the line in the document where each topic was found, which helps identify the organizations working toward similar causes.

 

Sentence-BERT (SBERT)

In this approach, we derive topics from clustered documents using a class-based variant of the Term Frequency-Inverse Document Frequency score (c-TF-IDF), which extracts the words that make each set of documents, or class, stand out compared to the others.

The intuition behind the method is as follows. Applying TF-IDF as usual to a set of documents compares the importance of words between documents. For c-TF-IDF, all documents in a single category (e.g., a cluster) are treated as a single document before TF-IDF is applied. The result is one very long document per category, and the resulting TF-IDF scores indicate the important words in each topic.

The S-BERT package produces contextual embeddings, and many pre-trained models are available ready to use. The top words per topic, with their scores, are shown below.
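
A minimal sketch of this pipeline is shown below, assuming documents is a list of strings; the SBERT model name, the use of KMeans for clustering, and the c-TF-IDF variant are illustrative choices, not the project's exact settings.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# 1. Embed each document with a pre-trained SBERT model
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(documents)

# 2. Cluster the embeddings (KMeans here for simplicity)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(embeddings)

# 3. c-TF-IDF: join all documents of a cluster into one "class document"
classes = [" ".join(d for d, l in zip(documents, labels) if l == c)
           for c in sorted(set(labels))]
count = CountVectorizer(stop_words="english").fit(classes)
t = count.transform(classes).toarray()              # term counts per class
tf = t / t.sum(axis=1, keepdims=True)               # class-level term frequency
idf = np.log(1 + len(documents) / t.sum(axis=0))    # one common idf variant
ctfidf = tf * idf

# Top words per cluster
words = count.get_feature_names_out()
for c in range(len(classes)):
    top = ctfidf[c].argsort()[-8:][::-1]
    print(c, [words[i] for i in top])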

 


S-BERT output: clusters (header) of relevant topics (rows) in the document with their TF-IDF scores

 

Latent Dirichlet Allocation (LDA)

Web scraping the three platforms yielded hundreds of documents and a very large number of text chunks. Using Latent Dirichlet Allocation (LDA), a popular algorithm for extracting hidden topics from large volumes of text, we discovered topics covering the NbS work and climate hazards underway at the NbS platforms.

LDA treats each document as a mixture of topics, and each topic as a collection of words with certain probability scores.

In practice, the topic structure, per-document topic distributions, and the per-document per-word topic assignments are latent and have to be inferred from observed documents.

Once the number of topics is fed to the algorithm, it rearranges the topic distribution within documents and the word distribution within topics until it reaches an optimal topic-word composition.

 

LDA with Gensim and spaCy

Every algorithm has its pros and cons, and Gensim's LDA is no exception.

Pros of using Gensim LDA are:

  • Provision to use N-grams for language modeling instead of only considering unigrams.
  • pyLDAvis for visualization
  • Gensim LDA is a relatively more stable implementation of LDA

Two metrics for evaluating the quality of our results are the perplexity and the coherence score; both are computed in the sketch after the list below.

  • Topic coherence scores a single topic by measuring how semantically close its high-scoring words are.
  • Perplexity is a measure of surprise: it captures how well the topics in a model match a set of held-out documents. If the held-out documents have a high probability of occurring, the perplexity is low. The statistic is most meaningful when comparing models with different numbers of topics, and the model with the lowest perplexity is generally considered the “best”.
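
As a rough illustration of these points, here is a minimal Gensim sketch, assuming documents is a list of pre-processed strings; the bigram settings, number of topics, and passes are illustrative.

from gensim import corpora
from gensim.models import CoherenceModel, LdaModel, Phrases
import pyLDAvis.gensim_models as gensimvis   # newer pyLDAvis; older versions expose pyLDAvis.gensim

tokens = [doc.split() for doc in documents]          # pre-processed, tokenised documents
bigram = Phrases(tokens, min_count=5)                # n-gram support beyond unigrams
tokens = [bigram[doc] for doc in tokens]

dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)

coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                           coherence="c_v").get_coherence()
bound = lda.log_perplexity(corpus)                   # per-word bound; 2**(-bound) gives perplexity
print(coherence, bound)

vis = gensimvis.prepare(lda, corpus, dictionary)     # interactive pyLDAvis visualization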

 

 

We choose the optimal number of topics by plotting the number of topics against the coherence scores they yield and selecting the value that maximizes the coherence score. On the other hand, if words repeat heavily across the final topics, we should choose a lower number of topics even at the cost of a lower coherence score.
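
A simple way to run this sweep, reusing the corpus, dictionary, and tokens from the sketch above, might look like this (the range of topic counts is illustrative):

scores = []
for k in range(2, 21):
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=tokens, dictionary=dictionary,
                        coherence="c_v")
    scores.append((k, cm.get_coherence()))

best_k, best_score = max(scores, key=lambda s: s[1])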

One addition that can improve Gensim's final results is the MALLET library. MALLET provides an efficient implementation of LDA that runs faster and gives better topic separation.

 

pyLDAvis visual for intertopic distance and the most relevant topic words

 

Visualizing the topics with pyLDAvis gives a global view of the topics and how they differ in terms of inter-topic distance, while also allowing a more in-depth inspection of the most relevant words in individual topics. The size of a bubble is proportional to the prevalence of its topic. Better models have relatively large, well-separated bubbles spread across the quadrants. Hovering over a topic bubble shows its most dominant words on the right in a histogram.

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for visualization. It is a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object as a two- or three-dimensional point such that similar objects are modeled by nearby points and dissimilar objects by distant points with high probability.

Using the 10k+ unique URLs across 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests), documents were scraped and topics were extracted with Python's Gensim LDA package. To visualize the complex data in three dimensions, we used scikit-learn's t-SNE with Plotly.

Below is a 3D projection for the partner organizations, for which the topic distributions were grouped manually. The distance between points in the 3D space represents the closeness of the keywords/topics in the URLs. Each colored dot represents an organization. Hovering over a point provides more information about the topics referenced in the URL. One can further group the URLs by color and analyze the data in greater depth.

 


Demo
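
A hedged sketch of how such a 3D projection can be produced is shown below; doc_topics is assumed to be an (n_docs, n_topics) array of per-document topic distributions, and org_labels and url_topics are illustrative metadata lists aligned with the documents.

import plotly.express as px
from sklearn.manifold import TSNE

coords = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(doc_topics)

fig = px.scatter_3d(
    x=coords[:, 0], y=coords[:, 1], z=coords[:, 2],
    color=org_labels,          # one color per partner organization
    hover_name=url_topics,     # dominant topic keywords shown on hover
)
fig.show()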

 

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization is an unsupervised learning algorithm.

It takes the term-document matrix of the text corpus and decomposes it into a document-topic matrix and a topic-term matrix, which quantify how relevant each topic is to each document in the corpus and how vital each term is to a particular topic.
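
A minimal sketch of this decomposition with scikit-learn is shown below; documents, the vectorizer settings, and the number of components are illustrative assumptions.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(documents)     # document-term matrix

nmf = NMF(n_components=10, random_state=42)
W = nmf.fit_transform(X)                    # document-topic matrix
H = nmf.components_                         # topic-term matrix

terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_terms = [terms[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")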

We use the rows of the resulting topic-term matrix to obtain a specified number of topics. NMF is known to capture diverse topics in a text corpus and is especially useful for identifying latent topics that are not explicitly discernible from the text documents. Here is an example of the topic word clouds generated on the light-scraped data. When we would like the topics to stay within a specific subset of interest, or to be contextually more informative, we may use semi-supervised topic modeling techniques such as Guided LDA (or Seeded LDA) and CorEx (Correlation Explanation) models.

Guided Latent Dirichlet Allocation (Guided LDA)

Guided LDA is a semi-supervised topic modeling technique that takes in certain seed words per topic, and guides the topics to converge in the specified direction.

When we want contextually relevant topics, such as climate change impacts, mitigation strategies, and initiatives, setting a few prominent seed keywords per topic lets us obtain topics that describe the content of the text along the directions of interest.

For example, for the data from the Cities4Forests platform, the following are some of the seed words used to obtain topics containing the most relevant keywords.

topic1 = ["forests", "degradation", "deforestation", "landscape"]
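
A hedged sketch of seeding topics with the guidedlda package is shown below; the second seed list, the document-term matrix construction, and the model settings are illustrative assumptions rather than the project's exact configuration.

import guidedlda
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents).toarray()    # document-term count matrix
word2id = {w: i for i, w in enumerate(vectorizer.get_feature_names_out())}

seed_topic_list = [
    ["forests", "degradation", "deforestation", "landscape"],   # topic1 from above
    ["climate", "adaptation", "mitigation", "resilience"],      # hypothetical second topic
]
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=8, n_iter=500, random_state=7)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)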

 

topic modeling climante change

Word Cloud from Analysis

 

CorEx Topic Model

CorEx is a discriminative topic model. It estimates the probability that a document belongs to a topic given that document's words, and it can be used to discover themes from a collection of documents, followed by further analysis such as clustering, searching, or organizing the themes to gain insights.

The Total Correlation (TC) is the measure that CorEx maximizes when constructing the topic model. CorEx starts from a random initialization, so different runs can result in different topic models. One way of finding the best topic model is to run the CorEx algorithm several times and take the run with the highest TC value (i.e., the run that produces topics that are most informative about the documents). A topic's underlying meaning is usually interpreted by the people building the model, who give it a name or category that reflects their understanding; this interpretation is a subjective exercise. Using anchor keywords, domain-specific topics (NbS and climate change in our case) can be integrated into the CorEx model, alleviating some interpretability concerns. The TC measure for the model with and without anchor words is shown below; the anchored models perform better.

topic modeling climate change

The TC measure for the model with and without anchor words
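
A hedged sketch of an anchored CorEx model with the corextopic package is shown below; the anchor word lists, binarized counts, and anchor strength are illustrative assumptions.

from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(documents)              # binary document-term matrix
words = list(vectorizer.get_feature_names_out())

anchors = [
    ["restoration", "degraded", "national", "parks"],
    ["deforestation", "forests", "landscape"],
]

model = ct.Corex(n_hidden=10, seed=42)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)

print(model.tc)                                      # total correlation of this run
for n, topic in enumerate(model.get_topics(n_words=8)):
    print(n, [w for w, *_ in topic])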

 

After tuning the anchored model's hyperparameters (anchor strength, anchor words, and number of topics), making several runs to find the best model, and cleaning up duplicates, the top topic is shown below.

Topic #5:

plants animals, socio bosque, sierra del divisor, national parks, restoration degraded, water nutrients, provide economic, restoration project

An Interpretation => National parks in Peru and Ecuador, which were significantly losing hectares to deforestation, are in restoration by an Initiative 20×20-affiliated project. This project also protects the local economy and endangered animal species.

 

Wrapping up

From the analysis of various topic modeling approaches, we summarize the following.

  • Compared to other topic modeling algorithms, Top2Vec is easy to use; it leverages jointly embedded document and word vectors to find topic vectors and does not require the text pre-processing steps of stemming, lemmatization, or stop-word removal.
  • For Latent Dirichlet Allocation, the usual text pre-processing steps are needed to obtain optimal results. Because the algorithm does not use contextual embeddings, it cannot fully account for semantic relationships, even when n-gram models are considered.
  • Topic modeling is thus effective for gaining insights into the latent topics in a collection of documents, which in our case was domain-specific, concerning documents from platforms addressing climate change impacts.
  • Limitations of topic modeling include the need for a large amount of relevant, consistently structured data to form clusters, and the need for domain expertise to interpret the relevance of the results. Discriminative models with domain-specific anchor keywords, such as CorEx, can help with topic interpretability.

 


Analyzing Domestic Violence in Lockdowns: From No Data to Building an ML Classifier for Tweets


By Omdena Collaborator Harshita Chopra

 
Data mining, topic modeling, document annotations, NLP, and stacking machine learning models: A complete journey.

Artificial Intelligence and its possibilities have always fascinated me. Making machines learn through data is nothing short of amazing. When I learned about Omdena and its wonderful initiative to bring AI to social good through the power of global collaboration, I couldn't stop myself from participating in its empowering challenges.

I felt truly content to be given the role of a Machine Learning Engineer in my first challenge. Connecting with a team of 50 fabulous collaborators from various countries around the world, including domain experts, data scientists, and AI experts — felt like a golden opportunity to gain knowledge in the best possible way.

It provided a space to create value out of my ideas and learn from the enhancements. The harmonizing environment gave me the experience of leading task groups and interacting with some really innovative minds.

In this blog post, I’ll walk you through a major part of the project I led and contributed to for the past month.

 

The Problem

There has been a surge in Domestic Violence (DV) and online harassment cases during the COVID-19 lockdowns in India. Homes are no longer a safe place for victims trapped in abusive relationships with their family members.

Domestic violence involves a pattern of psychological, physical, sexual, financial, and emotional abuse. Acts of assault, threats, humiliation, and intimidation are also considered acts of violence.

Data substantiating domestic violence from government resources is only available in summary form. Incidents are largely reported via calls, which makes data collection and subsequent mapping difficult.

The goal of the challenge was to collect and analyze data from different social media platforms and news sources so as to gain insights into the rise in DV incidents during the nationwide lockdown.

 

The Solution

Diverse social media platforms are a huge and largely untapped resource for social data and evidence. They generate a vast amount of data on a daily basis across a variety of topics. Consequently, they represent a key source of information for anyone seeking to study various issues, even socially stigmatized and taboo topics like domestic violence.

Victims experiencing abuse need early access to specialized services such as health care, crisis support, and legal guidance. Hence, social support groups working for this cause play a leading role in creating awareness and providing the various dimensions of social support, such as emotional, instrumental, and informational support to victims.

The Red Dot Foundation plans to tackle this challenge. When victims seek help, it is important to identify and analyze those critical posts and acknowledge the help needed with more immediate impact.

Tasks were divided to mine data from different sources: Twitter, Reddit, YouTube, news articles, government reports, and Google Trends. After acquiring huge amounts of data, the next step was filtering out relevant posts through topic modeling and keywords. This was followed by annotating the data and then building an NLP-based machine learning classifier.

In this blog post, Tweets would be in the spotlight!

 

 

Scraping data with the right queries

Tweets needed to be extracted from both the pre-lockdown and lockdown periods so as to judge the surge in domestic violence. Hence, we took a time frame of January 2020 to May 2020.

Tweepy (a Python library for the official Twitter API) extracts tweets only from the past seven days, which is a bothersome limitation. Hence, we needed an alternative for mining older tweets along with their location.

GetOldTweets3 is a fantastic Python library for this task. Twitter's advanced search can do wonders for generating a customized query. To extract harassment-related posts, here are a few examples of the queries we used:

 

 

 

Using AND combinations of relationship words with actions and nouns yields good results. The until and since attributes hold the limits of the time frame.

The setNear() feature accepts a location name (like Delhi, Maharashtra, or India) or the latitude and longitude of a region. The central point of India is approximately (22, 80) degrees. The setWithin() feature accepts a radius around that point; 1800 km generally covers India and nearby areas.
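
A hedged sketch of one such query with GetOldTweets3 is shown below; the keywords, dates, coordinates, and tweet limit are illustrative, not the exact project queries.

import GetOldTweets3 as got

criteria = (got.manager.TweetCriteria()
            .setQuerySearch("wife AND (beaten OR harassed)")
            .setSince("2020-01-01")
            .setUntil("2020-05-31")
            .setNear("22, 80")        # approximate central point of India
            .setWithin("1800km")      # radius covering India and nearby areas
            .setMaxTweets(500))

tweets = got.manager.TweetManager.getTweets(criteria)
texts = [tweet.text for tweet in tweets]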

After executing more such queries with different keywords, we had thousands of tweets in hand, some on relevant topics and some irrelevant.

 

Data needs to be classified  –  Would topic modeling work?

Since a considerable number of tweets in our huge datasets were not related to the kind of harassment we were looking for, we needed some filtering. Classifying tweets into broad topics was the goal. Topic modeling was the first thing that clicked.

Topic modeling is an unsupervised learning process that automatically identifies the topics present in a collection of documents and derives hidden patterns exhibited by a text corpus, thus assisting better decision-making.

Latent Dirichlet Allocation is the most popular technique to do so. It assumes that documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

 

 

Topic 0 words are generally included in awareness posts or #BanTiktok posts due to its inappropriate content.
Topic 1 words are headlines or real victim stories.

 

Topic modeling works best when the topics are considerably distinct or not too related to each other.

The generated topics didn't satisfy our goal of classifying tweets as relevant or irrelevant. Hence, we had to pick another approach, since our dataset in general talks about various kinds of harassment.

Rule-based classification turned out to be a more precise approach for this task. I created three sets of keywords to look for: relationships, violence verbs, and not-relevant words. The following algorithm was implemented in Python to filter out some documents.

 

relationships = ["wife", "husband", "housemaid"]   # illustrative; the full keyword lists were longer
verbs         = ["harass", "beat", "abuse"]
not_relevant  = ["webinar", "politics", "movie"]

def keep_or_discard(text):
    text = text.lower()
    if any(w in text for w in relationships) and any(w in text for w in verbs):   # R1
        return "Keep"
    if any(w in text for w in not_relevant):                                      # R2
        return "Discard"
    return "Discard"   # default for unmatched documents (assumption)

 

The dataset was filtered pretty much according to our custom needs. Next came the task of modeling, but we needed annotations for training a supervised model.

 

Deciding Labels and Annotating Tweets

Eminent domain experts helped in coming up with the categories for classifying tweets based on their context. Proper labeling guidelines were set up, and training sessions helped collaborators label tweets correctly, keeping edge cases in mind. For document classification, a quick and awesome tool called Doccano was used. Several collaborators helped by taking up queues of data points and annotating them. The following labels were used:

  • DV_OPINION_ADVOCATE
    (advocating against domestic abuse)
  • DV_OPINION_DENIER
    (denying the existence of domestic abuse)
  • DV_OPINION_INFO_NEWS
    (stating factual information or news)
  • DV_STORY
    (describing an incident of domestic abuse)
  • NON_D_VIOLENCE_ABOUT
    (other kinds of harassment)
  • NON_D_VIOLENCE_DIRECTED
    (harassment directed at individual or community)
  • NO_VIOLENCE

 


Analytics derived from NER.

 

 

And data for modeling is ready!

After all the annotations and some fabulous work by the collaborators, we were ready with an incredible training dataset.

Tinkering with Natural Language Processing…

Once the texts were pre-processed by lowercasing, removing URLs, punctuation, and stopwords, followed by lemmatization, we were ready to play around with modeling techniques.

To convert words to vectors (machines learn through numbers), experimenting with a TF-IDF vectorizer gave good results, but our vocabulary was very limited, while the inference data would contain a greater variety of words. Therefore, we decided to use pre-trained word embeddings.

Our model used FastText English word vectors (wiki-news-300d-1M.vec) and IndicNLP word vectors (indicnlp.v1.hi.vec) for the Hindi and Hinglish text present in the documents.

Since tweets describing DV stories were relatively few, data augmentation was applied to them by creating new sentences using synonyms of the original words.

nlpaug is a library for textual augmentation in machine learning experiments. The goal is to improve model performance by generating augmented textual data. It can also generate adversarial examples to help guard against adversarial attacks.
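
A minimal sketch of synonym-based augmentation with nlpaug is shown below; the example sentence and augmenter settings are illustrative.

import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_src="wordnet")    # replace words with WordNet synonyms
augmented = aug.augment("She was harassed and beaten by her husband during the lockdown")
print(augmented)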

 

 

Bringing into play  – Machine Learning Model(s)

A number of models, including BERT, SVM, XGBoostClassifier, and many more, were evaluated. Since there were really minute differences between some similar classes, we needed to combine two sets of labels.

After combining similar labels:

 

 

Limitations faced: data under the various classes was not easily separable, because three classes plainly talked about domestic violence (story, opinion, news/info), which made it tough for the classifier to spot marginal variations in semantics.

Also, DV_STORY had the fewest samples, despite being the most relevant class.

Hence, to deal with the imbalanced dataset, undersampling with the NeighbourhoodCleaningRule from the imbalanced-learn library was used. The resampled data was fed to stacked models.

Stacking is a way of combining predictions from multiple different types of ML models; it introduces the concept of a meta-learner (a minimal sketch follows the learner list below).

 

 

Source: GeeksforGeeks

 

Level 0 learners:
– Random Forest Classifier
– Support Vector Classifier
– MLP Classifier
– LGBM Classifier

Level 1 meta-learner:
SVC with hyperparameter tuning and custom class weights.
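
A hedged sketch of this resampling-plus-stacking setup with imbalanced-learn and scikit-learn is shown below; X and y stand for the embedded tweets and their labels, and all hyperparameters and class weights are illustrative assumptions.

from imblearn.under_sampling import NeighbourhoodCleaningRule
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Undersample the majority classes before training
X_res, y_res = NeighbourhoodCleaningRule().fit_resample(X, y)

level0 = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
    ("svc", SVC(probability=True)),
    ("mlp", MLPClassifier(max_iter=500)),
    ("lgbm", LGBMClassifier()),
]
meta = SVC(C=10, class_weight={0: 3, 1: 1, 2: 1, 3: 1, 4: 1})   # illustrative class weights

stack = StackingClassifier(estimators=level0, final_estimator=meta, cv=5)
stack.fit(X_res, y_res)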

 

Class Encoding — 0: DV_INCIDENT, 1: DV_OPINION, 2: DV_OPINION_INFO_NEWS, 3: NON_D_VIOLENCE_ABOUT, 4: NO_VIOLENCE

 

This pretty much sums up the modeling. The model was then used to predict labels on a dataset of 8,000 tweets. The misclassifications were skimmed through and corrected for some crucial classes in order to deliver the best data.

I feel glad to be part of this incredible community of change-makers. Making some great connections through this amazing journey is the icing on the cake.

So excited to collaborate in the upcoming mind-blowing projects, making the world a better place using AI for good!

 

 

 


 
Using Topic Modeling and Coreference Resolution to Identify Land Conflicts and Its Causes


Improving the accuracy score from 83% to 93% to identify land conflict topics in news articles.

 

Identifying environmental conflict events in India using news media articles

 

Part of this project was to scrape news media articles to identify environmental conflict events such as resource conflicts, land appropriation, human-wildlife conflict, and supply chain issues.

With an initial focus on India, we also connected conflict events to their jurisdictional policies to identify how to resolve those conflicts faster or to identify a gap in legislation.

Part of the pipeline for building this language model was a semi-supervised attempt at improving topic modeling performance to increase environmental sustainability; the process and outcome are available here.

In short, to make this topic modeling approach more robust, coreference resolution was suggested as one possible addition.

 

The Solution

What exactly is Coreference Resolution?

Coreference resolution is the task of finding all expressions that refer to the same entity in a text (1)

 

Use Cases

  1. In the context of this project, coreference resolution could best be used to improve topic modeling performance by replacing references to the same entity, in order to better model the actual meaning of the text. This increases the TF-IDF of generalized entities and removes ambiguous words that are meaningless for classification.
  2. Another use case would be to use the coreference-resolved text as additional features, along with Named Entity Recognition tags, in any classification approach. A one-hot-encoded version of the unique entities can be used as input to factorization machines or other approaches for sparse modeling.

 

Which packages are available to implement it?

 


Exploring almost every available python package out there.

 

We toyed around with some packages that seemed good in theory but were rather challenging to apply to our specific task. We needed a package that would be user-friendly, as a script would have to be developed for 28 people to take and apply without much struggle.

The candidates were NeuralCoref, Stanford NLP, Apache OpenNLP, and AllenNLP. After trying out each package, I personally preferred AllenNLP, but as a team we decided to use NeuralCoref with a short but effective script written by one of the collaborators, Srijha Kalyan.
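
A minimal sketch of what such a NeuralCoref script looks like is shown below; the spaCy model and the example sentence are illustrative, and NeuralCoref requires a spaCy 2.x pipeline.

import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("The villagers opposed the mining project because it threatened their forest.")
print(doc._.has_coref)        # True if any coreference cluster was found
print(doc._.coref_clusters)   # the detected clusters of co-referring mentions
print(doc._.coref_resolved)   # text with mentions replaced by their main entity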

The script was applied to the article data that had been annotated by fellow collaborators from the Annotation Task Group. This resulted in a CSV file with the original article titles, the original article text, and a new column of coreference-resolved article text, not as chains but in the same written format as the original article text.

 

An image of a table containing various fields of data regarding land conflicts

 

The output was then sent to the Topic Modeling Task Team, which at that point was sitting at an accuracy of 83%; with the coreference resolution data, the accuracy jumped to 93%.

That's a 10-percentage-point improvement! All the hard work and hours were clearly worth it!

 

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.
