Applying various topic modeling techniques (Top2Vec, LDA, SBERT, etc.) to extract nature-based solution (NbS) adaptation themes from text corpora.
By Bala Priya C, Nishrin Kachwala, Anju Mercian, Debaditya Shome, Farhad Sadeghlo, Hussein Jawad
Although many organizations have investigated the effectiveness of nature-based solutions (NbS) in helping people build thriving urban and rural landscapes, such solutions have not yet been analyzed to the full extent of their potential. With this in mind, World Resources Institute (WRI) partnered with Omdena to understand how regional and global NbS can be leveraged to address and reduce the impact of climate change.
Our objective was to understand how three major coalitions, all of which embrace forest and landscape restoration as a key NbS, use their websites to build networks. We used a systematic approach to identify the climate adaptation measures that these platforms and their partners feature on their websites.
The goal of the African Forest Landscape Restoration Initiative (AFR100) in Africa and Initiative 20×20 in Latin America and the Caribbean is to restore and protect forests, farms, and other landscapes to support the wellbeing of local people. Cities4Forests partners with leading cities to connect with and invest in inner forests (urban parks), nearby forests (watersheds), and faraway forests (like the Amazon).
As a first step, information and relevant documents from the three NbS platforms and their partners were collected using a scalable data collection pipeline that the team built.
Why Topic Modeling?
Scraping all texts, documents, and reports from the three platforms yielded hundreds of documents and thousands of chunks of text. Because manually analyzing a dataset of this size to gain meaningful insights is infeasible, we used topic modeling, a powerful NLP technique, to understand the impacts, the NbS approaches involved, and the various initiatives under way.
A topic is a collection of words that is representative of specific information in the text. In Natural Language Processing, extracting the latent topics that best describe the content of a text is called topic modeling.
Topic modeling is effective for:
- Discovering hidden patterns that are present across a collection of documents.
- Annotating documents according to these topics.
- Using these annotations to organize, search, and summarize texts.
It can also be thought of as a form of text mining that finds recurring patterns of words in a corpus of text data.
The team experimented with topic modeling approaches that fall under unsupervised learning, semi-supervised learning, deep unsupervised learning, and matrix factorization, and analyzed the effectiveness of the following algorithms in the context of the problem.
- Top2Vec
- Topic Modeling using Sentence BERT (S-BERT)
- Latent Dirichlet Allocation (LDA)
- Non-negative Matrix Factorization (NMF)
- Guided LDA
- Correlation Explanation (CorEx)
Top2Vec
Top2Vec is an unsupervised algorithm for topic modeling and semantic search. It automatically detects the topics present in the text and generates jointly embedded topic, document, and word vectors.
The data sources used in this modeling approach are the data obtained from the heavy scraping of the Initiative 20×20 and Cities4Forests platforms, data from the light scraping pipeline, and combined data from all websites. Top2Vec performs well on reasonably large datasets.
Top2Vec takes three key steps:
- Transforming documents into numeric representations.
- Reducing the dimensionality of those representations.
- Clustering the documents to find topics.
The topics present in the text are visualized using word clouds. One such word cloud covers deforestation and the loss of green cover in certain habitats and geographical regions.
The algorithm can also report which platform, and which line of the document, each topic was found in. This can help identify organizations working towards similar causes.
Topic Modeling using Sentence BERT (S-BERT)
In this approach, we derive topics from clustered documents using a class-based variant of the Term Frequency-Inverse Document Frequency score (c-TF-IDF), which extracts the words that make each set of documents, or class, stand out from the others.
The intuition behind the method is as follows. When one applies TF-IDF as usual on a set of documents, one compares the importance of words between documents. For c-TF-IDF, one treats all documents in a single category (e.g., a cluster) as a single document and then applies TF-IDF. The result is a very long document per category and the resulting TF-IDF score would indicate the important words in a topic.
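The intuition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the team's implementation: the weighting used here (term frequency within the class, multiplied by the log of one plus the average class length divided by the term's total frequency) is one common c-TF-IDF formulation and is an assumption, as are the toy clusters and the helper names `c_tf_idf` and `top_terms`.

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """Score terms per class; `classes` maps a cluster label to tokenized documents."""
    # Join all documents of a class into one long "document" by counting its tokens.
    class_counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in classes.items()
    }
    # Frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words_per_class = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        class_length = sum(counts.values())
        scores[label] = {
            term: (freq / class_length)
            * math.log(1 + avg_words_per_class / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

def top_terms(scores, label, n=3):
    """Return the n highest-scoring terms of a class (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical clusters of tokenized text chunks, for illustration only.
toy_classes = {
    "forest": [["forest", "restoration", "trees"], ["forest", "trees", "land"]],
    "water": [["water", "watershed", "city"], ["water", "city", "land"]],
}
scores = c_tf_idf(toy_classes)
print(top_terms(scores, "forest", n=2))
print(top_terms(scores, "water", n=2))
```

Words shared by both clusters, such as "land" in this toy data, are down-weighted, while words concentrated in a single cluster rise to the top.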
S-BERT produces embeddings that depend on the context in which a word appears, and many pre-trained models are available ready to use. The top words per topic, with their scores, are shown below.
Latent Dirichlet Allocation (LDA)
Using Latent Dirichlet Allocation (LDA), a popular algorithm for extracting hidden topics from large volumes of text, we discovered topics covering the NbS and climate hazards addressed by the NbS platforms.
LDA treats each document as a mixture of topics, and each topic as a collection of words with associated probability scores.
In practice, the topic structure, per-document topic distributions, and the per-document per-word topic assignments are latent and have to be inferred from observed documents.
Once given the number of topics, the algorithm iteratively rearranges the topic distribution within documents and the word distribution within topics until it reaches a good composition of both.
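To make this iterative rearranging concrete, here is a minimal sketch of LDA inference by collapsed Gibbs sampling in plain Python. It is illustrative only: Gibbs sampling is a different inference procedure from the one Gensim's LDA uses, and the function name `lda_gibbs`, the hyperparameter values, and the toy documents are all assumptions made for the example.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much this document uses it and
                # how much the topic uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word

# Hypothetical tokenized documents, for illustration only.
toy_docs = [
    ["forest", "restoration", "trees", "forest"],
    ["trees", "restoration", "land"],
    ["water", "city", "watershed", "water"],
    ["city", "watershed", "flood"],
]
vocab, doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print(f"topic {k}:", [word for _, word in top])
```

Each pass reassigns every token to a topic in proportion to how well that topic fits both the document and the word, which is the rearranging of the two distributions described above.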
LDA with Gensim and Spacy
Like every implementation, Gensim has its pros and cons.
The pros of using Gensim LDA are:
- Support for n-grams in language modeling, instead of considering unigrams only.
- pyLDAvis for visualization.
- A relatively stable implementation of LDA.
Two metrics for evaluating the quality of our results are perplexity and the coherence score:
- Topic coherence scores a single topic by measuring how semantically close its highest-scoring words are.
- Perplexity is a measure of surprise: it captures how well the topics in a model match a set of held-out documents. If the held-out documents have a high probability of occurring, the perplexity score is lower. The statistic is most meaningful when compared across models with different numbers of topics. The model with the lowest perplexity is generally considered the “best”.
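Both metrics can be illustrated with a short sketch. The article does not say which coherence measure the team used, so the UMass-style co-occurrence score below is an assumption, one of several coherence measures in use. The perplexity helper takes the natural-log probabilities a model assigns to held-out tokens, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence: do a topic's top words co-occur in the same documents?

    `top_words` must be ordered from highest to lowest score, and every word
    is assumed to appear in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for higher_ranked, lower_ranked in combinations(top_words, 2):
        score += math.log(
            (doc_freq(higher_ranked, lower_ranked) + 1) / doc_freq(higher_ranked)
        )
    return score

def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities of held-out tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical tokenized documents, for illustration only.
toy_docs = [
    ["forest", "restoration", "land"],
    ["forest", "restoration"],
    ["water", "city"],
    ["water", "city", "forest"],
]
coherent_topic = umass_coherence(["forest", "restoration"], toy_docs)
mixed_topic = umass_coherence(["forest", "city"], toy_docs)
print(coherent_topic, mixed_topic)

# A model that spreads probability evenly over four words is as "surprised"
# as a uniform four-way guess.
uniform_perplexity = perplexity([math.log(0.25)] * 8)
print(uniform_perplexity)
```

A topic whose top words tend to appear in the same documents scores higher on coherence, and a model that assigns higher probability to held-out text gets a lower perplexity.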
We choose the optimal number of topics by plotting the number of topics against the coherence scores they yield and picking the number that maximizes coherence. However, if many words are repeated across topics in the final results, we should choose a lower number of topics even if its coherence score is lower.
One thing that can improve the final results with Gensim is the MALLET library, an efficient implementation of LDA that runs faster and gives better topic separation.
Visualizing the topics using pyLDAvis gives a global view of the topics and how they differ in terms of inter-topic distance, while also allowing a more in-depth inspection of the most relevant words in individual topics. The size of each bubble is proportional to the prevalence of its topic. Better models have relatively large, well-separated bubbles spread out among the quadrants. Hovering over a topic bubble shows its most dominant words in a bar chart on the right.
t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for visualization. It is a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object as a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.
Documents were scraped from the 10k+ unique URLs inside 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests), and topics were extracted with Gensim’s LDA implementation in Python. To visualize this complex data in three dimensions, we used scikit-learn’s t-SNE with Plotly.
Below is a visual of the partner organizations’ 3D projection, for which the topic distributions were grouped manually. The distance between points in the 3D space represents the closeness of the keywords/topics in the URLs. The color of a dot represents an organization. Hovering over a point provides more information about the topics referred to in the URL. One can further group the URLs by color and analyze the data in greater depth.
Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization is an unsupervised learning algorithm.
It takes in the document-term matrix of the text corpus and decomposes it into a document-topic matrix and a topic-term matrix, which quantify how relevant each topic is to each document in the corpus and how important each term is to a particular topic.
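The decomposition can be sketched with NumPy using the classic multiplicative update rules for NMF. This is a minimal illustration under stated assumptions, not the team's code: the matrix has documents as rows and terms as columns, and the function name `nmf`, the iteration count, and the toy counts are hypothetical.

```python
import numpy as np

def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative while
        # reducing the reconstruction error ||V - W @ H||.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical term counts, for illustration only.
terms = ["forest", "trees", "restoration", "water", "city", "watershed"]
V = np.array(
    [
        [2, 1, 1, 0, 0, 0],
        [1, 2, 1, 0, 0, 0],
        [0, 0, 0, 2, 1, 1],
        [0, 0, 0, 1, 2, 1],
    ],
    dtype=float,
)
W, H = nmf(V, n_topics=2)

# Each row of H is a topic; its largest entries are the topic's key terms.
for k, row in enumerate(H):
    top = [terms[i] for i in np.argsort(row)[::-1][:3]]
    print(f"topic {k}:", top)
```

Here `W` plays the role of the document-topic matrix and `H` the topic-term matrix. In practice a library implementation, such as the one in scikit-learn, would normally be used instead of hand-written updates.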
We use the rows of the resulting topic-term matrix to get a specified number of topics. NMF is known to capture diverse topics in a text corpus and is especially useful in identifying latent topics that are not explicitly discernible from the text documents. Here’s an example of the topic word clouds generated on the light scraped data.
When we would like the topics to be within a specific subset of interest or contextually more informative, we may use semi-supervised topic modeling techniques such as Guided LDA (or Seeded LDA) and CorEx (Correlation Explanation) models.
Guided Latent Dirichlet Allocation (Guided LDA)
Guided LDA is a semi-supervised topic modeling technique that takes in certain seed words per topic, and guides the topics to converge in the specified direction.
When we would like to get contextually relevant topics such as climate change impacts, mitigation strategies, and initiatives, setting a few prominent seed keywords per topic enables us to obtain topics that help understand the content of the text in the directions of interest.
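As a rough sketch of how seed keywords are typically handed to a guided model, the snippet below turns per-topic seed lists into a mapping from vocabulary index to topic index. Everything in it is hypothetical: the seed words and vocabulary are placeholders rather than the ones the team used, and `build_seed_topics` is an illustrative helper. A mapping of this shape is what, for instance, the `guidedlda` Python package takes as its `seed_topics` argument; treat that as an assumption, since the article does not name the library used.

```python
def build_seed_topics(seed_words_per_topic, word_to_id):
    """Map the vocabulary index of each seed word to the topic it should guide."""
    seed_topics = {}
    for topic_id, seed_words in enumerate(seed_words_per_topic):
        for word in seed_words:
            # Seed words missing from the vocabulary are skipped.
            if word in word_to_id:
                seed_topics[word_to_id[word]] = topic_id
    return seed_topics

# Hypothetical vocabulary and seed lists, for illustration only.
vocabulary = ["city", "drought", "flood", "forest", "heat", "restoration", "water"]
word_to_id = {word: index for index, word in enumerate(vocabulary)}
seed_words_per_topic = [
    ["flood", "drought", "heat"],                # topic 0: climate impacts
    ["restoration", "forest", "agroforestry"],   # topic 1: NbS approaches
]
seed_topics = build_seed_topics(seed_words_per_topic, word_to_id)
print(seed_topics)
```

The model then nudges each seeded word towards its assigned topic during training, while unseeded words remain free to settle wherever the data places them.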
For example, in the data from the Cities4Forests platform, the following are some of the seed words that were used to get topics containing the most relevant keywords.
CorEx Topic Model
CorEx is a discriminative topic model. It estimates the probability that a document belongs to a topic given the words in that document. It can be used to discover themes in a collection of documents, which can then be analyzed further by clustering, searching, or organizing them to gain insights.
Total Correlation (TC) is the measure that CorEx maximizes when constructing the topic model. CorEx starts from a random initialization, so different runs can result in different topic models. One way of finding the best topic model is to run the CorEx algorithm several times and take the run with the highest TC value (i.e., the run that produces the topics that are most informative about the documents). A topic’s underlying meaning is usually interpreted by the people building the model, who give it a name or category that reflects their understanding of it. This interpretation is a subjective exercise. Using anchor keywords, domain-specific topics (NbS and climate change in our case) can be integrated into the CorEx model, alleviating some interpretability concerns. The TC measure for the model with and without anchor words is below; the anchored models perform better.
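To give a feel for the quantity itself, separate from the team's reported TC values, the sketch below computes the total correlation of a few binary word-presence variables: the sum of their individual entropies minus their joint entropy. It is a toy illustration only. CorEx's actual objective is, to our understanding, the part of this quantity that the latent topics explain, and the function names and data here are hypothetical.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical entropy (in nats) of a list of hashable observations."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def total_correlation(rows):
    """Sum of per-variable entropies minus the joint entropy.

    `rows` holds one tuple per document, with a 0/1 flag per word.
    """
    columns = list(zip(*rows))
    return sum(entropy(column) for column in columns) - entropy(
        [tuple(row) for row in rows]
    )

# Hypothetical word-presence data: (contains "forest", contains "restoration").
always_together = [(1, 1), (1, 1), (0, 0), (0, 0)]
unrelated = [(1, 1), (1, 0), (0, 1), (0, 0)]

tc_together = total_correlation(always_together)
tc_unrelated = total_correlation(unrelated)
print(tc_together, tc_unrelated)
```

Words that always appear together share information and give a positive total correlation, while unrelated words give zero. A topic that groups strongly co-occurring words therefore explains more of the total correlation in the documents.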
After tuning the anchored model’s hyperparameters (anchor strength, anchor words, and number of topics), making several runs to find the best model, and cleaning up duplicates, the top topic is shown below.
plants animals, socio bosque, sierra del divisor, national parks, restoration degraded, water nutrients, provide economic, restoration project
An interpretation: National parks in Peru and Ecuador, which were losing significant numbers of hectares to deforestation, are being restored by an Initiative 20×20-affiliated project. This project also protects the local economy and endangered animal species.
From the analysis of various topic modeling approaches, we summarize the following.
- Compared to other topic modeling algorithms, Top2Vec is easy to use. It leverages joint document and word semantic embeddings to find topic vectors, and does not require the text pre-processing steps of stemming, lemmatization, or stop word removal.
- For Latent Dirichlet Allocation, these text pre-processing steps are needed to obtain optimal results. As the algorithm does not use contextual embeddings, it cannot fully account for semantic relationships, even when n-gram models are considered.
- Topic Modeling is thus effective in gaining insights about latent topics in a collection of documents, which in our case was domain-specific, concerning documents from platforms addressing climate change impacts.
- Limitations of topic modeling include the need for a large amount of relevant, consistently structured data in order to form clusters, and the need for domain expertise to interpret the relevance of the results. Discriminative models with domain-specific anchor keywords, such as CorEx, can help with topic interpretability.