Understanding Nature-Based Solutions through Natural Language Processing and Machine Learning


Nature-based solutions (NbS) can help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes.

By Bala Priya, Simone Perazzoli, Nishrin Kachwala, Anju Mercian, Priya Krishnamoorthy, and Rosana de Oliveira Gomes

 

Why

NbS harness nature to tackle environmental challenges that affect human society, such as climate change, water insecurity, pollution, and declining food production. NbS can also help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes through, for example, growing trees in rural areas to boost crop yields and lock water in the soil. Although many organizations have investigated the effectiveness of NbS, these solutions have not yet been analyzed to the full extent of their potential.

In order to analyze such NbS approaches in greater detail, World Resources Institute (WRI) partnered with Omdena to better understand how regional and global NbS, such as forest and landscape restoration, can be leveraged to address and reduce climate change impacts across the globe.

In an attempt to identify and analyze such approaches, we investigated three main platforms that bring organizations together to promote initiatives that restore forests, farms, and other landscapes and enhance tree cover to improve human well-being:

  • AFR100 (the African Forest Landscape Restoration Initiative), working across Africa
  • Initiative 20×20, working across Latin America and the Caribbean
  • Cities4Forests, which partners with leading cities to connect and invest in inner (urban parks), nearby (watersheds), and faraway forests (like the Amazon)

With these platforms in mind, the project goal was to assess the network of these three coalition websites through a systematic approach and to identify the climate adaptation measures covered by the platforms and their partners.

The integral parts of the project’s workflow included:

  • Building a scalable data collection pipeline to scrape data from the platforms, their partner organizations, and several useful PDFs.
  • Leveraging several Natural Language Processing techniques: a Neural Machine Translation pipeline to translate non-English text into English, sentiment analysis to identify potential gaps, and experiments to find the language models best suited to each use case.
  • Exploring various supervised and unsupervised topic modeling techniques to surface latent topics and meaningful insights in the voluminous text data collected.
  • Applying Zero-Shot Classification (ZSC) to identify climate impacts and interventions.
  • Building a Knowledge-Based Question Answering (KBQA) system and a recommender system.

 

Project workflow

data science project workflow

 

Data collection

 

data collection

 

The platforms engaged in climate-risk mitigation were studied for several factors, including the climate risks in each region, as well as initiatives taken by the platforms and their partners, NbS employed for mitigating climate risks, the effectiveness of adaptations, goals, road map of the platform, among others. This information was gathered through:

a) Heavy scraping of platform websites: This involved scraping data from all of the platforms’ website pages using Python scripts. The process required manual effort to customize the scraping for each page; accordingly, extending this approach would involve some effort. Approximately 10MB of data was generated through this technique.

b) Light scraping of platform websites and partner organizations: This involved obtaining each platform’s sitemap and then crawling the listed organization websites to extract their text. This method can be extended to other platforms with minimal effort. Around 21MB of data was generated this way.

c) PDF text data scraping of platform and other sites: The platform websites contain several informative PDF documents (including reports and case studies) that were helpful for downstream models such as the Q&A and recommendation systems. This process was fully automated by the PDF text-scraping pipeline, which takes an input CSV listing the PDFs and generates a consolidated CSV file containing paragraph text from all of them. The pipeline can be run incrementally to process PDFs in batches. The NLP models used all of the PDF documents from the three platform websites, as well as some documents with general information on NbS referenced by WRI. Approximately 4MB of data was generated from the available PDFs.
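As a rough sketch of such a pipeline (the team’s actual PDF library and column names aren’t stated here; pdfplumber and the `pdf_path` column are assumptions):

```python
import pandas as pd
import pdfplumber  # one of several possible PDF text-extraction libraries

def pdf_to_paragraphs(path):
    """Extract paragraph-level text from a single PDF, keeping its source and page number."""
    rows = []
    with pdfplumber.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            for para in (p.strip() for p in text.split("\n\n")):
                if para:
                    rows.append({"source_pdf": path, "page": page_no, "paragraph": para})
    return rows

# Input CSV listing the PDFs to scrape; output is one consolidated paragraph-level CSV.
pdf_paths = pd.read_csv("pdf_inputs.csv")["pdf_path"]
all_rows = [row for path in pdf_paths for row in pdf_to_paragraphs(path)]
pd.DataFrame(all_rows).to_csv("pdf_paragraphs.csv", index=False)
```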

 

Data preprocessing

a) Data cleaning: an initial step comprising the removal of unnecessary text, texts shorter than 50 characters, and duplicates.

b) Language Detection and Translation: This step involved the development of a pipeline for language detection and translation to be applied to text data gathered from the 3 main sources described above.

Language detection was performed using libraries such as langdetect and pycld3. The detected language is then passed as an input parameter to the translation pipeline, which downloads the corresponding pre-trained multilingual model from the Helsinki-NLP repository on Hugging Face. Text is tokenized and organized into batches that are sequentially fed to the pre-trained model. For better performance, the pipeline uses a GPU when one is available, and once a model has been downloaded it is cached in memory so it does not need to be downloaded again.
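A minimal sketch of this detection-plus-translation step, assuming a direct Helsinki-NLP opus-mt checkpoint exists for the detected language pair:

```python
import torch
from langdetect import detect
from transformers import MarianMTModel, MarianTokenizer

def translate_to_english(texts):
    """Detect the source language of a batch of texts and translate them to English."""
    lang = detect(texts[0])                          # e.g. "es" or "pt"
    if lang == "en":
        return texts
    model_name = f"Helsinki-NLP/opus-mt-{lang}-en"   # assumes a {lang}->en checkpoint is available
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_to_english(["La restauración de paisajes mejora los medios de vida rurales."]))
```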

The translation performed well with the majority of texts to which it was applied (most were in Spanish or Portuguese), being able to generate well-structured and coherent results, especially considering the scientific vocabulary of the original texts.

c) NLP preparation: This step was applied to the CSV files produced by the scraping and translation pipelines, and comprised punctuation removal, stemming, lemmatization, stop-word removal, part-of-speech (POS) tagging, and chunking.
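A minimal NLTK-based sketch of these preparation steps (the team’s exact toolchain isn’t specified; spaCy would work equally well):

```python
import string
import nltk
from nltk import RegexpParser, pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("stopwords"),
#           nltk.download("wordnet"), nltk.download("averaged_perceptron_tagger")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
chunker = RegexpParser("NP: {<JJ>*<NN.*>+}")  # simple noun-phrase chunk grammar

def prepare(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = [t for t in word_tokenize(text) if t not in stop_words]          # stop-word removal
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]                        # lemmatization
    tagged = pos_tag(lemmas)                                                  # POS tagging
    chunks = chunker.parse(tagged)                                            # chunking
    return lemmas, tagged, chunks
```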

 

Data modeling

Statistical analysis 

Statistical Analysis was performed on the preprocessed data by exploring the role of climate change impacts, interventions, and ecosystems involved in the three platforms’ portfolios using two different approaches: zero-shot classification and cosine similarity.

1.) Zero-Shot Classification: This model assigns probabilities indicating which user-defined labels best fit a text. We applied a zero-shot classification model from Hugging Face to classify descriptions against a given set of keywords covering climate-change impacts, interventions, and ecosystems for each of the three platforms. For ZSC, we combined the heavy-scraped datasets into one CSV per website. The scores computed by ZSC can be interpreted as probabilities that a label applies to a particular description; as a rule, we only considered scores at or above 0.85 to be relevant.

Let’s consider the following example:

Description: The Greater Amman Municipality has developed a strategy called Green Amman to ban the destruction of forests The strategy focuses on the sustainable consumption of legally sourced wood products that come from sustainably managed forests The Municipality sponsors the development of sustainable forest management to provide long term social economic and environmental benefits Additional benefits include improving the environmental credentials of the municipality and consolidating the GAMs environmental leadership nationally as well as improving the quality of life and ecosystem services for future generations.

Model Predictions: The model assigned the following probabilities based upon the foregoing description:

  • Climate Change Impact Predictions: ‘loss of vegetation’: 0.89 , ‘deforestation’: 0.35, ‘GHG emissions’: 0.23, ‘rapid growth’ : 0.20, ‘loss of biodiversity’: 0.15 … (21 additional labels)
  • Types of Interventions Predictions: ‘management’: 0.92 , ‘protection’: 0.90 , ‘afforestation’: 0.65 , ‘enhance urban biodiversity’: 0.49, ‘Reforestation’: 0.38 … (16 additional labels)
  • Ecosystems: ‘Temperate forests’: 0.66, ‘Mediterranean shrubs and Forests’: 0.62, ‘Created forest’: 0.57, ‘Tropical and subtropical forests’: 0.55 … (13 additional labels)

 

For the description above, the Climate Change Impact prediction is ‘loss of vegetation’, the Types of Intervention prediction is ‘management’ or ‘protection’, and the Ecosystems prediction is empty because no ecosystem label reaches the 0.85 threshold.
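As a rough sketch, this kind of prediction can be reproduced with the Hugging Face zero-shot pipeline (the checkpoint, the abridged description, and the label subset below are illustrative assumptions):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

description = ("The Greater Amman Municipality has developed a strategy called Green Amman "
               "to ban the destruction of forests and promote sustainably managed forests.")  # abridged

impact_labels = ["loss of vegetation", "deforestation", "GHG emissions",
                 "rapid growth", "loss of biodiversity"]  # subset of the project's keyword list

result = classifier(description, candidate_labels=impact_labels, multi_label=True)

# Keep only labels scoring at or above the 0.85 relevance threshold used in the project.
relevant = [(label, score) for label, score in zip(result["labels"], result["scores"])
            if score >= 0.85]
print(relevant)
```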

2.) Cosine Similarity: Cosine similarity compares embedding vectors of the keywords and of the descriptions (both generated with Hugging Face models) and scores how closely their directions align. We then plot the scores with respect to technical and financial partners and a set of keywords. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations. This approach was useful for validating the results of the ZSC approach.
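A minimal sketch of the idea using a sentence-transformers checkpoint (the project’s exact embedding model and keyword set aren’t named here):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

keywords = ["coastal erosion", "land degradation", "agroforestry"]   # illustrative keyword set
descriptions = [
    "Restoring degraded farmland with agroforestry systems in the Sahel.",
    "Protecting mangroves to buffer storm surge along the coast.",
]

keyword_emb = model.encode(keywords, convert_to_tensor=True)
desc_emb = model.encode(descriptions, convert_to_tensor=True)

scores = util.cos_sim(desc_emb, keyword_emb)   # descriptions x keywords similarity matrix
print(scores)
```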

Aligning these results made it possible to answer the following questions:

  • What are the climate hazards and climate impacts most frequently mentioned by the NbS platforms’ portfolios?
  • What percentage of interventions/initiatives take place in highly climate-vulnerable countries or areas?
  • What ecosystem/system features most prominently in the platforms when referencing climate impacts?

 

This model was applied to descriptions from all three heavy-scraped websites, and we compared cross-referenced results (such as Climate Change Impact vs. Intervention, Climate Change Impact vs. Ecosystems, and Ecosystems vs. Intervention) across the three websites. We also created plots broken down by country and by partner (technical and financial) for each website.

 

Sentiment analysis

Sentiment Analysis (SA) is the automatic extraction of sentiment from text, drawing on both data mining and NLP. Here, SA is applied to identify potential gaps and solutions in the corpus text extracted from the three main platforms. For this task, we implemented the following well-established unsupervised approaches: VADER, TextBlob, AFINN, FlairNLP, and AdaptNLP Easy Sequence Classification. A new approach, BERT clustering, was proposed by the Omdena team; it embeds a list of positive/negative reference keywords with BERT and computes the distance of each embedded description to the corresponding cluster (see the sketch after the list below), where:

  • negative reference: words related to challenges and hazards, which give us a negative sentiment
  • positive reference: words related to NBS solutions, strategies, interventions, and adaptations outcomes, which give us a positive sentiment
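A minimal sketch of the BERT-clustering idea, with illustrative reference keywords and an assumed sentence-transformers checkpoint (the team’s actual keyword lists and the thresholds in Table 1 differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

negative_ref = ["drought", "deforestation", "flooding", "soil erosion", "water scarcity"]
positive_ref = ["restoration", "reforestation", "agroforestry", "adaptation", "conservation"]

model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed checkpoint
neg_centroid = model.encode(negative_ref).mean(axis=0)   # centroid of the negative reference cluster
pos_centroid = model.encode(positive_ref).mean(axis=0)   # centroid of the positive reference cluster

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bert_cluster_sentiment(description, margin=0.05):
    emb = model.encode(description)
    diff = cos(emb, pos_centroid) - cos(emb, neg_centroid)
    if diff > margin:
        return "solution (positive)"
    if diff < -margin:
        return "gap (negative)"
    return "neutral"

print(bert_cluster_sentiment("Prolonged drought has degraded pastures across the region."))
```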

For modeling purposes, the threshold values adopted are presented in table 1.

sentiment analysis

 

According to the scoring of the models presented in Table 2, AdaptNLP, Flair, and BERT/clustering approaches exhibited better performance compared to the lexicon-based models. Putting the limitations of unsupervised learning aside, BERT/clustering is a promising approach that could be improved for further scaling. SA can be a challenging task, since most algorithms for SA are trained on ordinary-language comments (such as from reviews and social media posts), while the corpus text from the platforms has a more specialized, technical, and formal vocabulary, which raises the need to develop a more personalized analysis, such as the BERT/clustering approach.

sentiment analysis

 

Across all organizations, it was observed that the content focuses on solutions rather than gaps. Overall, potential solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.

 

Topic modeling

Topic Modeling is a method for automatically finding the topics in a collection of documents that best represent the information within the collection. This provides high-level summaries of an extensive collection of documents, allows searching for records of interest, and groups similar documents together. The algorithms and techniques explored for the project include Top2Vec, S-BERT, and Latent Dirichlet Allocation (LDA) with Gensim and spaCy.

  • Top2Vec: Word clouds of weighted word sets were generated to represent the information in the documents. The example word cloud below shows a topic about deforestation in the Amazon and other countries in South America.

Word cloud generated when a search was performed for the word “deforestation”.

 

  • S-BERT: Identifies the top Topics in texts of projects noted from the three platforms. The top keywords that emerged from each dominant Topic were manually categorized, as shown in the table. The texts from projects refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, Infrastructure in Rural South America.
  • LDA: In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution. A t-SNE visualization of keywords/topics in the 10k+ unique URLs inside 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available on the app deployed via Streamlit and Heroku. The distance between points in the 3D space represents the closeness of keywords/topics in the URLs; each dot’s color represents an organization, and hovering over a point provides more information about the topics referred to in that URL. One can further group the URLs by color and analyze the data in greater depth. A t-SNE plot shows the dominant keywords drawn from the three platforms’ partner organization documents, with each dot’s color representing a partner organization.

 

Other NLP/ML techniques

Besides the techniques described above, other techniques were also employed in this project and will be presented in further articles, such as:

Network Analysis presents interconnections among the platform, partners, and connected websites. A custom network crawler was created, along with heuristics such as prioritizing NbS organizations over commercial linkages (this can be tuned) and parsing approx. 700 organization links per site (this is another tunable parameter). We then ran the script with different combinations of source nodes (usually the bigger organizations like AFR100, INITIATIVE20x20 were selected as sources to achieve the required depth in the network). Based on these experiments, we derived a master set of irrelevant sites (such as social media, advertisements, site-protection providers, etc.) that are not crawled by our software.

Knowledge Graphs represent the information extracted from the website text in terms of the relationships between entities. A pipeline was built to extract subject-relation-object triplets from each paragraph using StanfordNLP’s OpenIE. Subjects and objects are represented by nodes, and relations by the paths (or “edges”) between them.
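A minimal sketch of turning such triplets into a graph with networkx (the OpenIE extraction step is omitted, and the triplets shown are illustrative):

```python
import matplotlib.pyplot as plt
import networkx as nx

# Triplets as produced by an OpenIE-style extractor: (subject, relation, object).
triplets = [
    ("AFR100", "restores", "degraded landscapes"),
    ("Cities4Forests", "invests in", "urban parks"),
    ("Initiative 20x20", "protects", "forests"),
]

G = nx.DiGraph()
for subj, rel, obj in triplets:
    G.add_edge(subj, obj, relation=rel)

pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, node_color="lightgreen", font_size=8)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "relation"), font_size=7)
plt.show()
```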

Recommendation Systems: The recommender system application is built from the information extracted from the partners’ websites, with the goal of recommending possible solutions that are already available and implemented within WRI’s network of partners. The application allows a user to search for similarities across organizations (collaborative filtering) as well as similarities in the content of the solutions (content-based filtering).

Question & Answer System: Our knowledge-based Question & Answer system answers questions in the domain context of the text scraped from the PDF documents on the main platform websites, a few domain-related PDF documents covering climate risks and NbS, and the light-scraped data obtained from the platforms and their partner websites.

The Q&A system is based on Facebook’s Dense Passage Retrieval (DPR) method, which provides better context by generating dense vector embeddings. A Retrieval-Augmented Generation (RAG) model then generates a specific answer to a given question, conditioned on the shortlisted documents. The system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via a REST API from the Streamlit UI.

The platform websites have many PDF documents containing extensive significant information that would take a lot of time for humans to process. The Q&A system is not a replacement for human study or analysis but helps ease such efforts by obtaining the preliminary information, linking the reader to the specific documents which have the most relevant answers. The same method was extended to light-scraped data, broadly covering the platform websites and their partner websites.

The PDF and light-scraped documents are stored in two different Elasticsearch indices so the two streams can be queried separately. Dense Passage Retrieval is layered on top of the Elasticsearch retriever for contextual search, providing better answers, and RAG is applied to the retrieved passages to produce a specific answer. Elasticsearch filters can be applied to the platform or URL fields for a focused search on a particular platform or website; Elasticsearch 7.6.2 is installed on the VM, a version compatible with Deepset.ai Haystack. Climate risks, NbS solutions, local factors, and investment opportunities are queried on the PDF and platform data, and queries can be restricted by platform (for PDF data) or by URL (for light-scraped data) for a localized search.
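A minimal sketch of such a pipeline with the Haystack 1.x API (index names, hosts, and model choices here are assumptions, not the project’s exact configuration):

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever, RAGenerator
from haystack.pipelines import GenerativeQAPipeline

# One index per stream; a second store could point at index="light_scraped".
document_store = ElasticsearchDocumentStore(host="localhost", index="pdf_paragraphs")

retriever = DensePassageRetriever(document_store=document_store)
document_store.update_embeddings(retriever)          # precompute DPR embeddings for all documents

generator = RAGenerator(model_name_or_path="facebook/rag-token-nq")
pipeline = GenerativeQAPipeline(generator=generator, retriever=retriever)

result = pipeline.run(
    query="Which nature-based solutions address coastal erosion?",
    params={"Retriever": {"top_k": 5}},
)
print(result["answers"])
```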

 

Insights

By developing decision-support models and tools, we hope to make the NbS platforms’ climate change-related knowledge useful and accessible for partners of the initiative, including governments, civil society organizations, and investors at the local, regional, and national levels.

Any of these resources can be augmented with additional platform data, which would require customizing the data gathering effort per website. WRI could extend the keywords used in statistical analysis for hazards, the types of interventions, the types of ecosystems, and create guided models to gain further insights.

 

Data gathering pipeline

We have provided utilities to collect and aggregate text and PDF content from websites. WRI can extend the web-scraping utility from the three leading platforms and their partners to other platforms with some customization and minimal effort. Using the PDF utility, WRI can retrieve text from any PDF file, and the pre-trained multilingual models in the translation utility can translate texts from a range of source languages into English.

 

Statistical analysis

Using zero-shot classification, predictions were made for the keywords that highlight Climate Hazards, Types of Interventions, and Ecosystems, based upon a selected threshold. Cosine similarity predicts the similarity of a document with regard to the keywords. Heat maps visualize both of these approaches. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations.

 

Sentiment analysis

SA identifies potential gaps from negative connotations derived from words related to challenges and hazards. A tree diagram visualizes the sentiment analysis for publications/partners/projects documents from each platform. Across all organizations, the content focuses on solutions rather than gaps. Overall, solutions and possible solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.

 

Topic Models

Topic models are useful for identifying the main topics in documents. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together.

  • Top2Vec: Semantic search with Top2Vec produced word clouds of weighted word sets that best represented the information in the documents. The example word cloud from topic modeling shows a topic about deforestation in the Amazon and other countries in South America.
  • S-BERT: Identifies the top Topics in texts of projects noted from the three platforms. The top keywords that emerged from each dominant Topic were manually categorized, as shown in the table. The texts from projects refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, Infrastructure in Rural South America.
  • In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topics distribution within the documents and keywords distribution within the topics to obtain a good composition of the topic-keywords distribution.
  • A t-SNE visualization of keywords/topics in the 10k+ unique URLS inside 34 Partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available on the app deployed via Streamlit and Heroku.
  • The distance in the 3D space among points represents the closeness of keywords/topics in the URL
  • The color of a dot represents an organization; hovering over a point provides more information about the topics referred to in the URL.
  • One can further group the URLs by color grouping and analyze the data in greater depth.
  • A t-SNE plot representing dominant keywords indicated from the three platforms’ Partner organization documents. Each color of the dot represents a partner organization.

 

This work has been part of a project with the World Resources Institute.

Understanding Climate Change Domains through Topic Modeling


Applying various topic modeling techniques (Top2Vec, LDA, SBERT, etc.) to extract nature-based solution (NbS) adaptation themes from text corpora.

By Bala Priya C, Nishrin Kachwala, Anju Mercian, Debaditya Shome, Farhad Sadeghlo, Hussein Jawad

 

Although many organizations have investigated the effectiveness of nature-based solutions (NBS) to help people build thriving urban and rural landscapes, such solutions have not yet been analyzed to the full extent of their potential. With this in mind, World Resources Institute (WRI) partnered with Omdena to understand how regional and global NbS can be leveraged to address and reduce the impact of climate change.

Our objective was to understand how three major coalitions, all of which embrace the key NbS of forest and landscape restoration, use their websites to build networks. We used a systematic approach to identify the climate adaptation measures that these platforms and their partners feature on their websites.

The goal of the African Forest Landscape Restoration Initiative (AFR100) in Africa and Initiative 20×20 in Latin America and the Caribbean is to restore and protect forests, farms, and other landscapes to support the wellbeing of local people. Cities4Forests partners with leading cities to connect and invest in inner (urban parks), nearby (watersheds), and faraway forests (like the Amazon).

As the first step, information from the three NbS platforms and their partners, and relevant documents were collected using a scalable data collection pipeline that the team built.

 

Why Topic Modeling?

Collecting all texts, documents, and reports by web scraping the three platforms resulted in hundreds of documents and thousands of chunks of text. Given the huge volume of text data obtained, and the infeasibility of manually analyzing such a large dataset to gain meaningful insights, we leveraged Topic Modeling, a powerful NLP technique, to understand the impacts, the NbS approaches involved, and the various initiatives in this direction.

A topic is a collection of words that are representative of specific information in text form. In the context of Natural Language Processing, extracting latent topics that best describe the content of the text is described as Topic modeling.

 

topic modeling

Source: Image hand-drawn by Nishrin Kachwala

 

Topic Modeling is effective for:

  • Discovering hidden patterns that are present across the collection of topics.
  • Annotating documents according to these topics.
  • Using these annotations to organize, search, and summarize texts.
  • It can also be thought of as a form of text mining to obtain recurring patterns of words in a corpus of text data.

The team experimented with topic modeling approaches that fall under unsupervised learning, semi-supervised learning, deep unsupervised learning, and matrix factorization. The team analyzed the effectiveness of the following algorithms in the context of the problem.

  • Top2Vec
  • Topic Modeling using Sentence BERT (S-BERT)
  • Latent Dirichlet Allocation (LDA)
  • Non-negative Matrix Factorization (NMF)
  • Guided LDA
  • Correlation Explanation (CorEx)

Top2Vec

Top2Vec is an unsupervised algorithm for topic modeling and semantic search. It automatically detects topics present in the text and generates jointly embedded topic, document, and word vectors.

 

topic modeling climate change

Source: arXiv:2008.09470v1 [cs.CL] — The topic words are the nearest word vectors to the topic vector

 

The data sources used in this modeling approach are the data obtained from heavy scraping of the Initiative 20×20 and Cities4Forests platforms, data from the light-scraping pipeline, and the combined data from all websites. Top2Vec performs well on reasonably large datasets.

There are three key steps taken by Top2Vec.

  • Transform documents to numeric representations
  • Dimensionality Reduction
  • Clustering of documents to find topics.

The topics present in the text are visualized using word clouds. Here’s one such word cloud that talks about deforestation, loss of green cover in certain habitats and geographical regions.
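A minimal sketch of training and querying a Top2Vec model on such a corpus (the `documents` list and the keyword search are illustrative assumptions):

```python
from top2vec import Top2Vec

# documents: a list of translated, scraped paragraphs from the three platforms
model = Top2Vec(documents, speed="learn", workers=4)

print(model.get_num_topics())

# Find the topics most similar to a keyword of interest and plot one as a word cloud.
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["deforestation"], num_topics=3
)
model.generate_topic_wordcloud(topic_nums[0])
```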

topic modeling climate change

Wordcloud

 

The algorithm can output from which platform and which line in the document the topics were found. This can help identify the organizations working towards similar causes.

 

Sentence-BERT (SBERT)

In this approach, we aim to derive topics from clustered documents using a class-based variant of the Term Frequency-Inverse Document Frequency score (c-TF-IDF), which allows us to extract the words that make each set of documents, or class, stand out compared to the others.

The intuition behind the method is as follows. When one applies TF-IDF as usual on a set of documents, one compares the importance of words between documents. For c-TF-IDF, one treats all documents in a single category (e.g., a cluster) as a single document and then applies TF-IDF. The result is a very long document per category and the resulting TF-IDF score would indicate the important words in a topic.

The S-BERT package extracts different embeddings based on the context of each word, and many pre-trained models are available ready to use. The top words per topic, with their scores, are shown below.
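A minimal sketch of this approach, assuming a sentence-transformers checkpoint and a simple KMeans clustering (the team’s exact embedding model, clustering method, and cluster count aren’t specified here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

n_clusters = 10
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed checkpoint
embeddings = embedder.encode(docs)                           # docs: list of paragraph strings
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

# c-TF-IDF: treat all documents in a cluster as one long document, then apply a TF-IDF-style weighting.
grouped = [" ".join(d for d, l in zip(docs, labels) if l == k) for k in range(n_clusters)]
vectorizer = CountVectorizer(stop_words="english").fit(grouped)
counts = vectorizer.transform(grouped).toarray()             # word counts per cluster
tf = counts / counts.sum(axis=1, keepdims=True)
idf = np.log(1 + (counts.sum() / n_clusters) / counts.sum(axis=0))
ctfidf = tf * idf

terms = vectorizer.get_feature_names_out()
for k in range(n_clusters):
    top = ctfidf[k].argsort()[-8:][::-1]
    print(k, [terms[i] for i in top])
```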

 

Topic Modeling

S-Bert Output: Clusters(header) of relevant topics(in rows) in the document with their TF-IDF scores

 

Latent Dirichlet Allocation (LDA)

Collecting all texts, documents, and reports by web scraping of the three platforms resulted in hundreds of documents and millions of chunks of text. Using Latent Dirichlet Allocation (LDA), a popular algorithm for extracting hidden topics from large volumes of text, we discovered topics covering NbS and Climate hazards underway at the NbS platforms.

LDA’s approach to topic modeling considers each document to be a mixture of various topics, and each topic to be a collection of words with certain probability scores.

In practice, the topic structure, per-document topic distributions, and the per-document per-word topic assignments are latent and have to be inferred from observed documents.

Once the number of topics is fed to the algorithm, it will rearrange the topic distribution in documents and word distribution in topics until there is an optimal composition of the topic-word distribution.

 

LDA with Gensim and Spacy

Every algorithm has its pros and cons, and Gensim’s LDA implementation is no different.

Pros of using Gensim LDA are:

  • Provision to use N-grams for language modeling instead of only considering unigrams.
  • pyLDAvis for visualization
  • Gensim LDA is a relatively more stable implementation of LDA

Two metrics for evaluating the quality of our results are the perplexity and the coherence score:

  • Topic Coherence scores a single topic by measuring how semantically close the high-scoring words of that topic are.
  • Perplexity is a measure of surprise, which measures how well the topics in a model match a set of held-out documents; If the held-out documents have a high probability of occurring, then the perplexity score will have a lower value. The statistic makes more sense when comparing it across different models with a varying number of topics. The model with the lowest perplexity is generally considered the “best”.

 

 

We choose the optimal number of topics by plotting the number of topics against the coherence scores they yield and selecting the value that maximizes the coherence score. On the other hand, if words repeat heavily across the final topics, we should choose a lower number of topics, even at the cost of a lower coherence score.
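A minimal Gensim sketch of this selection loop (the tokenized corpus, topic range, and training passes are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# tokenized_docs: list of token lists produced by the preprocessing step
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

coherence_by_k = {}
for k in range(4, 21, 2):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10, random_state=42)
    cm = CoherenceModel(model=lda, texts=tokenized_docs, dictionary=dictionary, coherence="c_v")
    coherence_by_k[k] = cm.get_coherence()
    print(k, coherence_by_k[k], lda.log_perplexity(corpus))

best_k = max(coherence_by_k, key=coherence_by_k.get)
```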

One addition that can improve Gensim’s final results is the MALLET library, an efficient implementation of LDA that runs faster and gives better topic separation.

 


topic modeling climate change

pyLDAvis visual for Intertopic distance and most relevant Topic words

 

Visualizing the topics with pyLDAvis gives a global view of the topics and how they differ in inter-topic distance, while at the same time allowing a more in-depth inspection of the most relevant words in individual topics. The size of a bubble is proportional to the prevalence of its topic. Better models have relatively large, well-separated bubbles spread out across the quadrants. Hovering over a topic bubble shows its most dominant words on the right in a histogram.

A t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for visualizations. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

Using the 10k+ unique URLS inside 34 Partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests), documents were scraped, and topics were extracted with Python’s LDA Gensim package. For visualizing complex data in three dimensions, we used scikit-learn’s t-SNE with Plotly.
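A minimal sketch of this 3D projection with scikit-learn and Plotly (the embeddings, organization labels, and hover text are assumed inputs):

```python
import plotly.express as px
from sklearn.manifold import TSNE

# doc_embeddings: one vector per URL; org_labels / dominant_topics: parallel lists of metadata
projection = TSNE(n_components=3, perplexity=30, random_state=42).fit_transform(doc_embeddings)

fig = px.scatter_3d(
    x=projection[:, 0], y=projection[:, 1], z=projection[:, 2],
    color=org_labels,            # one color per partner organization
    hover_name=dominant_topics,  # dominant LDA topic shown on hover
)
fig.show()
```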

Below is a visual of the Partner organization’s 3D projection for which the topic distributions were grouped manually. The distance in the 3D space among points represents the closeness of keywords/topics in the URL. The color dot represents an organization. Hovering over a point provides more information about the Topics referred to in the URL. One can further group the URLs by color and analyze the data in greater depth.

 

topic modeling climate change

Demo

 

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization is an unsupervised learning algorithm.

It takes in the Term-Document Matrix of the text corpus and decomposes into the Document-Topic matrix and the Topic-Term matrix that quantifies how relevant the topics are in each document in the corpus and how vital each term is to a particular topic.

We use the rows of the resulting Term-Topic Matrix to get a specified number of topics. NMF is known to capture diverse topics in a text corpus and is especially useful in identifying latent topics that are not explicitly discernible from the text documents. Here’s an example of the topic word clouds generated on the light scraped data. When we would like the topics to be within a specific subset of interest or contextually more informative, we may use semi-supervised topic modeling techniques such as Guided LDA (or Seeded LDA) and CorEx(Correlation Explanation) models.
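Before moving to the semi-supervised variants, here is a minimal scikit-learn sketch of NMF topic extraction (the vectorizer settings, topic count, and `docs` input are illustrative):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, stop_words="english")
X = vectorizer.fit_transform(docs)          # term-document matrix (documents x terms)

nmf = NMF(n_components=10, random_state=42)
W = nmf.fit_transform(X)                    # document-topic matrix
H = nmf.components_                         # topic-term matrix

terms = vectorizer.get_feature_names_out()
for topic_idx, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[-8:][::-1]]
    print(topic_idx, top_terms)
```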

Guided Latent Dirichlet Allocation (Guided LDA)

Guided LDA is a semi-supervised topic modeling technique that takes in certain seed words per topic, and guides the topics to converge in the specified direction.

When we would like to get contextually relevant topics such as climate change impacts, mitigation strategies, and initiatives, setting a few prominent seed keywords per topic enables us to obtain topics that help understand the content of the text in the directions of interest.

For example, in the data from the platform cities4forests, the following are some of the seed words that were used, to get topics containing the most relevant keywords.

topic1=[“forests”,”degradation”,”deforestation”,”landscape”]
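A minimal sketch using the guidedlda package with the seed words above (the second seed topic, topic count, and confidence value are illustrative assumptions):

```python
import guidedlda
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # docs: list of paragraph strings
word2id = vectorizer.vocabulary_

seed_topic_list = [
    ["forests", "degradation", "deforestation", "landscape"],  # topic1 from above
    ["water", "watershed", "erosion", "flood"],                # illustrative second seed topic
]
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=8, n_iter=500, random_state=7, refresh=100)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.3)
```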

 

topic modeling climate change

Word Cloud from Analysis

 

CorEx Topic Model

CorEx is a discriminative topic model. It estimates the probability a document belongs to a topic given the content of that document’s words and can be used for discovering themes from a collection of documents, then further analysis such as clustering, searching, or organizing the collection of themes to gain insights.

Total Correlation (TC) is the measure that CorEx maximizes when constructing the topic model. CorEx starts from a random initialization, so different runs can produce different topic models. One way of finding the best topic model is to run the CorEx algorithm several times and keep the run with the highest TC value (i.e., the run that produces the topics most informative about the documents). A topic’s underlying meaning is usually interpreted by the people building the model, who give it a name or category to reflect their understanding; this interpretation is a subjective exercise. Using anchor keywords, domain-specific topics (NbS and climate change in our case) can be integrated into the CorEx model, alleviating some of these interpretability concerns. The TC measure for the model with and without anchor words is shown below, with the anchored models showing better performance.

topic modeling climate change

The TC measure for the model with and without anchor words
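A minimal sketch of anchored CorEx with repeated runs to keep the highest-TC model (anchor words, topic count, and anchor strength here are illustrative):

```python
import scipy.sparse as ss
from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english", binary=True)
X = ss.csr_matrix(vectorizer.fit_transform(docs))     # docs: list of paragraph strings
words = list(vectorizer.get_feature_names_out())

anchors = [["restoration", "degraded"], ["deforestation", "forests"]]  # illustrative anchors

best_model = None
for seed in range(5):                                  # several runs; keep the most informative one
    model = ct.Corex(n_hidden=10, seed=seed)
    model.fit(X, words=words, anchors=anchors, anchor_strength=3)
    if best_model is None or model.tc > best_model.tc:
        best_model = model

print(best_model.tc)                                   # total correlation of the best run
for topic in best_model.get_topics(n_words=8):
    print(topic)
```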

 

After tuning the anchored model’s hyperparameters (anchor strength, anchor words, number of topics), making several runs to find the best model, and cleaning up duplicates, the top topic is shown below.

Topic #5:

plants animals, socio bosque, sierra del divisor, national parks, restoration degraded, water nutrients, provide economic, restoration project

An Interpretation => National parks in Peru and Ecuador, which were significantly losing hectares to deforestation, are in restoration by an Initiative 20×20-affiliated project. This project also protects the local economy and endangered animal species.

 

Wrapping up

From the analysis of various topic modeling approaches, we summarize the following.

  • Compared to other topic modeling algorithms Top2vec is easy to use and the algorithm leverages joint document and word semantic embedding to find topic vectors, and does not require the text pre-processing steps of stemming, lemmatization, or stop words removal.
  • For Latent Dirichlet Allocation, the necessary text pre-processing steps are needed to obtain optimal results. As the algorithms do not use contextual embeddings, it’s not possible to account for semantic relationships completely even when considering n-gram models.
  • Topic Modeling is thus effective in gaining insights about latent topics in a collection of documents, which in our case was domain-specific, concerning documents from platforms addressing climate change impacts.
  • Limitations of topic modeling include the requirement of a lot of relevant data and consistent structure to be able to form clusters and the need for domain expertise to interpret the relevance of the results. Discriminative models with domain-specific anchor keywords such as CorEx can help in topic interpretability.

 


Visualizing Climate Change Impacts and Nature Based Solutions


Applying various data science tools and methods to visualize climate change impacts.

By Nishrin Kachwala, Debaditya Shome, and Oscar Chan

Day by day, as we generate exponentially more data, we sift through more of its complexity and consume more of it. Filtering for relevance is essential to get to the gist of the data in front of us. It is often claimed that the human brain absorbs a picture 60,000 times faster than text, and that about 65% of people are visually inclined.

To tell the story of climate-change-related data beyond analysis and investigation, we needed to analyze trends and support decision-making. Visualizing information is essential to practical data science: to explore the data, preprocess it, tune the model to the data, and ultimately to gain insights and take action.

No data story is complete without the inclusion of great visuals.

 

The Project

Understanding the impact of Nature-based solutions on climate change

The World Resources Institute (WRI) sought to understand the regional and global landscape of Nature-based Solutions (NbS).

  • How are some NbS platforms addressing climate hazards?
  • What type of NbS solutions are adapted?
  • What barriers and opportunities exist, etc.

The focus was initially on three platforms, AFR100, Cities4Forests, and Initiative20x20, with plans to later scale the work to more platforms.

More than 30 Omdena AI engineers worked on this NLP problem to derive actionable insights, develop a recommendation system and a knowledge-based Q&A system to query the data from the NbS platforms, and extract sentiment from the data to find potential gaps. Topic modeling was applied to derive dominant topics from the data, website network analysis mapped the organizations, and statistical analysis helped explore the involvement of ‘climate change impacts’, ‘interventions’, and ‘ecosystems’ across the three platforms.

Using Streamlit, we built a highly interactive, shareable web application (dashboard) to zoom into the NLP results for actionable insights on Nature-based Solutions. The Streamlit app was deployed to the web using Heroku. A major advantage of Streamlit is that it lets developers build a sophisticated dashboard with multiple elements, such as Plotly graph objects, tables, and interactive controls, using Python scripts rather than additional HTML for layout. This allows multiple project outputs to be incorporated into the same dashboard swiftly and with minimal code.
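A minimal sketch of the dashboard skeleton (section names and widgets are illustrative, not the deployed app’s actual code):

```python
import streamlit as st

st.sidebar.title("WRI Nature-based Solutions Dashboard")
section = st.sidebar.selectbox(
    "Navigate to",
    ["Climate change impacts", "Potential gap/solution identification",
     "Graphical analysis", "Recommendation system", "Keyword analysis of partner organizations"],
)

if section == "Climate change impacts":
    st.header("Climate change impacts")
    year = st.slider("Year", 2016, 2019, 2019)
    # st.plotly_chart(...) would render the choropleth built for the selected year
elif section == "Potential gap/solution identification":
    st.header("Potential gap/solution identification")
    # st.plotly_chart(...) would render the treemap/sunburst of sentiment results
```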

 

Climate change impacts

A full overview of the WRI Climate Change Dashboard

 

Overview of the Dashboard

 

Climate Change Impacts

Dashboard Elements

The dashboard consists of five major sections of the results, where users can navigate across each section using the navigation pull-down menu on the left side-bar, and use other functionalities on the side-bar to select the content they would like to see. The following will describe the components in each of the sections.

 

climate change impacts

Changes in land cover

 

Choropleth Map View

Choropleth maps use a diverging color scale to represent change: for each country, the color encodes the magnitude and direction of climate change over time.

The analysis considers yearly data of country-level climate and landscape parameters, such as land type cover, temperature, and soil moisture, across the major platforms’ participating countries. Deforestation evaluation used the Hansen and MODIS Land Cover Type datasets. The temperature change analysis used the MODIS Land Surface Temperature dataset. And the NASA-USDA SMAP Global Soil Moisture dataset was used to assess land degradation. Each year’s changes in the climate parameters are computed compared to the earliest year available in the data. The calculated changes each year are plotted on the choropleth maps based on the predefined diverging color scale, and users can select the year to view using the slider above the map on the dashboard.
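A minimal Plotly sketch of such a map (the data file, column names, and color range are assumptions):

```python
import pandas as pd
import plotly.express as px

# One row per country and year: ISO-3 code plus the change in a climate parameter vs. the baseline year.
df = pd.read_csv("country_temperature_change.csv")     # hypothetical prepared dataset
year = 2019

fig = px.choropleth(
    df[df["year"] == year],
    locations="iso3",
    color="temp_change_c",
    color_continuous_scale="RdBu_r",   # diverging scale: red = warming, blue = cooling
    range_color=(-1.5, 1.5),
)
fig.show()
```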

 

climate change impacts

Changes in temperature

Take the change in temperature across participating countries as an example. The graph shows that the average yearly temperature in most South American countries and Central-Eastern African countries in 2019 decreased by around 0.25 to 1.3 °C compared to 2015. In contrast, there is an increase in the heat level of participating countries in northern Africa and Mexico, where the temperature in these countries has increased compared to 2015. Such a difference in temperature change can therefore be easily represented by the diverging color scale, where red represents an increase in heat and blue represents a decline.

 

Heat Map View

Heat maps represent the intensity of attention from the nature-based solution platforms and how each climate risk matches the NbS interventions across platforms. The two heat maps illustrate measurements of attention intensity from each NbS platform: the first shows document frequency, and the second a calculation of hazard-to-ecosystem match scores. Users can filter the visualization to their interest by using the checkbox on the sidebar and the pull-down menu in the top-left corner to select the corresponding NbS platform.

 

climate change impacts

Heatmap

 

As an example, the heatmap above shows the number of documents and websites related to climate impacts and the corresponding climate intervention strategies from the initiative 20×20 platform. Users can see that the land degradation problem has received the most attention from the platform, where restoration, reforestation, restorative farming, and agroforestry are the major climate intervention strategies that are correlated with the land degradation problem. Besides, the heatmap shows that the attention for the solutions for some climate risks such as wildfires, air and water pollution, disaster risk, bushfires, coastal erosion on the initiative 20×20 platform is relatively limited compared to other risks.

Apart from the heatmap itself, the dashboard design allows rooms for linking to external resources based on the information presented on the heatmap. Similar to the interactive tool in the Nature-based Solutions Evidence Platform by the University of Oxford where users can access the external cases by clicking on heatmaps, users can use the pull-down menus below the heatmap to browse the list of links and documents for each of the document numbers represented. For example, the attached figure shows the results when users select the restoration effort in response to land degradation on initiative 20×20, where users can read the brief descriptions of the page, the keywords and access the external site by clicking on the hyperlink.

 

Climate change impact

Website overview

 

Potential gap/solution identification

This section presents the results of our sentiment analysis models. The goal was to identify which projects, publications, and partners of the major NbS platforms were addressing potential gaps or solutions for climate change. A gap corresponds to a negative sentiment, meaning some negative impact related to climate change; a solution corresponds to a positive sentiment, implying a positive impact. The output of this sentiment analysis subtask was three hierarchical data frames, one each for the projects, publications, and partners of AFR100, Initiative20x20, and Cities4forests. To present these large data frames compactly, we used treemap and sunburst plots: treemap charts visualize hierarchical data using nested rectangles, and sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. The hierarchy groups results by platform, then by the countries within a platform, then by the projects associated with each country; clicking deeper shows the description and keywords for a project. The size of a rectangle or sector represents how confident the model is that there is a potential gap or solution.
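A minimal Plotly sketch of such a treemap (the data frame columns here are assumptions about how the hierarchy might be stored):

```python
import pandas as pd
import plotly.express as px

# Hypothetical hierarchical frame: platform -> country -> project, with sentiment label and confidence.
df = pd.read_csv("sentiment_projects.csv")

fig = px.treemap(
    df,
    path=["platform", "country", "project"],
    values="confidence",
    color="sentiment",
    color_discrete_map={"gap": "crimson", "solution": "seagreen", "neutral": "lightgray"},
)
fig.show()

# px.sunburst(df, path=[...], values=..., color=...) produces the radial variant of the same hierarchy.
```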

 

climate change impact

Gap/ solution potential

 

Graphical Analysis

This pull-down tab consists of the network analysis and knowledge graphs. Knowledge Graphs (KGs) represent raw information (in our case, texts from the NbS platforms) in a structured form, capturing relationships between entities.

In network analysis, concepts (nodes) are identified from the words in the text, and the edges between nodes represent relations between the concepts. The network helps one visualize the general structure of the underlying text in a compact way. In addition, latent relations between concepts that are not explicit in the text become visible. Visualizing texts as networks allows one to focus on important aspects of the text without reading large amounts of it. Visuals for the knowledge graphs and network analysis can be seen in the GIF above.

 

Knowledge-based Question-Answer System

Knowledge-based Question & Answering NLP system aims to answer questions in the context of text scraped data from the NbS platform and PDF documents available on the NbS platform websites. The system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via REST API and the Streamlit Dashboard.

Read more about the Q&A NLP system in this article.

 

Recommendation System

The recommendation system uses content-based filtering and collaborative filtering. Collaborative filtering uses the “wisdom of the crowd” to recommend items; our collaborative recommendations are based on indicators from World Bank data and keyword similarity using Facebook’s StarSpace model. In the dashboard, one can select multiple indicators for a platform and see platforms related to the selected one.

Content-based filtering recommendation is based on the description of an item and a profile of the user’s preference.

Content-based filtering guesses similar organizations, projects, news articles, blog articles, publications, etc. for a selected organization. The starspace model was used to get the word embeddings, and then a similarity analysis was done comparing the description of the selected organization and all the other organization’s data sets. Different Projects, Publications, News articles, etc. can be selected as options, using which related organizations can be recommended.

 

Keyword Analysis of Partner Organizations

This section includes an intuitive 3D t-SNE visualization of all keywords/topics in the 12,801 unique URLs inside 34 partner organization websites. The goal of each organization, as displayed in the hover label, was the output of topic modeling with Latent Dirichlet Allocation (LDA).

What is a t-SNE plot?

t-SNE is an algorithm for dimensionality reduction that is well-suited for visualizing high dimensional data. TSNE stands for t-distributed Stochastic Neighbor Embedding. The idea is to embed high-dimensional points in low dimensions in a way that respects similarities between points.

We obtained embeddings for each URL’s full text using the widely used Sentence Transformers library. These high-dimensional embeddings were fed into the t-SNE model, which output projections in three dimensions; these projections are shown in the interactive 3D visualization below.

Advantages of this visual?

There were 12,801 URLs under these 34 organizations, and going through all of them to figure out what each URL discusses would take a huge amount of time, as some websites had nearly 1M words in their About section alone. This visual helps anyone who wants to know what each organization discusses without manually going through those URLs’ descriptions.

Today, data visualization has become an essential part of the story: no longer a pleasant enhancement, but something that adds depth and perspective. In our case, geo-plots, heatmaps, network diagrams, treemaps, drop-down and filter elements, and 3D interactive plots guide the reader step by step through the narrative.

We have only explored a few of the many visuals developed by the Omdena data science enthusiasts. With the visual dashboard, we hope to give viewers a more robust connection to critical insights about Nature-based Solutions and their adaptation. The dashboard is portable and can be shared across the climate change community, driving engagement and sparking new ideas.
