Nature-based solutions (NbS) can help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes.
By Bala Priya, Simone Perazzoli, Nishrin Kachwala, Anju Mercian, Priya Krishnamoorthy, and Rosana de Oliveira Gomes
NbS harness nature to tackle environmental challenges that affect human society, such as climate change, water insecurity, pollution, and declining food production. NbS can also help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes through, for example, growing trees in rural areas to boost crop yields and lock water in the soil. Although many organizations have investigated the effectiveness of NbS, these solutions have not yet been analyzed to the full extent of their potential.
In order to analyze such NbS approaches in greater detail, World Resources Institute (WRI) partnered with Omdena to better understand how regional and global NbS, such as forest and landscape restoration, can be leveraged to address and reduce climate change impacts across the globe.
In an attempt to identify and analyze such approaches, we investigated three main platforms that bring organizations together to promote initiatives that restore forests, farms, and other landscapes and enhance tree cover to improve human well-being:
Initiative 20×20: goal to begin restoring or protecting 50 million hectares by 2030 across Latin America and the Caribbean
Cities4Forests: partner with leading cities around the world to protect and restore forests
Considering the aforementioned, the project goal is to assess the network of these three coalition websites through a systematic approach and to identify climate adaptation measures covered by these platforms and their partners.
The integral parts of the project’s workflow included building a scalable data collection pipeline to scrape the data from the platforms and partnering organizations, and several useful PDFs; leveraging several techniques from Natural Language Processing, such as building a Neural Machine Translation pipeline to translate non-English text to English, performing sentiment analysis for identifying potential gaps, experimenting with language models that were optimal for the given use cases, exploring various supervised and unsupervised topic modeling techniques to get meaningful insights and latent topics present in the voluminous text data collected, leveraging the novel Zero Shot Classification(ZSC) to identify the impacts and interventions, building a Knowledge-Based Question Answering(KBQA) system, and recommender system.
The platforms engaged in climate-risk mitigation were studied for several factors, including the climate risks in each region, as well as initiatives taken by the platforms and their partners, NbS employed for mitigating climate risks, the effectiveness of adaptations, goals, road map of the platform, among others. This information was gathered through:
a) Heavy scraping of Platform websites: It involved data scraping from all the platforms´ website pages using Python scripts. This process involved manual effort in customizing the scraping suitable for each page; accordingly, extending this model would involve some effort. Approximately 10MB of data was generated through this technique.
b) Light Scraping of Platform websites and partner organizations: it involved the obtention of the platforms sitemap. Once it was done, organization websites were crawled to obtain the text information. This method can be extended to other platforms with minimal effort. The volume of this data generated is around 21MB.
c) PDF Text Data Scraping of Platform and Other Sites: The platform websites presented several informative PDF documents (including reports and case studies), which were helpful for use in the downstream models, including the Q&A system, recommendation system, etc. This process was completely automated by the PDF text-scraping pipeline, which prepares a CSV of the PDF data and then generates a consolidated CSV file containing paragraph text from all the PDFs mentioned in the input CSV file. This pipeline can be incrementally used to generate the PDF text in batches. The NLP models utilized all of the PDF documents from the three platform websites, as well as some documents containing general information on NbS referred by WRI. Approximately 4MB of data was generated from the available PDFs.
a) Data Cleaning: an initial step comprising the removal of unnecessary text, text with length <50, as well as duplicates.
b) Language Detection and Translation: This step involved the development of a pipeline for language detection and translation to be applied to text data gathered from the 3 main sources described above.
Language Detection was performed by analyzing text using different deep learning pre-trained models such as langdetect and pycld3. Once the language is detected, the result is used as an input parameter for the translation pipeline. In this step, pre-trained multilingual models are downloaded from the Helsinki NLP repository available at HuggingFace.com (an NLP processing company). Text is tokenized and organized in batches to be sequentially fed into the pre-trained model. To enhance function performance, the pipeline was developed with GPU support, if available. Also, once a model is downloaded. it is cached into the program memory so it doesn’t need to be downloaded again.
The translation performed well with the majority of texts to which it was applied (most were in Spanish or Portuguese), being able to generate well-structured and coherent results, especially considering the scientific vocabulary of the original texts.
c) NLP preparation: This step was applied to the CSVs files generated through the scrap, after translation pipeline and was composed by Punctuation removal, Stemming, Lemmatization, Stop words removal, Part of Speech tagging (POS), Tagging, Chunking
Statistical Analysis was performed on the preprocessed data by exploring the role of climate change impacts, interventions, and ecosystems involved in the three platforms’ portfolios using two different approaches: zero-shot classification and cosine similarity.
1.) Zero Shot Classification: This model assigns probabilities to which user-defined labels a text would best fit. We applied a zero-shot classification model from Hugging Face to classify descriptions for a given set of keywords belonging to climate-change impacts, interventions, and ecosystems for each of the three platforms. For ZSC, it combined the heavy-scraped datasets into one CSV for each website. The scores computed by ZSC can be interpreted as probabilities that the class belongs to a particular description. As a rule, we only considered as relevant those scores at or above 0.85.
Let’s consider the following example:
Description: The Greater Amman Municipality has developed a strategy called Green Amman to ban the destruction of forests The strategy focuses on the sustainable consumption of legally sourced wood products that come from sustainably managed forests The Municipality sponsors the development of sustainable forest management to provide long term social economic and environmental benefits Additional benefits include improving the environmental credentials of the municipality and consolidating the GAMs environmental leadership nationally as well as improving the quality of life and ecosystem services for future generations.
Model Predictions: The model assigned the following probabilities based upon the foregoing description:
For the description above, we see that the Climate-Change-Impact prediction is ‘loss of vegetation’, Types-of- Intervention prediction is ‘management’ or ‘protection’, and the Ecosystems prediction is empty.
2.) Cosine Similarity: Cosine Similarity compares vectors created by keywords, generated through Hugging Face models…. (how these keywords were computed) and descriptions, and scores the similarity in direction of these vectors. We then plot the scores with respect to technical and financial partners and a set of keywords. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations. This approach was useful to validate the results of the ZSC approach.
Aligning these results, it was possible to answers the following questions:
What are the climate hazards and climate impacts most frequently mentioned by the NbS platforms’ portfolios?
What percentage of interventions/initiatives take place in highly climate-vulnerable countries or areas?
What ecosystem/system features most prominently in the platforms when referencing climate impacts?
This model was applied on descriptions from all three heavy-scraped websites, and compared cross-referenced results (such as Climate Change Impact vs Intervention, or Climate Change Impact vs Ecosystems, or Ecosystems vs Intervention) for all three websites. Further, we created plots based on country and partners (technical and financial) for all three websites.
Sentiment Analysis (SA) is the automatic generation of sentiment from text, utilizing both data mining and NLP. Here, SA is applied to identify potential gaps and solutions through the corpus text extracted from the three main platforms. In this Task, it implemented the following well consolidated unsupervised approaches: VADER, TextBlob, AFINN, FlairNLP, AdaptNLP Easy Sequence Classification. A new approach, Bert-clustering, was proposed by Omdena Team and it is based on Bert embedding of a positive/negative keywords list and computing distance(s) of these embedded descriptions to the corresponding cluster, were:
negative reference: words related to challenges and hazards, which give us a negative sentiment
positive reference: words related to NBS solutions, strategies, interventions, and adaptations outcomes, which give us a positive sentiment
For modeling purposes, the threshold values adopted are presented in table 1.
According to the scoring of the models presented in Table 2, AdaptNLP, Flair, and BERT/clustering approaches exhibited better performance compared to the lexicon-based models. Putting the limitations of unsupervised learning aside, BERT/clustering is a promising approach that could be improved for further scaling. SA can be a challenging task, since most algorithms for SA are trained on ordinary-language comments (such as from reviews and social media posts), while the corpus text from the platforms has a more specialized, technical, and formal vocabulary, which raises the need to develop a more personalized analysis, such as the BERT/clustering approach.
Across all organizations, it was observed that the content focuses on solutions rather than gaps. Overall, potential solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.
Topic Modeling is a method for automatically finding topics from a collection of documents that best represent the information within the collection. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together. The algorithms/ techniques that were explored for the project include Top2Vec, SBERT: and Latent Dirichlet Allocation (LDA) with Gensim and Spacy.
Top2Vec: For which word clouds of weighted sets of words best represented the information in the documents. The word cloud example from Topic Modeling shows that Topic is about deforestation in the Amazon and other countries in South America.
S-BERT: Identifies the top Topics in texts of projects noted from the three platforms. The top keywords that emerged from each dominant Topic were manually categorized, as shown in the table. The texts from projects refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, Infrastructure in Rural South America.
LDA: In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topics distribution within the documents and keywords distribution within the topics to obtain a good composition of the topic-keywords distribution. A t-SNE visualization of keywords/topics in the 10k+ unique URLS inside 34 Partner organization websites (partners of AFR100, I20x20, and Cities4Forests) is available on the app deployed via Streamlit and Heroku. The distance in the 3D space among points represents the closeness of keywords/topics in the URL. The color dot represents an organization, hovering over a point provides more information about the Topics referred to in the URL and more. One can further group the URLs by color grouping and analyze the data in greater depth. A t-SNE plot representing dominant keywords indicated from the three platforms’ Partner organization documents. Each color of the dot represents a partner organization.
Other NLP/ML techniques
Besides the aforementioned above, other techniques were also exploited in this project and will be presented in further articles, such as
Network Analysis presents interconnections among the platform, partners, and connected websites. A custom network crawler was created, along with heuristics such as prioritizing NbS organizations over commercial linkages (this can be tuned) and parsing approx. 700 organization links per site (this is another tunable parameter). We then ran the script with different combinations of source nodes (usually the bigger organizations like AFR100, INITIATIVE20x20 were selected as sources to achieve the required depth in the network). Based on these experiments, we derived a master set of irrelevant sites (such as social media, advertisements, site-protection providers, etc.) that are not crawled by our software.
Knowledge Graphs represent the information extracted from the text in the websites based on the relationships between them. A pipeline was built to extract the triplets based upon the subject / object relationship using StanfordNLP’ss OpenIE on the paragraph. Subjects and objects are represented by nodes, and relations by the paths (or “edges”) between them.
Recommendation Systems: The recommender systems application is built based upon the information extracted from the partner’s websites, with a goal to provide recommendations of possible solutions already available and implemented within the network of partners from WRI. The application allows a user to search for similarities across organizations (collaborative filtering) as well as similarities in the content of the solution (content-based filtering).
Question & Answer System: Our knowledge-based Question & Answer system answers questions in the domain context of the text scraped data from the PDF documents from the main platform websites, as well as a few domain-related PDF documents which contain the climate risks and NbS information, as well as the light-scraped data obtained from the platforms and their partner websites.
The KQnA system is based on Facebook’s Deep Passage Retrieval method which provides better context by generating vector embeddings. The RAG neural network(RAG) generates a specific answer for a given question conditioned on the retrieved documents. RAG gives the most of an answer from the shortlisted documents. The KQnA system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via REST API to the Streamlit UI.
The platform websites have many PDF documents containing extensive significant information that would take a lot of time for humans to process. The Q&A system is not a replacement for human study or analysis but helps ease such efforts by obtaining the preliminary information, linking the reader to the specific documents which have the most relevant answers. The same method was extended to light-scraped data, broadly covering the platform websites and their partner websites.
The PDF and light-scraped documents are stored on two different indices on Elasticsearch to run the query on the streams separately. Deep Passage Retrieval is laid on the Elasticsearch Retriever for contextual search, providing better answers. Filters of Elasticsearch can be applied on the platform/URL for the focused search on a particular platform or website. Elastic search 7.6.2 is installed on VM which is compatible with Deepset.ai Haystack. RAG is applied to the generated answers to get a specific answer. Climate risks, NbS solutions, local factors, and investment opportunities are queried on PDF data and Platform data. Questions on the platform for PDF data, URL for light scraped data can be performed for localized search.
By developing decision-support models and tools, we hope to make the NbS platforms’ climate change-related knowledge useful and accessible for partners of the initiative, including governments, civil society organizations, and investors at the local, regional, and national levels.
Any of these resources can be augmented with additional platform data, which would require customizing the data gathering effort per website. WRI could extend the keywords used in statistical analysis for hazards, the types of interventions, the types of ecosystems, and create guided models to gain further insights.
Data gathering pipeline
We have provided very useful utilities to collect and aggregate data and PDF content from websites. WRI can extend the web-scraping utility from the leading platforms and their partners to other platforms with some customization and minimal effort. Using the PDF utility, WRI can retrieve texts from any PDF files. The pre-trained multilingual model in the translation utility can translate the texts from various sources to any language.
Using zero-shot classification, predictions were made for the keywords that highlight Climate Hazards, Types of Interventions, and Ecosystems, based upon a selected threshold. Cosine similarity predicts the similarity of a document with regard to the keywords. Heat maps visualize both of these approaches. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations.
SA identifies potential gaps from negative connotations derived from words related to challenges and hazards. A tree diagram visualizes the sentiment analysis for publications/partners/projects documents from each platform. Across all organizations, the content focuses on solutions rather than gaps. Overall, solutions and possible solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.
Topic models are useful for identifying the main topics in documents. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together.
With semantic search with Top2Vec. For which word clouds of weighted sets of words best represented the information in the documents. The word cloud example from Topic Modeling shows that Topic is about deforestation in the Amazon and other countries in South America.
S-BERT: Identifies the top Topics in texts of projects noted from the three platforms. The top keywords that emerged from each dominant Topic were manually categorized, as shown in the table. The texts from projects refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, Infrastructure in Rural South America.
In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topics distribution within the documents and keywords distribution within the topics to obtain a good composition of the topic-keywords distribution.
A t-SNE visualization of keywords/topics in the 10k+ unique URLS inside 34 Partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available on the app deployed via Streamlit and Heroku.
The distance in the 3D space among points represents the closeness of keywords/topics in the URL
The color dot represents an organization, hovering over a point provides more information about the Topics referred to in the URL and more.
One can further group the URLs by color grouping and analyze the data in greater depth.
A t-SNE plot representing dominant keywords indicated from the three platforms’ Partner organization documents. Each color of the dot represents a partner organization.
This work has been part of a project with World Resources Insitute.
Although many organizations have investigated the effectiveness of nature-based solutions (NBS) to help people build thriving urban and rural landscapes, such solutions have not yet been analyzed to the full extent of their potential. With this in mind, World Resources Institute (WRI) partnered with Omdena to understand how regional and global NbS can be leveraged to address and reduce the impact of climate change.
Our objective was to understand how three major coalitions, all of which embrace the key NBS forest and landscape restoration, use their websites to build networks. We used a systematic approach and to identify the climate adaptation measures that these platforms and their partners feature on their websites.
As the first step, information from the three NbS platforms and their partners, and relevant documents were collected using a scalable data collection pipeline that the team built.
Why Topic Modeling?
Collecting all texts, documents, and reports by web scraping of the three platforms resulted in hundreds of documents and thousands of chunks of text. Given the huge volume of text data thus obtained, and due to the infeasibility of manually analyzing the large text dataset to understand and gain meaningful insights, we had leveraged the use of Topic Modeling– a powerful NLP technique to understand the impacts, NbS approaches involved and the various initiatives in the direction.
A topic is a collection of words that are representative of specific information in text form. In the context of Natural Language Processing, extracting latent topics that best describe the content of the text is described as Topic modeling.
Topic Modeling is effective for:
Discovering hidden patterns that are present across the collection of topics.
Annotating documents according to these topics.
Using these annotations to organize, search, and summarize texts.
It can also be thought of as a form of text mining to obtain recurring patterns of words in a corpus of text data.
The team experimented with topic modeling approaches that fall under unsupervised learning, semi-supervised learning, deep unsupervised learning, and matrix factorization. The team analyzed the effectiveness of the following algorithms in the context of the problem.
Topic Modeling using Sentence BERT (S-BERT)
Latent Dirichlet Allocation (LDA)
Non-negative Matrix Factorization (NMF)
Correlation Explanation (CorEx)
Top2Vec — is an unsupervised algorithm for topic modeling and semantic search. It automatically detects topics present in the text and generates jointly embedded topic, document, and word vectors
Data Sources used in this modeling approach are the data obtained from the heavy scraping of the platforms Initiative 20×20 and Cities4Forests, data from the light scraping pipeline, and combined data from all websites. Top2Vec performs well on reasonably large datasets;
There are three key steps taken by Top2Vec.
Transform documents to numeric representations
Clustering of documents to find topics.
The topics present in the text are visualized using word clouds. Here’s one such word cloud that talks about deforestation, loss of green cover in certain habitats and geographical regions.
The algorithm can output from which platform and which line in the document the topics were found. This can help identify the organizations working towards similar causes.
In this approach, we aim at deriving topics from clustered documents, using a class-based variant of Term Frequency- Inverse Document Frequency score (c-TF-IDF), which would allow extracting words that make each set of documents or class stand out as compared to the others
The intuition behind the method is as follows. When one applies TF-IDF as usual on a set of documents, one compares the importance of words between documents. For c-TF-IDF, one treats all documents in a single category (e.g., a cluster) as a single document and then applies TF-IDF. The result is a very long document per category and the resulting TF-IDF score would indicate the important words in a topic.
The S-BERT package extracts different embeddings based on the context of the word. Not only that, there are many pre-trained models available ready to be used. The number of top words that occur per Topic with its scores is shown below.
Latent Dirichlet Allocation (LDA)
Collecting all texts, documents, and reports by web scraping of the three platforms resulted in hundreds of documents and millions of chunks of text. Using Latent Dirichlet Allocation (LDA), a popular algorithm for extracting hidden topics from large volumes of text, we discovered topics covering NbS and Climate hazards underway at the NbS platforms.
LDA’s approach to topic modeling is that it considers each document to be a collection of various topics. And each topic as a collection of words with certain probability scores.
In practice, the topic structure, per-document topic distributions, and the per-document per-word topic assignments are latent and have to be inferred from observed documents.
Once the number of topics is fed to the algorithm, it will rearrange the topic distribution in documents and word distribution in topics until there is an optimal composition of the topic-word distribution.
LDA with Gensim and Spacy
As every algorithm has its pros and cons, Gensim is no different than all.
Pros of using Gensim LDA are:
Provision to use N-grams for language modeling instead of only considering unigrams.
pyLDAvis for visualization
Gensim LDA is a relatively more stable implementation of LDA
Two metrics for evaluating the quality of our results are the perplexity and coherence score
Topic Coherence measures score a single topic by measuring how semantically close the high scoring words of a topic are.
Perplexity is a measure of surprise, which measures how well the topics in a model match a set of held-out documents; If the held-out documents have a high probability of occurring, then the perplexity score will have a lower value. The statistic makes more sense when comparing it across different models with a varying number of topics. The model with the lowest perplexity is generally considered the “best”.
We choose the optimal number of topics, by plotting the number of topics against the coherence scores they yield and choose the one that maximizes the coherence score. On the other hand, if the number of seen repetitions of words is high in the final results, we should choose lower values for the number of topics regardless of the lower coherence score.
One of the major causes that can help to provide better final evaluations for Gensim is the mallet library. Mallet library is an efficient implementation of LDA. It runs faster and gives better topics separation.
Visualizing the topics using pyLDAvis gives a global view of the topics and how they differ in terms of inter-topic distance. While at the same time allowing for a more in-depth inspection of the most relevant words that occur in individual topics. The size of the bubble is proportional to the prevalence of the topic. Better models have relatively large, well-separated bubbles spread out amongst the quadrants. When hovering over the topic bubble, its most dominant words appear on the right in a histogram.
Using the 10k+ unique URLS inside 34 Partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests), documents were scraped, and topics were extracted with Python’s LDA Gensim package. For visualizing complex data in three dimensions, we used scikit-learn’s t-SNE with Plotly.
Below is a visual of the Partner organization’s 3D projection for which the topic distributions were grouped manually. The distance in the 3D space among points represents the closeness of keywords/topics in the URL. The color dot represents an organization. Hovering over a point provides more information about the Topics referred to in the URL. One can further group the URLs by color and analyze the data in greater depth.
Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization is an unsupervised learning algorithm.
It takes in the Term-Document Matrix of the text corpus and decomposes into the Document-Topic matrix and the Topic-Term matrix that quantifies how relevant the topics are in each document in the corpus and how vital each term is to a particular topic.
We use the rows of the resulting Term-Topic Matrix to get a specified number of topics. NMF is known to capture diverse topics in a text corpus and is especially useful in identifying latent topics that are not explicitly discernible from the text documents. Here’s an example of the topic word clouds generated on the light scraped data. When we would like the topics to be within a specific subset of interest or contextually more informative, we may use semi-supervised topic modeling techniques such as Guided LDA (or Seeded LDA) and CorEx(Correlation Explanation) models.
Guided Latent Dirichlet Allocation (Guided LDA)
Guided LDA is a semi-supervised topic modeling technique that takes in certain seed words per topic, and guides the topics to converge in the specified direction.
When we would like to get contextually relevant topics such as climate change impacts, mitigation strategies, and initiatives, setting a few prominent seed keywords per topic enables us to obtain topics that help understand the content of the text in the directions of interest.
For example, in the data from the platform cities4forests, the following are some of the seed words that were used, to get topics containing the most relevant keywords.
CorEx is a discriminative topic model. It estimates the probability a document belongs to a topic given the content of that document’s words and can be used for discovering themes from a collection of documents, then further analysis such as clustering, searching, or organizing the collection of themes to gain insights.
The Total Correlation (TC) is a measure that CorEx maximizes when constructing the topic model. CorEx starts its algorithm with the random initialization, and so different runs can result in different topic models. A way of finding the best topic model is to run the CorEx algorithm several times and take the run that has the highest TC value (i.e. the run that produces topics that are most informative about the documents). The topic’s underlying meaning is often interpreted by individuals building the models, and are given a name or category to reflect the topic’s understanding. This interpretation is a subjective exercise. Using anchor keywords domain-specific topics (NbS and climate change in our case) can be integrated into the CorEx model alleviating some interpretability concerns. The TC measure for the model with and without anchor words is below. The anchored models showing a better performance.
After hyperparameter tuning the anchored model with anchor strength, anchor words, number of topics, making several runs for the best model, cleaning up of duplicates, the top topic is shown below.
plants animals, socio bosque, sierra del divisor, national parks, restoration degraded, water nutrients, provide economic, restoration project
An Interpretation => National parks in Peru and Ecuador, which were significantly losing hectares to deforestation, are in restoration by an Initiative 20×20-affiliated project. This project also protects the local economy and endangered animal species.
From the analysis of various topic modeling approaches, we summarize the following.
Compared to other topic modeling algorithms Top2vec is easy to use and the algorithm leverages joint document and word semantic embedding to find topic vectors, and does not require the text pre-processing steps of stemming, lemmatization, or stop words removal.
For Latent Dirichlet Allocation, the necessary text pre-processing steps are needed to obtain optimal results. As the algorithms do not use contextual embeddings, it’s not possible to account for semantic relationships completely even when considering n-gram models.
Topic Modeling is thus effective in gaining insights about latent topics in a collection of documents, which in our case was domain-specific, concerning documents from platforms addressing climate change impacts.
Limitations of topic modeling include the requirement of a lot of relevant data and consistent structure to be able to form clusters and the need for domain expertise to interpret the relevance of the results. Discriminative models with domain-specific anchor keywords such as CorEx can help in topic interpretability.
Applying various data science tools and methods to visualize climate change impacts.
By Nishrin Kachwala, Debaditya Shome, and Oscar Chan
Day by day, as we generate exponentially more data, we also sift through its complexity and consume more. Filtering out relevancy is essential to get to the gist of the data in front of us. It is a well-known fact that the human brain absorbs a picture 60,000 times faster than texts. And that about 65% of humans are visually inclined.
To tell a climate-change-related data’s story beyond analysis and investigation, we needed to analyze trends and support decision making. Visualizing the information is necessary for practical data science — to explore the data, preprocess it, tune the model to the data, and ultimately to gain insights to take action.
No data story is complete without the inclusion of great visuals.
Understanding the impact of Nature-based solutions on climate change
The World Resources Institute (WRI) sought to understand the regional and global landscape of Nature-based Solutions (NbS).
How are some NbS platforms addressing climate hazards?
What type of NbS solutions are adapted?
What barriers and opportunities exist, etc.
The focus was initially on three platforms, AFR100, Cities4Forests, and Initiative20x20, and later scale the work to more platforms.
More than 30 Omdena AI engineers worked on this NLP problem to derive several actionable insights, develop a recommendation and Knowledge-based Q&A system to query the data from the NbS platforms, and extract sentiments from the data to find potential gaps. Topic Modeling was applied to derive dominant topics from the data, Website Network analysis of organizations, and statistical analysis helped to explore the involvement of `climate change impacts, ‘interventions’ and ‘ecosystems’ for the three platforms.
Using Streamlit, we built a highly interactive shareable web application (dashboard) to zoom into NLP results for actionable insights on Nature-based solutions. The Streamlit app was deployed to the web using Heroku. A major advantage of using Streamlit is that it allows developers to build a sophisticated dashboard with multiple elements, such as Plotly graph objects, tables, and interactive controlling objects, with Python scripts instead of additional HTML codes for further layout definition. This allows the incorporation of multiple project outputs on the same dashboard swiftly with minimal codes.
Overview of the Dashboard
The dashboard consists of five major sections of the results, where users can navigate across each section using the navigation pull-down menu on the left side-bar, and use other functionalities on the side-bar to select the content they would like to see. The following will describe the components in each of the sections.
Choropleth Map View
Choropleth maps use colors on a diverging scale to represent a changed situation. A diverging color scale for countries represents the magnitude of climate change over time.
The analysis considers yearly data of country-level climate and landscape parameters, such as land type cover, temperature, and soil moisture, across the major platforms’ participating countries. Deforestation evaluation used the Hansen and MODIS Land Cover Type datasets. The temperature change analysis used the MODIS Land Surface Temperature dataset. And the NASA-USDA SMAP Global Soil Moisture dataset was used to assess land degradation. Each year’s changes in the climate parameters are computed compared to the earliest year available in the data. The calculated changes each year are plotted on the choropleth maps based on the predefined diverging color scale, and users can select the year to view using the slider above the map on the dashboard.
Take the change in temperature across participating countries as an example. The graph shows that the average yearly temperature in most South American countries and Central-Eastern African countries in 2019 decreased by around 0.25 to 1.3 °C compared to 2015. In contrast, there is an increase in the heat level of participating countries in northern Africa and Mexico, where the temperature in these countries has increased compared to 2015. Such a difference in temperature change can therefore be easily represented by the diverging color scale, where red represents an increase in heat and blue represents a decline.
Heat Map View
Heat maps represent the intensity of attention from the nature-based solution platforms and how each of the climate risks matches with the NbS intervention across platforms. The two heat maps illustrate measurements of attention intensity from each NbS platform. The first is a document frequency and the second a calculation of hazard to ecosystem match scores. Users can filter their data visualization of interest using the checkbox on the sidebar, the pull-down menu on the top-left corner, and selecting the corresponding NbS platform.
As an example, the heatmap above shows the number of documents and websites related to climate impacts and the corresponding climate intervention strategies from the initiative 20×20 platform. Users can see that the land degradation problem has received the most attention from the platform, where restoration, reforestation, restorative farming, and agroforestry are the major climate intervention strategies that are correlated with the land degradation problem. Besides, the heatmap shows that the attention for the solutions for some climate risks such as wildfires, air and water pollution, disaster risk, bushfires, coastal erosion on the initiative 20×20 platform is relatively limited compared to other risks.
Apart from the heatmap itself, the dashboard design allows rooms for linking to external resources based on the information presented on the heatmap. Similar to the interactive tool in the Nature-based Solutions Evidence Platform by the University of Oxford where users can access the external cases by clicking on heatmaps, users can use the pull-down menus below the heatmap to browse the list of links and documents for each of the document numbers represented. For example, the attached figure shows the results when users select the restoration effort in response to land degradation on initiative 20×20, where users can read the brief descriptions of the page, the keywords and access the external site by clicking on the hyperlink.
Potential gap/solution identification
This section presents the results of our Sentiment analysis models. The goal was to identify which Projects / Publications / Partners of the major NbS platforms were addressing Potential Gaps or solutions for climate change. A Gap is a negative sentiment, which means it has some negative impact on climate change. Similarly, a solution is a positive sentiment, which implies that it has a positive impact on climate change. The output of this sentiment analysis subtask were three Hierarchical data frames, each on Projects, Publications, and Partners of AFR100, Initiative20x20, and Cities4forests. To present these huge data frames in a compact manner, we used Treemap and sunburst plots. Treemap charts visualize hierarchical data using nested rectangles. Sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. The hierarchical grouping has been done based on the three platforms and then showing inside a platform which countries are there, and then the projects associated with them, and then if you click deeper, it shows the description and keywords for that project. The size of a rectangular box / Sector represents how much certain that there’s a potential gap/solution.
This pull-down tab consists of the network analysis and knowledge graphs. Knowledge Graphs (KGs) represent raw information(in our case texts from NbS platforms) in a structured form, capturing relationships between entities.
In Network analysis, concepts(nodes) are identified from the words in the text and the edges between the nodes represent relations between the concepts. The network can help one visualize the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible, which are not explicit in the text. Visualizing texts as networks allow one to focus on important aspects of the text without reading large amounts of the texts. Visuals for Knowledge graphs and Network Analysis can be seen in the GIF above.
Knowledge-based Question-Answer System
Knowledge-based Question & Answering NLP system aims to answer questions in the context of text scraped data from the NbS platform and PDF documents available on the NbS platform websites. The system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via REST API and the Streamlit Dashboard.
The recommendation system uses content-based filtering or collaborative filtering. Collaborative Filtering uses the “wisdom of the crowd” to recommend items. Our collaborative recommendations are based on indicators from World bank data and keyword similarity using the Starspace model by Facebook. In the dashboard, one can select multiple indicators for a platform and platforms related to the selected one
Content-based filtering recommendation is based on the description of an item and a profile of the user’s preference.
Content-based filtering guesses similar organizations, projects, news articles, blog articles, publications, etc. for a selected organization. The starspace model was used to get the word embeddings, and then a similarity analysis was done comparing the description of the selected organization and all the other organization’s data sets. Different Projects, Publications, News articles, etc. can be selected as options, using which related organizations can be recommended.
Keyword Analysis of Partner Organizations
This section includes an intuitive 3D t-SNE visualization of all keywords/topics in the 12801 unique URLS inside 34 Partner organization websites. The goal of each organization as displayed in the hover label was the output from Topic modeling with Latent Dirichlet Allocation (LDA).
What is a t-SNE plot?
t-SNE is an algorithm for dimensionality reduction that is well-suited for visualizing high dimensional data. TSNE stands for t-distributed Stochastic Neighbor Embedding. The idea is to embed high-dimensional points in low dimensions in a way that respects similarities between points.
We got the embeddings for every URL’s entire texts using the widely known Sentence Transformer by HuggingFace. These high dimensional embeddings were used as input to the t-SNE model which gave output projections in 3 dimensions. These projections are seen below in the interactive 3D visualization.
Advantages of this visual?
There were 12801 URLs under these 34 organizations, going through all of them and figuring out what each URL talks about would take a huge amount of time, as some websites themselves had nearly 1M words in their About section. This visual can be of help for anyone who wants to know what’s being discussed by each organization without having to manually go through those URL’s descriptions.
Today, data visualization has become an essential part of the story, no longer a pleasant enhancement but adding depth and perspective to a story. For our case, Geo-plots, heatmaps, network diagrams, Treemaps, drop down and filter elements, 3D interactive plots guide the reader step-by-step through the narrative.
We have only explored a few visuals from the multitude available and developed by the Omdena Data Science enthusiasts. With the Visual Dashboard we hope to provide a more robust connection between critical insights about Nature-based Solutions and their adaptation to the viewer. The dashboard is portable and can be shared amongst the climate change community, driving engagement, and birthing new ideas.
How to create a powerful NLP Q&A system in 8 weeks, that resolves queries on a domain-specific knowledge base?
Authors: Aruna Sri T., Tanmay Laud
The above snapshot is a classic example of how our Q&A system works. Users can ask questions pertaining to climate hazards (or their solutions) and the model will carve out relevant answers from the collected Knowledge Base.
The model suggests ‘Mangrove forests’, ‘improved river-flow regulation’, and ‘stone bunds’ to tackle floods.
What’s more? We also pinpoint the context that was used to generate the answer along with a hyperlink for those who want to go into more depth (panaroma.solutionsin this case).
Let’s explore why we built this system in the first place ?
Why Q&A Systems for Nature-based Solutions?
World Resources Institute (WRI) seeks to understand how nature-based solutions (NbS) like forest and landscape restoration can minimize the impacts of climate change on local communities.
Nature-based Solutions (NbS) are a powerful ally to address societal challenges, such as climate change, biodiversity loss, and food security. As the world strives to emerge from the current pandemic and move towards the UN Sustainable Development Goals, it is imperative that future investments in nature reach their potential by contributing to the health and well-being of people and the planet.
There is a growing interest from governments, business, and civil society in the use of nature for simultaneous benefits to biodiversity and human well-being.
The websites of the initiatives that advocate for NbS are full of information like annual reports, climate risks of the region, impact in the region, how the adaptations are helping in mitigating climate effects, challenges faced, local factors, socio-economic conditions, employment, investment opportunities, government policies, community involvement, inter-platform efforts, etc, which takes a lot of manual effort for studying. The QnA system helps to ease this effort in getting the preliminary information on the focus areas and the reader can then go to specific documents that have relevant answers.
Thus, our system is not a replacement for the human study or analysis, but a tool that highlights the focal points of key information by giving quick and relevant answers from the curated Knowledge Base.
The tool shall provide Climate Adaptation researchers an ability to search a massive quantity of textual information and analyze the impact of NbS on society, ecosystems. Broadly, it answers questions such as
How are regional/global platforms addressing climate change impacts?
What is the current state of landscapes, barriers, and opportunities?
When and how are the NbS being implemented and in which regions?
But, how did we do it? Let us deep-dive into the details of our project.
Data Collection and Preprocessing
The focus of this project was to study the following Coalitions that bring together organizations to promote forest and landscape restoration, enhancing human well-being:
• AFR100 (the African Forest Landscape Restoration Initiative to place 100 million hectares of land into restoration by 2030)
• Initiative 20×20 (the Latin American and Carribean initiative to protect and restore 50 million hectares of land by 2030)
• Cities4Forests (leading cities partnering to combat climate change, protect watersheds and biodiversity, and improve human well-being)
In this article, we shall focus on the PDF text scraping pipeline and SOTA Knowledge-Based Question Answers system.
Deep-dive into PDF text scraping
The websites on our radar have a good number of PDF documents that contain a plethora of information like annual reports, case studies, and research (insights) which is useful to policymakers and domain experts to understand the latest trends across the world. This text extraction process is automated by using GROBID.
GROBID (or GeneRation Of BIbliographic Data) is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. TIE XML is the industry standard for document content without presentation part. We downloaded and ran the open-source server on our local system. In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author names, affiliation types, detailed address, journal, volume, issue, etc.) to full-text structures (section title, paragraph, reference markers, head/foot notes, figure headers, etc.)
The steps are as follows:
The pipeline takes a list of PDF URLs and generates a consolidated CSV file containing paragraph text from all the PDF documents along with metadata such as the Source system, the download link to PDF, Title of paragraphs, etc.
All the PDF documents in the three platform websites and some documents which have generic information on nature-based solutions are included
The documents provided are then converted to TIE XML format using GROBID service.
These TIE XML documents are further scraped for headers and paragraphs using python’s Beautiful Soup parser.
This utility replaces the manual effort of extracting paragraph text from PDF documents which can be used for further analysis by downstream NLP models.
A Knowledge-Based Question Answering System
The knowledge-based Question & Answering (KQnA) NLP system aims to answers the questions in the domain context on the text scraped data from PDF publications and Partner websites.
The KQnA system is based on Facebook’s Dense Passage Retrieval (DPR) method. Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. DPR embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy.
Retrieval-Augmented Generation (RAG) model combines the powers of pre-trained dense retrieval (DPR) and sequence-to-sequence models. RAG model retrieves documents, passes them to a seq2seq model, then marginalizes them to generate outputs. The retriever and seq2seq modules are initialized from pre-trained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks. It is based on the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
Building the above system from scratch has its own set of challenges as the process is time-consuming and prone to bugs. Hence, we utilized an open-source framework called haystack provided by deepset.ai for our QnA pipeline. Let us go through our custom model pipeline in detail.
The model is powered by a Retriever-Reader pipeline in order to optimize for both speed and accuracy.
Elastic Search on VM was used as a Document store for the storage of documents. The PDF and Website text are stored on two different indices on Elastic search to run on-demand queries on the two streams separately. Filters of Elastic search can be applied on the platform/Url for the focussed search on a particular platform or website. Elastic search 7.6.2 was installed on VM which is compatible with Haystack (as of December 2020).
Readers are powerful models that do a close analysis of documents and perform the core task of question answering.
We used the FARM Reader, which makes Transfer Learning with BERT & Co simple, fast, and enterprise-ready. It’s built upon transformers and provides additional features to simplify the life of developers: Parallelized preprocessing, highly modular design, multi-task learning, experiment tracking, easy debugging, and close integration with AWS SageMaker. With FARM you can build fast proofs-of-concept for tasks like text classification, NER, or question answering and transfer them easily into production.
The Retriever assists the Reader by acting as a lightweight filter that reduces the number of documents that the Reader has to process. It does this by:
Scanning through all documents in the database
Quickly identifying the relevant and dismissing the irrelevant
Passing on only a small candidate set of documents to the Reader
RAG is applied to the generated answers to get a specific answer.
The complete pipeline is as follows:
The model is hosted on Omdena AWS server and exposed via a REST API. We used Streamlit to produce a fast and scalable dashboard.
We started this article with an example question around floods but the target search was on a website (panaroma.solutions). The above animation presents a similar question, but this time, the search is performed on PDFs. We get an answer “moving flood protection infrastructures” and “levees and dams”.
A more refined search such as “What is the best way to tackle floods?” returns the following:
We observe that the system does not answer “levees and dams” since those are not nature-based solutions. Instead, we get documents that talk about ‘ecosystem-based coastal defenses’
Let us look at a few more examples:
“What is the impact of NbS ?” — search all websites
“What is the impact of NbS ?” — search on climatefocus.com
Here, we can understand the general notion around NbS and compare that to content from specific websites such as climatefocus.com. It allows us to contrast the work being done by different websites and how they approach the “NbS”. (In comparison to the earlier query for all websites, now it is run for a single website)
NOTE: The questions posted above are just samples and do not directly indicate an outcome from the Omdena project. The work is a Proof-Of-Concept and requires fine-tuning and analysis by domain experts before productionizing the system.
The KQnA system is found to give pertinent answers to the questions based on context. Its performance is dependent on two factors:
The quality of the data ingested
The framing of the question ( It may require a few attempts to ask the question in the right way in order to get the answer you are looking for)
The system can be enhanced by adding more data from PDF documents and websites to the database. Further, if a labeled dataset for climate change is available in the future, then the model can be fine-tuned so that it can better predict the documents and extract the right answers from the database in the context of Climate Change and/or Nature-Based Solutions.
The Artificial Intelligence Imperative: Early Detection of Wildfires
Founded in 2016, Sintecsys is a young and growing Brazilian company with a passion for its customers, communities, and the climate. Its real-time fire outbreak detection and management system protect forests and agricultural land in 7 Brazilian states and 4 biomes from fire to help companies reduce their asset losses, protect local people from smoke-related respiratory illnesses, and reduce CO2 emissions that lead to global warming.
The company was in the midst of preparing for its next stage of growth and wanted to enhance its solution’s detection capability with artificial intelligence to scale its deployment and customer base.
Innovation Through Collaboration
Early and accurate detection of fires was a core system capability and therefore a key-value feature requiring a highly-accurate artificial intelligence model to detect wildfires. The upgraded system’s speed to market was also critical from a competitive and financial perspective, so the development process needed to run quickly, efficiently, and meet deadlines. Finally, the Syntecsys internal technical team was in a state of transition as part of its overall company transformation, so effective team communication or coordination throughout the project was also a key success factor.
To best address these challenges and the exploratory nature of the project, Omdena used its 8-week Challenge approach and assembled a diverse 47-member team of skilled and motivated artificial intelligence expert collaborators from 22 countries. Together with the Sintecsys team, the collaborators first determined the problem to solve and deliverable: build a model that can detect smoke from wildfires outbreaks using daytime camera footage from Sintecsys towers, with nighttime detection being saved for a later development phase. The full team researched potential modeling techniques and organized themselves into several task groups, each led by a task manager and focused on testing a promising approach like mobile net, semantic segmentation, and Convolutional Neural Networks. About 20 collaborators also formed a task group to create two training datasets by labeling nearly 9,000 images that could be used by any of these modeling techniques.
The team had to solve two key data challenges affecting the model’s ability to accurately identify smoke. First, they used the label smoothing techniques to help classify real smoke and ‘smoke-like’ anomalies such as camera glare, fog, clouds, and smoke released from boilers. Second, the team applied an upsampling technique to help the model overcome the low quality of many of the images. After multiple rounds of model testing and validation, the final model was able to detect smoke images an impressive 95% to 97% of the time with a false positive rate of only 10% to 33%.
The team used bottom-up collaboration to create a supportive and open environment that optimized innovation and speed. Team members organized themselves into groups that took leadership roles based on their qualifications and learning goals. Activities and communication were guided by an Agile approach with daily discussions between team members to explore multiple techniques simultaneously. Weekly calls were held with the Sintecsys team to report progress and solicit feedback to ensure a ‘best-fit’ solution.
AI’s Impact On Profit and Purpose
Omdena’s artificial intelligence solution contributes to an increase in the overall value of its fire detection and management system to customers which also amplifies its positive impact on people and the planet.
Integration of the automated, high-quality artificial intelligence model helps reduce detection time drastically while resulting in 90% fewer lost acres of crops and trees leading to lower CO2 emissions. Less smoke in the air means better respiratory health for farmers and local inhabitants. The model’s low false-positive rate reduces false alarms, translating into less time and money spent by customers, farmers, and fire brigades to investigate potential fires. A prototype system has been launched with one client. Full deployment to all clients is planned for end-2020/early-2021.
Sintecsys also plans to replicate this model in other countries and investigate with Omdena potential enhancements like detecting fire in nighttime images, incorporating satellite imagery, and predicting the most likely areas where fires will start.
Says Osmar Bambini, Sintecsys Head of Innovation, “Speed, accuracy, and power sum up my perception of Omdena. For Sintecsys, from now on Omdena is our official artificial intelligence partner.”
By Gijs van den Dool, Galina Naydenova, andAnn Chia
In an 8-week project, 50 technology changemakers from Omdena embarked on a mission to find needles in an online haystack. The project proved that using Natural Language Processing (NLP) can be very efficient to point to where these needles are hiding, especially when there are (legal) language barriers, and different interpretations between countries, regions, governmental institutes.
The World Resource Institute (WRI) identified the problem and asked Omdena to help solve it. The project was hosted on Omdena´s platform to create a better understanding of the current situation regarding enabling policies through NLP techniques like topic analysis. Policies are one of the tools decision-makers can use to improve the environment, but often it is not known which policies and incentives are in place, and which department is responsible for the implementation.
Understanding the effect of the policies involves reading and topic analysis of thousands of pages of documentation (legislation) across multiple sectors. It is precisely in this area where Natural Language Processing (NLP) can help, and assist, in the processing of policy documents, highlighting the essential documents and parts of documents, and identifying which areas are under/over-represented. A process like this will also promote the knowledge sharing between stakeholders, and enable rapid identification of incentives, disincentives, perverse incentives, and misalignment between policies.
This project aimed to identify economic incentives for forest and landscape restoration using an automated approach, helping (for a start) policymakers in Mexico, Peru, Chile, Guatemala, and El Salvador to make data-driven choices that positively shape their environment.
The project focused on three objectives:
Identifying which policies relate to forest and landscape restoration using topic analysis
Detecting the financial and economic incentives in the policies via topic analysis
Creating visualization which clearly shows the relevance of policies to forest and landscape restoration
This was achieved through the following pipeline, demonstrated through Figure 1 below:
The Natural Language Processing (NLP) Pipeline
The web scraping process consisted of two approaches: the scraping of official policy databases, and Google Scraping. This allowed the retrieval of virtually all official policy documents from the five listed countries roughly between 2016 and 2020. The scraping results were then filtered further by relevance to landscape restoration, and the final text metadata of each entry was then stored on PySQL. Thus, we were able to build a comprehensive database of policy documents for use further down the pipeline.
Text preprocessing converted the retrieved documents from a human-readable form to a computer-readable form. Namely, policy documents were converted from pdf to txt, with text contents tokenized, lemmatized, and further processed for use in the subsequent NLP models.
NLP modeling involved the use of Sentence-BERT (SBERT) and LDA topic analysis. SBERT was used to build a search engine that parses policy documents and highlights relevant text segments that match the given input search query. The LDA model was used for topic analysis, which will be the focus of this economic policies analysis article.
Finally, the web scraping results, SBERT search engine, and in the future, the LDA model outputs would be combined and the results presented into an interactive web app, allowing greater accessibility to the non-technical audience.
Applications for Natural Language Processing
All countries are creating policies, plans, or incentives, to manage land use and the environment and are part of the decision making process. Governments are responsible for controlling the effects of human activities on the environment, particularly those measures that are designed to prevent or reduce harmful effects of human activities on ecosystems, and do not have an unacceptable impact on humans. This policy-making can result in the creation of thousands of documents. The idea is to extract the economic incentives for forest and landscape restoration from the available (online) policy documents to get a better understanding of what kind of topics are addressed in these policies via topic analysis.
We developed a two-step approach to solving this problem: the first step selects the documents that are most closely related to reforestation in a general sense, and the second step points out the segments of those documents stating economic incentives. To mark which policies are relating to forest and landscape restoration we use a scoring technique (SBERT), to find the similarity between the search statement and sentences in a document, and a Topic Modelling technique (LDA), to pick out the parts in a document to create a better understanding of what kind of topics are addressed in these policies.
Analyzing the Policy Fragments with Sentence-BERT (SBERT)
To analyze all the available documents, and to identify which policies relate to Forest and Landscape Restoration, the documents are broken down into manageable parts and translated to one common language.
How can we compare different documents written in different languages and using specific words in each language?
The Multilingual Universal Sentence Encoder (MUSE) is one of the few algorithms specially designed to solve this problem. The model is simultaneously trained on a question answering task, (translation ranking task), and a natural language inference task (determining the logical relationship between two sentences). The translation task allows the model to map 16 languages (including Spanish and English) into a common space; this is a key feature that allowed us to apply it to our Spanish corpus.
The modules in this project are trained on the Spanish language, and due to the modular nature of the infrastructure this language can be easily switched back to the native language (English) in SBERT, subsequently, this project is working with a database of policy documents in Spanish but will work with any language base (Figure 2).
Analyzing the Policy Landscape
Collecting all available online policies, by web scraping, in a country can result in a database of thousands of documents, and millions of text fragments, all contributing to the policy landscape in the country or region.
When we are faced with thousands of potentially important documents, where do we start from?
We have several options to solve this problem, for example, we can select a couple of documents and start from there. Of course, we can read the abstract if one such exists, but in real life, we may not be that lucky.
Another approach is using the bag-of-words algorithm; this is a simple technique that counts the frequency of the words in a text, allowing to deduce the content of the text from the highest-ranking words. (In this project we used CountVectorizer from sklearn to get the document-term matrix), which can then be displayed in a word cloud (using Wordcloud), for an easy, one-look summary of the document, like the one below.
This way we can get a quick answer to the question “What is the document about?”.
However, faced with thousands of documents, it is impractical to do word clouds for them individually. This is where topic modeling comes in handy. Topic Modeling is a technique to extract the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling.
The LDA model is a topic classification model developed by Prof. Andrew Ng et al. of Stanford University’s NLP Laboratory. It is a generative model for text and other forms of discrete data that generalizes and improves upon previous models of the past, such as Bayes, unigram, and N-gram models.
Here’s how it works: Consider a corpus that comprises a collection of M documents, and each document formed by a selection of words (w1 w2, …, wi, …, wn). Additionally, each word belongs to one of the topics in the collection of topics (z1, z2, …, zi, …, zk). By estimating machine-learning weighted parameters, the per-document topic distributions, the per-document word distributions, and the topic distribution for a document, we can calculate the probabilities to which certain words are associated with certain topics, characterizing the topics and word distributions. Then, we can generate a distribution of words for each topic.
The LDA package outputs models with different values of the number of topics (k), each giving a measure of topic coherence value, a rough guide of how good a given topic model is.
In this case, we picked up the one that gives the highest coherence value, without giving too many or too few topics, that would mean either not being granular enough, or difficult to interpret. ‘K’=12 marks the point of a rapid increase of topic coherence; a usual sign of meaningful and interpretable topics.
For each topic, we have a list of the highest-frequency words constructing the topic, and we can see some overarching themes appearing. Naming the topic is the next step, with the explicit caveat that setting the topic name is massively subjective, and the assistance of a subject matter expert is advisable. The knowledge of topics, and the keywords, is necessary because the topic should reflect the different aspects of the issues within the study or problem. For example, forest restoration can be seen as operating in the intersection of the following themes, defined by the LDA. Below is an example of a model with 12 topics, which happened to be the one with the most coherence, and the subjectively determined Topic Labels (Table 1).
We can see that one of the topics, “Forestry and Resources”, reflects closely the topics we are interested in, so the documents within it may be of particular relevance. The example document we saw before, “Sembrando Vida”, was assigned topic 8: “Development”, which is what it is expected from a document outlining the details of a broad incentive program. Some of the topics (e.g. Environmental, Agriculture) are related to the narrow topic of interest, whereas others (e.g. Food Production) are more on the periphery, and documents with this topic can be put aside for the time being. Thus topic modeling allows sifting the wheat from the chaff and zooming straight into more relevant documents.
The challenge of LDA is how to extract good quality topics that are clear, segregated, and meaningful. This depends heavily on the quality of text preprocessing and the strategy of finding the optimal number of topics, as well as the subject knowledge. Being familiar with the context and themes, as well as with different types of documents, is essential for this. Followed up with data visualizations, and further processing, like comparison, identifying conflicts between ministries, change of theme over time, zooming into the document, etc.
The LDA process results in a table of topics defined by user-generated tags, and this table can be used to create a heat map (showing the frequency of the mentioning of a topic by country) and used for further evaluation of how, for example, the policies are differentiating between topics and regions; this process is illustrated in Figure 5.
Based on this, the following visualization is generated (Figure 5). The horizontal axis contains the different topic labels in Table 1, while the vertical axis lists three countries: Mexico, Peru, and Chile. The heat map gives us insights into the different levels of categorical policy present in the three countries; for instance, a territorial-related policy is widely prevalent in Mexico, but not adopted widely in Chile or Peru.
This allows policymakers to observe the decisions made by other countries and how it compares to their local administration, enabling them to make better-informed choices in domestic policy that are supported by data-driven evidence.
A valuable further development of topic analysis is to display policies (y) topics by the originator (ministries, etc.) to identify possible overlap and conflicts and to display change of topics in legislation and shifting focus over time. Going further into the documents, LDA can also be used to map out the topics in the different paragraphs, shifting the specific from generic information and identifying paragraphs of particular relevance. By zooming into specific documents, and then into specific document paragraphs, LDA is an efficient and flexible solution when faced with a huge volume of unclassified documents.
Conclusion: Topic Analysis for Policies
Finding needles in an online haystack is possible, especially with the help of the tools discussed, starting from a collection of web scraped documents, going through a data engineering process to clean up the found documents, and using the Latent Dirichlet Allocation (LDA) method to structure the documents, and fragments, by topics.
The data view by topic is a powerful way to see directly where what kind of policy is most dominant, and this information can be used to refine the search further or assist the policy-makers in defining the most efficient use of policies to create an environment where new policies are contributing to Forest and Landscape Restoration.
In the visualization space, possible enhancements include identifying overlaps and conflicts between government entities, highlighting the active policy areas, and displaying financial incentive information and projections.
In summary, the use of LDA is a promising way to navigate through complex environmental legislation systems and to retrieve relevant information from a vast compilation of legal text, from different sources, in multiple languages, and in quality.