Understanding Nature-Based Solutions through Natural Language Processing and Machine Learning


Nature-based solutions (NbS) can help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes.

By Bala Priya, Simone Perazzoli, Nishrin Kachwala, Anju Mercian, Priya Krishnamoorthy, and Rosana de Oliveira Gomes

 

Why

NbS harness nature to tackle environmental challenges that affect human society, such as climate change, water insecurity, pollution, and declining food production. NbS can also help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes through, for example, growing trees in rural areas to boost crop yields and lock water in the soil. Although many organizations have investigated the effectiveness of NbS, these solutions have not yet been analyzed to the full extent of their potential.

In order to analyze such NbS approaches in greater detail, World Resources Institute (WRI) partnered with Omdena to better understand how regional and global NbS, such as forest and landscape restoration, can be leveraged to address and reduce climate change impacts across the globe.

In an attempt to identify and analyze such approaches, we investigated three main platforms that bring organizations together to promote initiatives that restore forests, farms, and other landscapes and enhance tree cover to improve human well-being:

  • AFR100
  • Initiative 20×20
  • Cities4Forests

 

With this in mind, the project goal was to systematically assess the network formed by these three coalition websites and to identify the climate adaptation measures covered by the platforms and their partners.

The integral parts of the project’s workflow included:

  • building a scalable data collection pipeline to scrape data from the platforms, their partner organizations, and several useful PDFs;
  • building a Neural Machine Translation pipeline to translate non-English text into English;
  • performing sentiment analysis to identify potential gaps;
  • experimenting with language models best suited to the given use cases;
  • exploring supervised and unsupervised topic modeling techniques to surface meaningful insights and the latent topics present in the voluminous text data collected;
  • applying the novel Zero-Shot Classification (ZSC) approach to identify impacts and interventions;
  • building a Knowledge-Based Question Answering (KBQA) system and a recommender system.

 

Project workflow


 

Data collection

 


 

The platforms engaged in climate-risk mitigation were studied along several dimensions, including the climate risks in each region, the initiatives taken by the platforms and their partners, the NbS employed for mitigating climate risks, the effectiveness of adaptations, and each platform’s goals and road map. This information was gathered through:

a) Heavy scraping of platform websites: this involved scraping data from all of the platforms’ website pages using Python scripts. The process required manual effort to customize the scraping for each page; accordingly, extending this approach would involve some effort. Approximately 10MB of data was generated through this technique.

b) Light scraping of platform websites and partner organizations: this involved obtaining each platform’s sitemap and then crawling the organization websites to collect their text content. This method can be extended to other platforms with minimal effort. Around 21MB of data was generated this way.
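As an illustration, here is a minimal sketch of sitemap-based light scraping with requests and BeautifulSoup; the sitemap URL is a placeholder, and the real pipeline adds error handling, politeness delays, and storage.

```python
import requests
from bs4 import BeautifulSoup  # the "xml" parser below requires lxml

SITEMAP_URL = "https://example-platform.org/sitemap.xml"  # placeholder URL

# 1. Fetch the sitemap and collect the page URLs it lists
sitemap = BeautifulSoup(requests.get(SITEMAP_URL, timeout=30).text, "xml")
page_urls = [loc.text for loc in sitemap.find_all("loc")]

# 2. Crawl each page and keep only the visible paragraph text
records = []
for url in page_urls:
    html = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    text = " ".join(p.get_text(" ", strip=True) for p in html.find_all("p"))
    records.append({"url": url, "text": text})
```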

c) PDF text data scraping of platform and other sites: the platform websites provided several informative PDF documents (including reports and case studies) that were useful for downstream models such as the Q&A and recommendation systems. This process was fully automated by the PDF text-scraping pipeline, which takes a CSV listing the PDFs as input and generates a consolidated CSV file containing the paragraph text from all of them. The pipeline can be run incrementally to generate the PDF text in batches. The NLP models used all of the PDF documents from the three platform websites, as well as some documents with general information on NbS referenced by WRI. Approximately 4MB of data was generated from the available PDFs.
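A simplified sketch of such a PDF-to-CSV step, using pypdf and pandas; the file names and the paragraph-splitting rule are illustrative, while the project’s pipeline reads the list of PDFs from an input CSV and processes them in batches.

```python
import pandas as pd
from pypdf import PdfReader

pdf_paths = ["nbs_report.pdf", "case_study.pdf"]  # illustrative file names

rows = []
for path in pdf_paths:
    reader = PdfReader(path)
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        # Split the page text into rough paragraphs on blank lines
        for para in (p.strip() for p in text.split("\n\n")):
            if len(para) > 50:  # drop very short fragments, as in the cleaning step
                rows.append({"file": path, "page": page_no, "paragraph": para})

pd.DataFrame(rows).to_csv("pdf_paragraphs.csv", index=False)
```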

 

Data preprocessing

a) Data cleaning: an initial step comprising the removal of unnecessary text, texts with length < 50, and duplicates.

b) Language Detection and Translation: This step involved the development of a pipeline for language detection and translation to be applied to text data gathered from the 3 main sources described above.

Language detection was performed with pre-trained libraries such as langdetect and pycld3. Once the language is detected, the result is used as an input parameter for the translation pipeline. In this step, pre-trained multilingual models are downloaded from the Helsinki-NLP repository on the Hugging Face model hub. Text is tokenized and organized into batches that are fed sequentially into the pre-trained model. To improve performance, the pipeline was developed with GPU support where available. Also, once a model is downloaded, it is cached in memory so it does not need to be downloaded again.

The translation performed well with the majority of texts to which it was applied (most were in Spanish or Portuguese), being able to generate well-structured and coherent results, especially considering the scientific vocabulary of the original texts.
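A minimal sketch of the detection-plus-translation step, assuming langdetect for detection and a Helsinki-NLP Marian checkpoint from the Hugging Face hub; the language pair and batch handling are simplified compared with the project pipeline.

```python
import torch
from langdetect import detect
from transformers import MarianMTModel, MarianTokenizer

def translate_to_english(texts):
    lang = detect(texts[0])                          # e.g. "es" or "pt"
    if lang == "en":
        return texts
    model_name = f"Helsinki-NLP/opus-mt-{lang}-en"   # downloaded once, then cached
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU support if available
    model.to(device)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_to_english(["Las soluciones basadas en la naturaleza ayudan a restaurar los bosques."]))
```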

c) NLP preparation: this step was applied to the CSV files generated through scraping, after the translation pipeline, and comprised punctuation removal, stemming, lemmatization, stop-word removal, part-of-speech (POS) tagging, and chunking.
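A minimal sketch of this preparation step with NLTK; the exact steps, their order, and the chunking grammar in the project pipeline may differ.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # simple noun-phrase chunking

def prepare(text):
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t not in stop_words]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    stems = [stemmer.stem(t) for t in tokens]
    pos_tags = nltk.pos_tag(tokens)                                   # POS tagging
    chunks = chunker.parse(pos_tags)                                  # chunking
    return lemmas, stems, pos_tags, chunks
```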

 

Data modeling

Statistical analysis 

Statistical Analysis was performed on the preprocessed data by exploring the role of climate change impacts, interventions, and ecosystems involved in the three platforms’ portfolios using two different approaches: zero-shot classification and cosine similarity.

1.) Zero-Shot Classification: this model assigns probabilities indicating which user-defined labels a text fits best. We applied a zero-shot classification model from Hugging Face to classify descriptions against a given set of keywords for climate-change impacts, interventions, and ecosystems for each of the three platforms. For ZSC, we combined the heavy-scraped datasets into one CSV per website. The scores computed by ZSC can be interpreted as the probability that a class applies to a particular description. As a rule, we only considered scores at or above 0.85 to be relevant.

Let’s consider the following example:

Description: The Greater Amman Municipality has developed a strategy called Green Amman to ban the destruction of forests The strategy focuses on the sustainable consumption of legally sourced wood products that come from sustainably managed forests The Municipality sponsors the development of sustainable forest management to provide long term social economic and environmental benefits Additional benefits include improving the environmental credentials of the municipality and consolidating the GAMs environmental leadership nationally as well as improving the quality of life and ecosystem services for future generations.

Model Predictions: The model assigned the following probabilities based upon the foregoing description:

  • Climate Change Impact Predictions: ‘loss of vegetation’: 0.89 , ‘deforestation’: 0.35, ‘GHG emissions’: 0.23, ‘rapid growth’ : 0.20, ‘loss of biodiversity’: 0.15 … (21 additional labels)
  • Types of Interventions Predictions: ‘management’: 0.92 , ‘protection’: 0.90 , ‘afforestation’: 0.65 , ‘enhance urban biodiversity’: 0.49, ‘Reforestation’: 0.38 … (16 additional labels)
  • Ecosystems: ‘Temperate forests’: 0.66, ‘Mediterranean shrubs and Forests’: 0.62, ‘Created forest’: 0.57, ‘Tropical and subtropical forests’: 0.55 … (13 additional labels)

 

For the description above, we see that the Climate-Change-Impact prediction is ‘loss of vegetation’, the Types-of-Intervention prediction is ‘management’ or ‘protection’, and the Ecosystems prediction is empty, since no ecosystem label reached the 0.85 threshold.
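For reference, a minimal sketch of how such scores can be produced with the Hugging Face transformers zero-shot pipeline; the checkpoint and the short label list are illustrative, not the project’s exact configuration.

```python
from transformers import pipeline

# Any NLI-based checkpoint works; facebook/bart-large-mnli is a common default
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

description = ("The Greater Amman Municipality has developed a strategy called Green Amman "
               "to ban the destruction of forests and promote sustainable forest management.")
impact_labels = ["loss of vegetation", "deforestation", "GHG emissions", "loss of biodiversity"]

# multi_label=True scores every label independently (older transformers releases call it multi_class)
result = classifier(description, candidate_labels=impact_labels, multi_label=True)

# Keep only the labels whose score clears the 0.85 relevance threshold
relevant = [(label, round(score, 2))
            for label, score in zip(result["labels"], result["scores"]) if score >= 0.85]
print(relevant)
```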

2.) Cosine Similarity: cosine similarity compares vectors created from keywords and descriptions (both embedded with Hugging Face models) and scores how closely the directions of these vectors align. We then plot the scores with respect to technical and financial partners and a set of keywords. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations. This approach was useful for validating the results of the ZSC approach.
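A minimal sketch of this keyword-to-description scoring, assuming sentence-transformers embeddings; the checkpoint, keywords, and descriptions are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # checkpoint is an assumption

keywords = ["drought", "flooding", "deforestation", "soil erosion"]
descriptions = [
    "The initiative restores degraded farmland to reduce soil loss.",
    "Partners plant mangroves to buffer coastal flooding.",
]

kw_vecs = model.encode(keywords)        # shape: (n_keywords, embedding_dim)
doc_vecs = model.encode(descriptions)   # shape: (n_descriptions, embedding_dim)

# Rows: descriptions, columns: keywords; higher = stronger association
scores = cosine_similarity(doc_vecs, kw_vecs)
print(scores.round(2))
```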

Aligning these results, it was possible to answer the following questions:

  • What are the climate hazards and climate impacts most frequently mentioned by the NbS platforms’ portfolios?
  • What percentage of interventions/initiatives take place in highly climate-vulnerable countries or areas?
  • What ecosystem/system features most prominently in the platforms when referencing climate impacts?

 

This model was applied to descriptions from all three heavy-scraped websites, and we compared cross-referenced results (such as Climate Change Impact vs Intervention, Climate Change Impact vs Ecosystems, or Ecosystems vs Intervention) across all three websites. Further, we created plots by country and by partner (technical and financial) for all three websites.

 

Sentiment analysis

Sentiment Analysis (SA) is the automatic extraction of sentiment from text, utilizing both data mining and NLP. Here, SA is applied to identify potential gaps and solutions in the corpus text extracted from the three main platforms. For this task, we implemented the following well-established unsupervised approaches: VADER, TextBlob, AFINN, FlairNLP, and AdaptNLP Easy Sequence Classification. A new approach, BERT-clustering, was proposed by the Omdena team; it embeds a list of positive/negative keywords with BERT and computes the distance of each embedded description to the corresponding cluster (a minimal sketch of this idea follows the list below), where:

  • negative reference: words related to challenges and hazards, which give us a negative sentiment
  • positive reference: words related to NBS solutions, strategies, interventions, and adaptations outcomes, which give us a positive sentiment
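A minimal sketch of the BERT/clustering idea, assuming sentence-transformers for the embeddings; the reference keyword lists and the nearest-centre decision rule are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # checkpoint is an assumption

negative_ref = ["drought", "deforestation", "soil degradation", "water scarcity"]      # gaps/hazards
positive_ref = ["restoration", "agroforestry", "reforestation", "climate adaptation"]  # solutions

# Cluster centres = mean embedding of each reference keyword list
neg_centre = model.encode(negative_ref).mean(axis=0, keepdims=True)
pos_centre = model.encode(positive_ref).mean(axis=0, keepdims=True)

def sentiment(description):
    vec = model.encode([description])
    pos_sim = cosine_similarity(vec, pos_centre)[0, 0]
    neg_sim = cosine_similarity(vec, neg_centre)[0, 0]
    # Closer to the positive cluster => solution-oriented text, otherwise a potential gap
    return "positive" if pos_sim >= neg_sim else "negative"

print(sentiment("The project restores degraded farmland through agroforestry."))
```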

For modeling purposes, the threshold values adopted are presented in table 1.

Table 1: Threshold values adopted by each sentiment model

 

According to the scoring of the models presented in Table 2, AdaptNLP, Flair, and BERT/clustering approaches exhibited better performance compared to the lexicon-based models. Putting the limitations of unsupervised learning aside, BERT/clustering is a promising approach that could be improved for further scaling. SA can be a challenging task, since most algorithms for SA are trained on ordinary-language comments (such as from reviews and social media posts), while the corpus text from the platforms has a more specialized, technical, and formal vocabulary, which raises the need to develop a more personalized analysis, such as the BERT/clustering approach.

Table 2: Scores of the sentiment analysis models

 

Across all organizations, it was observed that the content focuses on solutions rather than gaps. Overall, potential solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.

 

Topic modeling

Topic modeling is a method for automatically finding the topics that best represent the information in a collection of documents. This provides high-level summaries of an extensive collection, allows searching for records of interest, and groups similar documents together. The algorithms/techniques explored for the project include Top2Vec, SBERT, and Latent Dirichlet Allocation (LDA) with Gensim and spaCy.

  • Top2Vec: produces word clouds of weighted sets of words that best represent the information in the documents. The word cloud example below shows a topic about deforestation in the Amazon and other countries in South America.

Word cloud generated when a search was performed for the word “deforestation”.

 

  • S-BERT: identifies the top topics in the project texts drawn from the three platforms. The top keywords that emerged from each dominant topic were manually categorized, as shown in the table. The project texts refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, and Infrastructure in Rural South America.
  • LDA: in LDA topic modeling, you provide the algorithm with the number of topics, and it then adjusts the topic distribution within the documents and the keyword distribution within the topics to obtain a good topic-keyword composition (a minimal Gensim sketch follows this list). A t-SNE visualization of the keywords/topics in the 10k+ unique URLs inside 34 partner-organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available in the app deployed via Streamlit and Heroku. The distance between points in the 3D space represents the closeness of the keywords/topics in the URLs. Each dot’s color represents an organization, and hovering over a point provides more information about the topics referred to in that URL. One can further group the URLs by color and analyze the data in greater depth. The t-SNE plot represents the dominant keywords indicated in the three platforms’ partner-organization documents; each dot color represents a partner organization.
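The minimal Gensim sketch referenced above; the tokenized documents and the number of topics are illustrative.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["forest", "restoration", "farmers", "agroforestry"],
    ["deforestation", "amazon", "policy", "incentive"],
    ["urban", "trees", "city", "cooling"],
]  # already-tokenized documents (illustrative)

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# The number of topics is chosen up front; the model then fits the per-document
# topic distribution and the per-topic keyword distribution
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```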

 

Other NLP/ML techniques

Besides the techniques described above, other techniques were also explored in this project and will be presented in further articles, such as:

Network Analysis presents interconnections among the platform, partners, and connected websites. A custom network crawler was created, along with heuristics such as prioritizing NbS organizations over commercial linkages (this can be tuned) and parsing approx. 700 organization links per site (this is another tunable parameter). We then ran the script with different combinations of source nodes (usually the bigger organizations like AFR100, INITIATIVE20x20 were selected as sources to achieve the required depth in the network). Based on these experiments, we derived a master set of irrelevant sites (such as social media, advertisements, site-protection providers, etc.) that are not crawled by our software.

Knowledge Graphs represent the information extracted from the website text based on the relationships within it. A pipeline was built to extract subject/object triplets using StanfordNLP’s OpenIE on each paragraph. Subjects and objects are represented by nodes, and relations by the paths (or “edges”) between them.
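As a rough illustration of the graph-building step (not of StanfordNLP’s extraction itself), here is a sketch that takes already-extracted (subject, relation, object) triplets and assembles them with networkx; the triplets are made up for illustration.

```python
import networkx as nx

# Illustrative (subject, relation, object) triplets, e.g. from an OpenIE-style extractor
triplets = [
    ("AFR100", "partners with", "local governments"),
    ("local governments", "implement", "restoration projects"),
    ("restoration projects", "reduce", "land degradation"),
]

graph = nx.DiGraph()
for subj, rel, obj in triplets:
    graph.add_edge(subj, obj, relation=rel)   # nodes = entities, edge attribute = relation

for source, target, data in graph.edges(data=True):
    print(f"{source} --[{data['relation']}]--> {target}")
```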

Recommendation Systems: the recommender system application is built on the information extracted from the partners’ websites, with the goal of providing recommendations of possible solutions already available and implemented within WRI’s network of partners. The application allows a user to search for similarities across organizations (collaborative filtering) as well as similarities in the content of the solutions (content-based filtering).

Question & Answer System: our knowledge-based Question & Answer system answers questions in the domain context of the text scraped from the PDF documents on the main platform websites, a few domain-related PDF documents covering climate risks and NbS, and the light-scraped data obtained from the platforms and their partner websites.

The KBQA system is based on Facebook’s Dense Passage Retrieval (DPR) method, which provides better context by generating vector embeddings. The Retrieval-Augmented Generation (RAG) model then generates a specific answer for a given question, conditioned on the retrieved documents. RAG derives the answer from the shortlisted documents. The KBQA system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via a REST API from the Streamlit UI.

The platform websites host many PDF documents containing extensive and significant information that would take a long time for humans to process. The Q&A system is not a replacement for human study or analysis, but it eases such efforts by surfacing preliminary information and linking the reader to the specific documents containing the most relevant answers. The same method was extended to the light-scraped data, broadly covering the platform websites and their partner websites.

The PDF and light-scraped documents are stored in two different Elasticsearch indices so queries can be run on the two streams separately. Dense Passage Retrieval is layered on top of the Elasticsearch retriever for contextual search, providing better answers. Elasticsearch filters can be applied on the platform/URL for a focused search on a particular platform or website. Elasticsearch 7.6.2 is installed on the VM, which is compatible with Deepset.ai Haystack. RAG is applied to the retrieved passages to produce a specific answer. Climate risks, NbS solutions, local factors, and investment opportunities are queried on the PDF data and platform data. Queries can be filtered by platform for PDF data, or by URL for light-scraped data, for a localized search.
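A minimal sketch of how such a retriever-plus-generator stack can be wired together with Haystack; it assumes the Haystack 1.x API and standard DPR/RAG checkpoints, and is not the project’s exact configuration (class and parameter names vary across Haystack versions).

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever, RAGenerator
from haystack.pipelines import GenerativeQAPipeline

# Separate indices can be used for the PDF data and the light-scraped data
document_store = ElasticsearchDocumentStore(host="localhost", port=9200, index="pdf_docs")

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
document_store.update_embeddings(retriever)   # compute and index the DPR embeddings

generator = RAGenerator(model_name_or_path="facebook/rag-token-nq")
pipeline = GenerativeQAPipeline(generator=generator, retriever=retriever)

result = pipeline.run(query="Which NbS address drought risk?",
                      params={"Retriever": {"top_k": 5}})
print(result["answers"][0].answer)
```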

 

Insights

By developing decision-support models and tools, we hope to make the NbS platforms’ climate change-related knowledge useful and accessible for partners of the initiative, including governments, civil society organizations, and investors at the local, regional, and national levels.

Any of these resources can be augmented with additional platform data, which would require customizing the data gathering effort per website. WRI could extend the keywords used in statistical analysis for hazards, the types of interventions, the types of ecosystems, and create guided models to gain further insights.

 

Data gathering pipeline

We have provided useful utilities to collect and aggregate data and PDF content from websites. WRI can extend the web-scraping utility from the leading platforms and their partners to other platforms with some customization and minimal effort. Using the PDF utility, WRI can retrieve text from any PDF files. The pre-trained multilingual models in the translation utility can translate texts from various source languages into English.

 

Statistical analysis

Using zero-shot classification, predictions were made for the keywords that highlight Climate Hazards, Types of Interventions, and Ecosystems, based upon a selected threshold. Cosine similarity predicts the similarity of a document with regard to the keywords. Heat maps visualize both of these approaches. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations.

 

Sentiment analysis

SA identifies potential gaps from negative connotations derived from words related to challenges and hazards. A tree diagram visualizes the sentiment analysis for publications/partners/projects documents from each platform. Across all organizations, the content focuses on solutions rather than gaps. Overall, solutions and possible solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.

 

Topic Models

Topic models are useful for identifying the main topics in documents. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together.

  • Top2Vec: enables semantic search, producing word clouds of weighted sets of words that best represent the information in the documents. The word cloud example shows a topic about deforestation in the Amazon and other countries in South America.
  • S-BERT: identifies the top topics in the project texts drawn from the three platforms. The top keywords that emerged from each dominant topic were manually categorized, as shown in the table. The project texts refer to Forestry, Restoration, Reservation, Grasslands, Rural Agriculture, Farm owners, Agroforestry, Conservation, and Infrastructure in Rural South America.
  • LDA: once you provide the algorithm with the number of topics, it adjusts the topic distribution within the documents and the keyword distribution within the topics to obtain a good topic-keyword composition.
  • A t-SNE visualization of the keywords/topics in the 10k+ unique URLs inside 34 partner-organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available in the app deployed via Streamlit and Heroku.
  • The distance between points in the 3D space represents the closeness of the keywords/topics in the URLs.
  • Each dot’s color represents an organization; hovering over a point provides more information about the topics referred to in that URL.
  • One can further group the URLs by color and analyze the data in greater depth.
  • The t-SNE plot represents the dominant keywords indicated in the three platforms’ partner-organization documents; each dot color represents a partner organization.

 

This work has been part of a project with the World Resources Institute.

Topic Analysis to Identify and Classify Environmental Policies in LATAM


By Gijs van den Dool, Galina Naydenova, and Ann Chia

 

In an 8-week project, 50 technology changemakers from Omdena embarked on a mission to find needles in an online haystack. The project proved that Natural Language Processing (NLP) can be very efficient at pointing to where these needles are hiding, especially when there are (legal) language barriers and differing interpretations between countries, regions, and governmental institutions.

 

 

Introduction

The World Resources Institute (WRI) identified the problem and asked Omdena to help solve it. The project was hosted on Omdena’s platform to create a better understanding of the current situation regarding enabling policies through NLP techniques like topic analysis. Policies are one of the tools decision-makers can use to improve the environment, but it is often not known which policies and incentives are in place, and which department is responsible for their implementation.

Understanding the effect of the policies involves reading and topic analysis of thousands of pages of documentation (legislation) across multiple sectors. This is precisely where Natural Language Processing (NLP) can assist in the processing of policy documents, highlighting the essential documents and passages and identifying which areas are under- or over-represented. A process like this also promotes knowledge sharing among stakeholders and enables rapid identification of incentives, disincentives, perverse incentives, and misalignment between policies.

 

Problem Statement

This project aimed to identify economic incentives for forest and landscape restoration using an automated approach, helping (for a start) policymakers in Mexico, Peru, Chile, Guatemala, and El Salvador to make data-driven choices that positively shape their environment.

The project focused on three objectives:

  • Identifying which policies relate to forest and landscape restoration using topic analysis
  • Detecting the financial and economic incentives in the policies via topic analysis
  • Creating visualizations that clearly show the relevance of policies to forest and landscape restoration

This was achieved through the following pipeline, demonstrated through Figure 1 below:

 

Figure 1: NLP Pipeline

 

The Natural Language Processing (NLP) Pipeline

The web scraping process consisted of two approaches: the scraping of official policy databases, and Google Scraping. This allowed the retrieval of virtually all official policy documents from the five listed countries roughly between 2016 and 2020. The scraping results were then filtered further by relevance to landscape restoration, and the final text metadata of each entry was then stored on PySQL. Thus, we were able to build a comprehensive database of policy documents for use further down the pipeline.

Text preprocessing converted the retrieved documents from a human-readable form to a computer-readable form. Namely, policy documents were converted from pdf to txt, with text contents tokenized, lemmatized, and further processed for use in the subsequent NLP models.

NLP modeling involved the use of Sentence-BERT (SBERT) and LDA topic analysis. SBERT was used to build a search engine that parses policy documents and highlights relevant text segments that match the given input search query. The LDA model was used for topic analysis, which will be the focus of this economic policies analysis article.

Finally, the web scraping results, the SBERT search engine, and, in the future, the LDA model outputs will be combined and presented in an interactive web app, allowing greater accessibility for a non-technical audience.

 

 

Applications for Natural Language Processing

All countries create policies, plans, or incentives to manage land use and the environment as part of the decision-making process. Governments are responsible for controlling the effects of human activities on the environment, particularly through measures designed to prevent or reduce harmful effects on ecosystems without having an unacceptable impact on humans. This policy-making can result in thousands of documents. The idea is to extract the economic incentives for forest and landscape restoration from the available (online) policy documents and, via topic analysis, get a better understanding of what kinds of topics are addressed in these policies.

We developed a two-step approach to solving this problem: the first step selects the documents most closely related to reforestation in a general sense, and the second step points out the segments of those documents stating economic incentives. To mark which policies relate to forest and landscape restoration, we use a scoring technique (SBERT) to find the similarity between the search statement and sentences in a document, and a topic modeling technique (LDA) to pick out the parts of a document and build a better understanding of what kinds of topics are addressed in these policies.

 

 

Analyzing the Policy Fragments with Sentence-BERT (SBERT)

To analyze all the available documents, and to identify which policies relate to Forest and Landscape Restoration, the documents are broken down into manageable parts and translated to one common language.

How can we compare different documents written in different languages and using specific words in each language?

The Multilingual Universal Sentence Encoder (MUSE) is one of the few algorithms specially designed to solve this problem. The model is simultaneously trained on a question-answering task, a translation-ranking task, and a natural language inference task (determining the logical relationship between two sentences). The translation task allows the model to map 16 languages (including Spanish and English) into a common space; this is the key feature that allowed us to apply it to our Spanish corpus.

The modules in this project are applied to Spanish-language text, and due to the modular nature of the infrastructure the language can easily be switched back to SBERT’s native language (English); consequently, this project works with a database of policy documents in Spanish, but the approach will work with any language base (Figure 2).

 

 

Figure 2: Visualisation of SBERT model, in Spanish.
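A minimal sketch of this kind of cross-lingual scoring with sentence-transformers; the multilingual checkpoint (a distilled variant of mUSE) and the example texts are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model mapping several languages (incl. Spanish and English) into one space
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

query = "economic incentives for forest restoration"
fragments = [
    "Incentivos económicos para la restauración de bosques degradados.",
    "El presupuesto anual del ministerio de educación.",
]

query_vec = model.encode(query, convert_to_tensor=True)
frag_vecs = model.encode(fragments, convert_to_tensor=True)

# Cosine scores: higher = fragment is closer to the search statement
print(util.cos_sim(query_vec, frag_vecs))
```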

 

 

Analyzing the Policy Landscape

Collecting all available online policies, by web scraping, in a country can result in a database of thousands of documents, and millions of text fragments, all contributing to the policy landscape in the country or region.

When we are faced with thousands of potentially important documents, where do we start from?

We have several options to solve this problem, for example, we can select a couple of documents and start from there. Of course, we can read the abstract if one such exists, but in real life, we may not be that lucky.

Another approach is the bag-of-words algorithm: a simple technique that counts the frequency of the words in a text, allowing us to deduce the content of the text from the highest-ranking words. In this project we used CountVectorizer from sklearn to get the document-term matrix, which can then be displayed in a word cloud (using the wordcloud package) for an easy, one-look summary of the document, like the one below.

This way we can get a quick answer to the question “What is the document about?”.
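A minimal sketch of this bag-of-words summary, combining sklearn’s CountVectorizer with the wordcloud package; the document text is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

documents = ["Forest restoration program with economic incentives for smallholder farmers ..."]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)            # document-term matrix

# Word -> frequency mapping for the first (and only) document
freqs = dict(zip(vectorizer.get_feature_names_out(), dtm.toarray()[0]))

cloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freqs)
plt.imshow(cloud)
plt.axis("off")
plt.show()
```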

However, faced with thousands of documents, it is impractical to do word clouds for them individually. This is where topic modeling comes in handy. Topic Modeling is a technique to extract the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling.

The LDA model is a topic model developed by David Blei, Andrew Ng, and Michael Jordan. It is a generative model for text and other forms of discrete data that generalizes and improves upon earlier models, such as naive Bayes, unigram, and n-gram models.

Here’s how it works: consider a corpus comprising a collection of M documents, each formed by a selection of words (w1, w2, …, wi, …, wn). Additionally, each word belongs to one of the topics in the collection of topics (z1, z2, …, zi, …, zk). By estimating the model parameters, the per-document topic distributions and the per-topic word distributions, we can calculate the probability that certain words are associated with certain topics, characterizing the topics. We can then generate a distribution of words for each topic.

The LDA package can fit models with different values for the number of topics (k), each producing a topic coherence value, a rough guide to how good a given topic model is.

 

Figure 4: Coherence score vs. Number of topics

 

In this case, we picked the model that gives the highest coherence value without producing too many or too few topics, which would mean either not being granular enough or being difficult to interpret. k = 12 marks the point of a rapid increase in topic coherence, a usual sign of meaningful and interpretable topics.
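A sketch of this model-selection loop with Gensim’s CoherenceModel; the tokenized fragments and the range of k are illustrative.

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

tokenized_docs = [
    ["forest", "restoration", "incentive", "policy"],
    ["agriculture", "subsidy", "rural", "development"],
    ["water", "management", "irrigation", "law"],
]  # illustrative tokenized policy fragments

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

coherence_by_k = {}
for k in range(2, 16):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=5, random_state=42)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    coherence_by_k[k] = cm.get_coherence()

best_k = max(coherence_by_k, key=coherence_by_k.get)   # k = 12 on the project's corpus
print(coherence_by_k, best_k)
```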

For each topic, we have a list of the highest-frequency words constructing the topic, and we can see some overarching themes appearing. Naming the topic is the next step, with the explicit caveat that setting the topic name is highly subjective, and the assistance of a subject-matter expert is advisable. Knowledge of the topics and keywords is necessary because the topic should reflect the different aspects of the issues within the study or problem. For example, forest restoration can be seen as operating at the intersection of the themes defined by the LDA. Below is an example of a model with 12 topics, which happened to be the one with the highest coherence, together with the subjectively determined topic labels (Table 1).

 

Table 1. Topic labels (12) and their respective keywords in the selected LDA model

 

We can see that one of the topics, “Forestry and Resources”, reflects closely the topics we are interested in, so the documents within it may be of particular relevance. The example document we saw before, “Sembrando Vida”, was assigned topic 8, “Development”, which is what is expected from a document outlining the details of a broad incentive program. Some of the topics (e.g. Environmental, Agriculture) are related to the narrow topic of interest, whereas others (e.g. Food Production) are more on the periphery, and documents with those topics can be put aside for the time being. Thus topic modeling allows sifting the wheat from the chaff and zooming straight into the more relevant documents.

The challenge of LDA is how to extract good-quality topics that are clear, segregated, and meaningful. This depends heavily on the quality of text preprocessing, the strategy for finding the optimal number of topics, and subject knowledge. Being familiar with the context and themes, as well as with the different types of documents, is essential. This can be followed up with data visualizations and further processing, such as comparisons, identifying conflicts between ministries, tracking changes of theme over time, and zooming into individual documents.

 

 
 

Results

The LDA process results in a table of topics defined by user-generated tags, and this table can be used to create a heat map (showing the frequency with which a topic is mentioned by country) and for further evaluation of how, for example, the policies differentiate between topics and regions; this process is illustrated in Figure 5.

 

 

Figure 5: LDA model visualization

 

 

Heat maps

Based on this, the following visualization is generated (Figure 6). The horizontal axis contains the different topic labels from Table 1, while the vertical axis lists three countries: Mexico, Peru, and Chile. The heat map gives us insights into the different levels of categorical policy present in the three countries; for instance, territorial-related policy is widely prevalent in Mexico, but not adopted widely in Chile or Peru.
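A minimal sketch of how such a topic-by-country heat map can be produced with pandas and seaborn; the document-to-topic assignments are invented for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative table: one row per policy document with its LDA-assigned topic label
docs = pd.DataFrame({
    "country": ["Mexico", "Mexico", "Peru", "Chile", "Peru"],
    "topic": ["Territorial", "Forestry and Resources", "Development",
              "Forestry and Resources", "Agriculture"],
})

# Frequency of each topic per country
counts = docs.pivot_table(index="country", columns="topic", aggfunc="size", fill_value=0)

sns.heatmap(counts, annot=True, cmap="Greens")
plt.title("Topic frequency by country")
plt.show()
```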

This allows policymakers to observe the decisions made by other countries and how they compare to their local administration, enabling them to make better-informed, data-driven choices in domestic policy.

 

Figure 6: Heatmap displaying the frequency of appearance of LDA-defined policy topics by country

 

 

Next Steps

A valuable further development of topic analysis is to display policy topics by originator (ministries, etc.) to identify possible overlaps and conflicts, and to display changes of topics in legislation and shifting focus over time. Going deeper into the documents, LDA can also be used to map out the topics of individual paragraphs, separating specific from generic information and identifying paragraphs of particular relevance. By zooming into specific documents, and then into specific paragraphs, LDA is an efficient and flexible solution when faced with a huge volume of unclassified documents.

 

 

Conclusion: Topic Analysis for Policies

Finding needles in an online haystack is possible, especially with the help of the tools discussed: starting from a collection of web-scraped documents, going through a data engineering process to clean up the found documents, and using the Latent Dirichlet Allocation (LDA) method to structure the documents, and their fragments, by topic.

The data view by topic is a powerful way to see directly where and what kind of policy is most dominant, and this information can be used to refine the search further or to assist policymakers in defining the most efficient use of policies to create an environment where new policies contribute to Forest and Landscape Restoration.

In the visualization space, possible enhancements include identifying overlaps and conflicts between government entities, highlighting the active policy areas, and displaying financial incentive information and projections.

In summary, the use of LDA is a promising way to navigate complex environmental legislation systems and to retrieve relevant information from a vast compilation of legal text, from different sources, in multiple languages, and of varying quality.

 
Overcoming Data Challenges through the Power of Diverse & Collaborative Teams


In this demo day, we talked about the inevitable data challenges/roadblocks that come up in real-world AI projects. The insights shared came from our experiences with more than 20 AI projects, working with partners including the UN Refugee Agency (UNHCR), the World Resources Institute, the World Energy Council, and numerous NGOs and corporations.

Omdena is a collaborative platform to build innovative, ethical, and efficient AI solutions to real-world problems. Since our founding in May 2019, over 1250 AI experts from more than 80 countries have come together on Omdena projects to address significant issues related to hunger, sexual harassment, land conflicts, gang violence, wildfire prevention, and energy poverty.

We’ve seen that the way that we approach AI development, via bottom-up collaboration with diverse team members, fosters innovation and creativity which leads to the breakdown of data roadblocks. Innovation is inherent in the Omdena process.

We shared three Omdena projects to act as case studies for these innovative approaches to tackling data challenges.

 

Data Roadblock 1: Incomplete Data Sets

In the real world, datasets are rarely complete. We find having large teams of dozens of people means that data gathering, cleaning, and wrangling happen at a phenomenal speed. And by taking a bottom-up approach, we have multiple sub-teams looking at data problems from different angles, allowing for innovative approaches to be explored.

In the following case study, the Omdena team worked out ways to identify safe routes in a city in the aftermath of an earthquake, where the relevant data sets were inconsistent and unreliable.

 

Case Study : Disaster Response: Improving the Aftermath Management of an Earthquake

In collaboration with Istanbul’s Impact Hub innovation center, Omdena data scientists combined satellite imagery of Istanbul with street map data in order to build a tool that facilitates family reunification by indicating the shortest and safest route between two points after an earthquake.

“Omdena’s approach to AI development is by far the best that I have seen in 2019” — Semih Boyaci, Co-Founder Impact Hub Istanbul

You can learn more about this project here:

 

 

Data Roadblock 2: No Data

We don’t see the lack of data as a showstopper. On those projects without data, the team starts by asking what do we need to know to address the problem? Where might that data live? If it doesn’t exist, how can we create it from something that does exist? Here the diversity of the team members is very powerful.

We’ve seen time and again the impact of bringing together people with vastly different professional and life experiences. Our teams are typically 30% or more female. On any project, we’ll have on average 14 countries represented. Our collaborators range in age from 17 to 65. Not only does this diversity lead to ethical and trusted solutions, but it also fosters creativity and alternative ideas about what data is relevant and where to find it.

In the following project, we looked at how to assess post-traumatic stress disorder among those that have suffered trauma in low-resource environments. In this case, the team started with no data in-hand.

 

Case Study : Building a chatbot for Post-traumatic-stress-disorder (PTSD) assessment

32 Omdena collaborators developed a machine learning-driven chatbot for PTSD assessment in war and refugee zones.

 

The unique aspect of the project was that we did not start with a data set.

Through the collaborative efforts of the project community, the team identified and annotated suitable patient data. The teams applied linear classifiers for Natural Language Processing (NLP) for PTSD risk assessment and transfer learning for data augmentation.

You can learn more about this project here:

 

Data Roadblock 3: Disparate Data Sources

Relevant data doesn’t typically come packaged in just one form. We often need to meld disparate data sources to get at a solution. Through collaboration, sub-teams focused on separate data and AI techniques come together to integrate those efforts to derive insights about the problem.

In the following project, the goal was to uncover domestic violence in India hidden due to COVID lockdowns. Among the many challenges the team addressed was the integration of data culled from disparate sources.

 

Case Study : Analyzing Domestic Violence through Natural Language Processing

This project was done with the award-winning Red Dot Foundation. Within Omdena’s collaborative platform, the team looked to craft a dataset to reveal domestic violence and online harassment patterns in India during COVID-19 lockdowns. The AI experts scraped data from news articles as well as social media and applied various natural language processing (NLP) techniques such as topic modeling, document annotation, and stacked machine learning models.

 

 

You can learn more about this and related projects here:

 

 

 

More about Omdena

Omdena is the collaborative platform to build innovative, ethical, and efficient AI and Data Science solutions to real-world problems. 

| Demo Day Insights | Matching Land Conflict Events to Government Policies via Machine Learning


By Laura Clark Murray, Joanne Burke, and Rishika Rupam

 

A team of AI experts and data scientists from 12 countries on 4 continents worked collaboratively with the World Resources Institute (WRI) to support efforts to resolve land conflicts and prevent land degradation.

The Problem: Land conflicts get in the way of land restoration

Among its many initiatives, WRI, a global research organization, is leading the way on land restoration — restoring land that has lost its natural productivity and is considered degraded. According to WRI, land degradation reduces the productivity of land, threatening the economy and people’s livelihoods. This can lead to reduced availability of food, water, and energy, and contribute to climate change.

Restoration can return vitality to the land, making it safe for humans, wildlife, and plant communities. While significant restoration efforts are underway around the world, local conflicts get in the way. According to John Brandt of WRI, “Land conflict, especially conflict over land tenure, is a really large barrier to the work that we do around implementing a sustainable land use agenda. Without having clear tenure or ownership of land, long-term solutions, such as forest and landscape restoration, often are not economically viable.”

 

Photo credit: India’s Ministry of Environment, Forest and Climate Change


 

And though governments have instituted policies to deal with land conflicts, knowing where conflicts are underway and how each might be addressed is not a simple task. Says Brandt, “Getting data on where these land conflicts, land degradation, and land grabs occur is often very difficult because they tend to happen in remote areas with very strong language barriers and strong barriers around scale. Events occur in a very distributed manner.” WRI turned to Omdena to use AI and natural language processing techniques to tackle this problem.

 

The Project Goal: Identify news articles about land conflicts and match them to relevant government policies

 

Impact

“We’re very excited that the results from this partnership were very accurate and very useful to us.

We’re currently scaling up the results to develop sub-national indices of environmental conflict for both Brazil and Indonesia, as well as validating the results in India with data collected in the field by our partner organizations. This data can help supply chain professionals mitigate risk in regards to product-sourcing. The data can also help policymakers who are engaged in active management to think about what works and where those things work.” — John Brandt, World Resources Institute.

 

The Use Case: Land Conflicts in India

In India, the government has committed 26 million hectares of land for restoration by the year 2030. India is home to a population of 1.35 billion people, has 28 states, 22 languages, and more than 1000 dialects. In a land as vast and varied as India, gathering and collating information about land conflicts is a monumental task.

The team looked to news stories, with a collection of 65,000 articles from India for the years 2017–2018, extracted by WRI from GDELT, the Global Database of Events, Language, and Tone Project.

 

Identifying news articles about land conflicts

Land conflicts around land ownership include those between the government and the public, as well as personal conflicts between landowners. Other types of conflicts include those between humans and animals, such as humans invading habitats of tigers, leopards, or elephants, and environmental conflicts, such as floods, droughts, and cyclones.

 

 

The team used natural language processing (NLP) techniques to classify each news article in the 65,000 article collection as pertaining to land conflict or not. While this problem can be tackled without the use of any automation tools, it would take human beings years to go through each article and study it, whereas, with the right machine or deep learning model, it would take mere seconds.

A subset of 1,600 newspaper articles from the collection was hand-labeled as “positive” or “negative” to serve as examples of proper classification. For example, an article about a tiger attack would be hand-labeled as “positive”, while an article about local elections would be labeled as “negative”.

To prepare the remaining 63,400 articles for the AI pipeline, each article was pre-processed to remove stop words, such as “the” and “in”, and to lemmatize words, returning them to their root form. Coreference resolution was also used during preprocessing to increase accuracy. A topic modeling approach was used to further categorize the “positive” articles by the type of conflict, such as Land, Forest, Wildlife, Drought, Farming, Mining, or Water. With refinement, the classification model achieved an accuracy of 97%.
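A minimal sketch of such a text-classification step, using TF-IDF features and logistic regression from scikit-learn; the training examples are invented, and this is not necessarily the model that reached 97% accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled subset: 1 = land-conflict related ("positive"), 0 = not ("negative")
train_texts = [
    "tiger attack injures farmer near forest reserve",
    "state election results announced for local council",
]
train_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# Classify the remaining, unlabeled articles
print(clf.predict(["villagers protest land acquisition for mining project"]))
```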

 

 

With the subset of land conflict articles successfully identified, NLP models were built to identify four key components within each article: actors, quantities, events, and locations. To train the model, the team hand-labeled 147 articles with these components. Using an approach called Named Entity Recognition, the model processed the database of “positive” articles to flag these four components.
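A small sketch of the Named Entity Recognition step with spaCy. The generic pretrained English model shown here flags standard entity types (organizations, places, quantities, dates); the project instead trained a model on its own labels (actors, quantities, events, locations) using the 147 hand-labeled articles.

```python
import spacy

# Generic pretrained pipeline (install via `python -m spacy download en_core_web_sm`);
# a custom-trained model would be loaded the same way.
nlp = spacy.load("en_core_web_sm")

text = ("Forest department officials relocated 40 families from the tiger reserve "
        "in Madhya Pradesh in March 2018.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ORG, GPE, CARDINAL, DATE
```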

 

 

 

Matching land conflict articles to government policies

Numerous government policies exist to deal with land conflicts in India. The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

 

A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.
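A minimal sketch of this article-to-policy matching with TF-IDF vectors and cosine similarity; the article and policy snippets are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "Farmers protest the compensation offered for land acquired for a highway project."
policies = [
    "Land Acquisition Act of 2013: rules on compensation and consent in land acquisition.",
    "Indian Forest Act of 1927: regulation of forest produce and protected forests.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([article] + policies)

# Similarity of the article (row 0) to each policy, independent of document length
scores = cosine_similarity(matrix[0], matrix[1:])[0]
print(scores, policies[scores.argmax()])
```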

The Omdena team built a visual dashboard to display the land conflict events and the matching government policies. In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

Underlying this dashboard are the NLP models that classify news articles related to land conflict and land degradation, and match them to the appropriate government policies.

 

 

The results of this pilot project have been used by the World Resources Institute to inform their next stage of development.

Join one of our upcoming demo days to see the power of Collaborative AI in action.

Want to watch the full demo day?

Check out the entire recording (including a live demonstration of the tool).

 

NLP Clustering to Understand Social Barriers Towards Energy Transition | World Energy Council


Using NLP clustering to better understand the thoughts, concerns, and sentiments of citizens in the USA, UK, Nigeria, and India about energy transition and decarbonization of their economies. The following article shares observatory results on how citizens of the world perceive their role within the energy transition. This includes associated social risks, opportunities, and costs.

The findings are part of a two-month Omdena AI project with the World Energy Council (WEC). None of the findings are conclusive; they are observational, taking into account the complexity of the analysis scope.

 

The Project Goal

The aim was to find information that can help governments effectively involve people in the accelerating energy transition. The problem was quite complicated and no data was provided to us. Therefore, we had to create our own dataset, analyze it, and provide WEC with insights. We started with a long list of open questions such as:

  • What should our output look like?
  • What search terms would be useful to scrape data for?
  • What countries should be considered as our main focus?
  • Should we consider non-English languages as well and analyze them?
  • How much data per country will be enough?
  • Etc.

In order to meet the deadline for the project, we decided to go with the English language only and come up with good working models.

 

The Solution

 

Getting data from Social Media

We scraped the following resources: Twitter, YouTube, Facebook, Reddit, and well-known newspapers specific to each country. The desired insights needed to cover developed, developing, and under-developed countries, with particular emphasis on developing and under-developed countries.

The results discussed in this article were obtained from scraped tweet data for the USA, UK, India, and Nigeria, which together cover the three categories of developed, developing, and under-developed countries.

 

Our Approach: Trying different NLP techniques

We first gathered data by scraping tweets using several keywords we found to be important for specific countries using Google Trends. We then removed stop words, applied stemming, and stripped hashtags, punctuation, numbers, and mentions, replacing URLs with _URL. TF-IDF vectorization was used for feature extraction. The following sections walk through the various steps taken to tackle the problem.

 

Approach 1: Sentiment Analysis (Non-satisfactory)

Sentiment analysis of short tweets data comes with its own challenges and some of the important challenges we were facing for this project were:

  • Tags mean different things in different countries. #nolight can be Canadians complaining about the winter sunset, or Nigerians having a power cut.
  • Tags take a side. For example, #renewables is pro-green and #climatehoax is not. So positive sentiment on #renewables might not really tell us much.
  • The classifier model built on #climatechange and related tags does not work at all on anti-green tags such as #climatemyth.
  • Some anti-green tweets are full of happy emojis which makes the sentiments unreliable.
  • The major tweeting countries are overwhelmingly positive. In fact, the distribution of climate change-related tweets across the world is not uniform, and tweets from some countries are much more prevalent in the dataset than tweets from others (Figure 1) [1].
  • The interpretation of outputs. By just assigning labels to each tweet, we would not be able to derive insights on the barriers to the energy transition; therefore, the interpretability of the model is very important.

Considering all the challenges discussed, the sentiment analysis of the tweets did not produce satisfactory results (Table 1), and we decided to test other models.

 

 


Figure 1: Number of climate change-related tweets per country [1]

 

 


Table 1: Classifier accuracy for sentiment analysis of tweet data (USA)

 

 

Approach 2: Topic Modeling (Unsatisfactory) 

Topic modeling is an NLP technique that provides a way to compare the strength of different topics and tells us which topics are more informative than others. Topic models are unsupervised, with no need for data labeling. Because tweets are short, it was hard to differentiate between topics and assign tweets to a specific topic using models such as LDA. Topic models tend to produce the best results when applied to texts that are not too short and that have a consistent structure.

 

1. Using a semi-supervised approach

We chose a semi-supervised topic modeling approach (CorEx) [2]. Since the data was very high-dimensional, we applied dimensionality reduction to remove noise and make the data easier to interpret. A permutation test was used to determine the optimum number of principal components required for PCA [3,4]. From the explained-variance-ratio plot, the cumulative explained variance line appeared not perfectly linear, but very close to a straight line.

Through permutation tests, we noticed that the mean explained variance ratio of the permuted matrices did not really differ from the explained variance ratio of the non-permuted matrix, which suggested that applying PCA to the correlated topic model’s results was not helpful at all.

 

 

 

 

This means each of the principal components contributes to the variance explanation almost equally, and there’s not much point in reducing the dimensions based on PCA.
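A rough sketch of such a permutation test with NumPy and scikit-learn: each feature column is shuffled independently to break the correlations, and the explained-variance profile of the permuted data is compared with the original. The data and the number of permutations are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))             # illustrative document-by-topic score matrix

explained = PCA().fit(X).explained_variance_ratio_

# Permute each column independently to destroy inter-feature correlation
perm_ratios = []
for _ in range(100):
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    perm_ratios.append(PCA().fit(X_perm).explained_variance_ratio_)
perm_mean = np.mean(perm_ratios, axis=0)

# Components worth keeping explain clearly more variance than chance;
# if almost none do, PCA is not buying us anything (the situation described above)
n_keep = int(np.sum(explained > perm_mean))
print(n_keep)
```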

 

2. Identifying 20 important topics

The CorEx results showed that there are about 20 important topics, along with the important words per topic. But how do we interpret the results?

The data was very high-dimensional, and dimensionality reduction was not helpful at all. For example, if price, electricity, ticket, fuel, gas, and skepticism are the most important words for one topic, how do we understand the concerns of the people of that country? Is it fuel prices that concern them? Or electricity prices, or ticket prices? There could be a combination of many different, possibly related words in each topic, and by just looking at the important words in each topic it is not possible to find the story behind the data.

Besides, bigrams or trigrams with topic models did not help much either, because the keywords conveying the main focus of a tweet do not always appear together.

 

 

 

 

Approach 3: Clustering (Kmeans & Hierarchical)

Both K-means and hierarchical clustering models led to comparable results, illustrating clear, separate clusters. Because both models have comparable performance, we derived all results using hierarchical clustering, which better shows the hierarchy of the clusters. Tweet data were collected for four different countries, as discussed before, and the model was applied to the data of each country separately. To keep this summary short, we only show the clustering results for India, but the insights across all countries are shown at the end of the article.
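A minimal sketch of the clustering step, using TF-IDF features with scikit-learn’s agglomerative (hierarchical) clustering; the tweets and the number of clusters are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "electricity bills keep rising every month",
    "no light again today, another power cut",
    "solar panels on every roof would change everything",
    "fuel prices make commuting unaffordable",
]  # illustrative, already cleaned tweets

X = TfidfVectorizer(stop_words="english").fit_transform(tweets).toarray()

# Ward-linkage hierarchical clustering into 2 clusters (k chosen for illustration)
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(list(zip(labels, tweets)))
```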

 

 

 

 

Hierarchical Clustering Results

After finding clear clusters in the data, the next step was interpreting them by creating meaningful visualizations and insights. A combination of Scattertext, a co-occurrence graph, a dispersion plot, collocated word clouds, and top trigrams resulted in very useful insights from the data.

An important lesson to point out here is to always rely on a combination of plots for your interpretations instead of only one. Each type of plot helps us visualize one aspect of the data, and combining various plots creates a comprehensive, clear picture.

 

 

1. Using Scattertext

Scattertext is an excellent exploratory text analysis tool that allows cool visualizations differentiating between the terms used by different documents using an interactive scatter plot.

Two types of plots were created, which were very helpful in interpreting the results.

1) Word embedding projections, explored through word associations with a specific keyword. The keywords used were: Access, Availability, Affordability, Bills, and Prices. Readers can try additional keywords using the code provided with this study.

2) Unigram plots, in which the unigrams from the clustered tweets are positioned according to their dense-ranked, category-specific frequencies. The difference in dense ranks between categories is used as the scoring function.

All interactive plots are stored as HTML files and are available in the GitHub repository. Clicking a term in the interactive version lists the tweets containing it. Note that hierarchical clustering is applied first, and the clustered tweets are then passed to Scattertext as input. The data used to create these results can be found here, and the notebook used to cluster the tweets and create the scatter plots can be found here. A sketch of this Scattertext step follows below.
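The two plot types can be reproduced with something like the following sketch; the column names, cluster labels, and parameter values are assumptions, and the notebook in the repository is the authoritative version. The unigram plot uses Scattertext's standard explorer (which positions terms by category-specific frequency ranks), and the embedding projection uses its spaCy-based word-similarity explorer.

```python
import spacy
import scattertext as st

# df: pandas DataFrame with a "text" column (tweets) and a "cluster" column holding
# string labels such as "cluster_1" (column and label names are illustrative).
nlp = spacy.load("en_core_web_md")  # a model with word vectors, needed for the embedding plot

# Compare cluster 1 against cluster 2 only, as in Figure 8.
pair = df[df["cluster"].isin(["cluster_1", "cluster_2"])]
corpus = st.CorpusFromPandas(pair, category_col="cluster", text_col="text", nlp=nlp).build()

# Plot type 2: terms positioned by their dense-ranked, category-specific frequencies.
html = st.produce_scattertext_explorer(
    corpus,
    category="cluster_1",
    category_name="Cluster 1",
    not_category_name="Cluster 2",
    width_in_pixels=1000,
)
open("cluster1_vs_cluster2.html", "w", encoding="utf-8").write(html)

# Plot type 1: word associations with a target keyword (e.g. "prices"), as in Figure 10.
html = st.word_similarity_explorer(
    corpus,
    category="cluster_1",
    category_name="Cluster 1",
    not_category_name="Cluster 2",
    target_term="prices",
    minimum_term_frequency=5,
)
open("word_embedding_prices.html", "w", encoding="utf-8").write(html)
```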

The following shows the interactive versions of all plots for various countries:

 

1.1. Rank and frequencies across different categories (India)

 

 


Figure 8. An example Scattertext plot showing the positions of terms based on the dense ranks of their frequencies, for clusters 1 and 2. Scores are the differences between the terms' dense ranks: the bluer a term, the higher its association score for cluster 1; the redder a term, the higher its association score for cluster 2. See Cluster 1 vs 2 for an interactive version of this plot.

 

 


Figure 9. An example Scattertext plot showing the positions of terms based on the dense ranks of their frequencies, for clusters 1 and 3. Scores are the differences between the terms' dense ranks: the bluer a term, the higher its association score for cluster 1; the redder a term, the higher its association score for cluster 3. See Cluster 1 vs 3 for an interactive version of this plot.

 

 

1.2. Word embedding projection plots using Scattertext (India)

 

 


Figure 10. An example Scattertext plot showing word associations with the term prices, using spaCy's pretrained embedding vectors. The top right corner shows the words most commonly associated with prices, such as electricity. Clicking a term in the interactive version lists the tweets containing it. See Word Embedding: Prices for an interactive version of this plot.

 

 


Figure 11. An example Scattertext plot showing word associations with the term bills, using spaCy's pretrained embedding vectors. The top right corner shows the words most commonly associated with bills, such as electricity, prices, energy, and power. Clicking a term in the interactive version lists the tweets containing it. See Word Embedding: Bills for an interactive version of this plot.

 

 

2. Twitter Insights (Price & Energy Transition Concerns)

 

2.1. India
  • Solar and wind do not necessarily mean cheaper prices; when Germany went all in on renewables, energy prices and carbon emissions went up.
  • Electricity prices can drop for people sourcing power from government-owned renewable sources, because those prices do not vary with oil and natural gas.
  • Renewable energy policy can lead to much lower electricity prices, a stronger globally competitive economy, less import of fossil fuels, and as a result less pollution.
  • Putting a tax on coal and making open access a reality are two potential action areas to make renewable energy affordable.
  • Let oil prices increase and subsidies stop.
  • Many requests to replace fossil fuels with cleaner fuels such as crop stubble from farmers.
  • Cut oil imports and encourage renewable energies.
  • Many complaints about electricity shortages: no power for hours or days, frequent cuts, and unreliable electricity and water supply.
  • Fossil fuels are dirty and nuclear power is dangerous, so we need to make renewable energy work and harness clean energy for a better future.

 

2.2. Nigeria
  • People complain about the lack of constant electricity and the absence of business-friendly policy.
  • Enhancing the delivery of electricity in the country.
  • Whenever it rains, the electricity supply is cut off for days; power is missing on weekends, during the day, and overnight, and the supply is unstable.
  • No water and no electricity.
  • The electricity sector is the third-largest consumer of oil.
  • Lots of worries and trouble regarding paying electricity bills.
  • Access to electricity is not for everyone.
  • Access to affordable sustainable renewable energy.
  • Renewable energy water and waste management are some of Nigeria’s major partnership areas with Ghana.
  • Harnessing tidal or offshore wind energy, which are clean and renewable sources.
  • Many positive experiences and low prices reported with solar power systems.

 

2.3. UK

  • Bringing down the prices of electricity and gas.
  • Having stable prices for electricity.
  • People would rather see higher prices for gas than for electricity.
  • Need to think beyond electricity to effect the energy transition.
  • Renewables disrupt the electricity market, and politicians raising electricity prices to tackle the climate emergency is seen as awful policy.
  • A lot of requests for investment in renewable energies.
  • The transition to renewables is too slow.
  • Lots of discussions on whether it is good to replace the nuclear stations with renewables.
  • Whether the zero-carbon economy has any economic benefit for the UK.

 

2.4. USA

  • Slowing down climate change.
  • Market-based solutions for climate change.
  • Renewable energy infrastructure is lame and unreliable.
  • Renewables increase electricity prices and distort energy markets with favorable purchase agreements.
  • Many complaints regarding gas prices.
  • National security priorities should include renewable energy, investing in its infrastructure and job programs.
  • Figure out how to store renewable energy and remove excess CO2 from the atmosphere.
  • Renewable energy represents a significant economic opportunity.

 

 

3. Weighing a word's importance via Dispersion Plot

A word's importance can be weighed by its dispersion in a corpus. Lexical dispersion measures how evenly a word is spread across the parts of a corpus. The following plot shows where each keyword occurs across the corpora of the different countries: India, Nigeria, the UK, and the USA.

According to the following dispersion plot, access is an important concern for Nigeria, while this is not the case for the other three countries. How do we know this access refers to electricity? The answer lies in the Scattertext plots shown in the previous section: analyzing them together with the dispersion plot shows that the concern is electricity access.

Access to affordable renewable energy is a major concern in Nigeria and, to a lesser extent, India, while affordability is not an issue for people in the UK and the USA. In Nigeria in particular, people report difficulty paying their electricity bills.

Energy, electricity, power, and renewables are also the topic of most of the discussions in all of these countries. But what aspects of each topic are of concern to each country? The answer is given in the previous section where we interpret the results of Scattertext plots.
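A dispersion plot like the one in Figure 12 below can be produced with NLTK; the keyword list and the tweets_by_country mapping in this sketch are hypothetical stand-ins for the project's data.

```python
from nltk.text import Text
from nltk.tokenize import word_tokenize

# tweets_by_country: dict mapping a country name to one concatenated string of its
# tweets (hypothetical variable). Requires nltk.download("punkt") for the tokenizer.
keywords = ["access", "affordable", "electricity", "energy", "power", "renewable"]

for country, raw_text in tweets_by_country.items():
    tokens = word_tokenize(raw_text.lower())
    # Draws word offsets along the corpus for each keyword (one plot per country).
    Text(tokens, name=country).dispersion_plot(keywords)
```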

 

 


Figure 12. Lexical dispersion for various keywords across different countries

 

 

4. Top Trigrams for Different Countries

 

 


Figure 13. Top twenty trigrams for India

 

 

As can be seen from the top 20 trigrams for India, the top concerns are Renewable energy, Renewable energy sector, Renewable energy capacity, Renewable energy sources, New renewable energy, and Clean renewable energy. These concerns closely match the insights drawn from clustering in the previous section.

 

 


Figure 14. Top twenty trigrams for Nigeria

 

 

As can be seen from the top 20 trigrams for Nigeria, the top concerns are Renewable energy, Renewable energy training, Electricity distribution companies, Renewable energy sources, Renewable energy solutions, Solar renewable energy, Renewable energy sector, Affordable prices, Power supply, Climate change renewables, Public-private sectors, Renewable energy industry, Renewable energy policies, and Access to renewable energy. These concerns closely match the insights drawn from clustering in the previous section.

 

 


Figure 15. Top twenty trigrams for UK

 

 

As can be seen from the top 20 trigrams for the United Kingdom, the top concerns are Free renewable energy, Renewable energy sources, Using renewable energy, and New renewable energy. These concerns closely match the insights drawn from clustering in the previous section.

 

 


Figure 16. Top twenty trigrams for USA

 

 

As can be seen from the top 20 trigrams for the USA, the top concerns are Clean renewable energy, Renewable energy sources, Supporting renewable energy, Renewable fuel standard, Transition into renewable energy, Solar renewable energy, New renewable energy, Using renewable energy, Need for quality products, and Renewable energy jobs. These concerns closely match the insights drawn from clustering in the previous section.
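The trigram counts behind Figures 13–16 can be reproduced with a simple n-gram counter. The sketch below uses scikit-learn's CountVectorizer; the india_tweets variable is a hypothetical list of that country's tweet strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, n=20):
    """Return the n most frequent trigrams across a list of documents."""
    vec = CountVectorizer(ngram_range=(3, 3), stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1  # total occurrences of each trigram
    return (pd.Series(totals, index=vec.get_feature_names_out())
              .sort_values(ascending=False)
              .head(n))

print(top_trigrams(india_tweets))   # india_tweets: list of tweet strings (hypothetical)
```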

 

 

5. Collocated word clouds & Co-occurrence Network

The following plots display the networks of co-occurring words in the tweets for each country. Here, we visualize the network of the top 25 most frequent bigrams. In every case, the connections between words confirm the insights derived in the previous sections.
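A network of the top 25 bigrams, similar to Figures 18, 20, 22, and 24, can be sketched with NLTK and networkx as follows; the stop-word handling and the india_tweets variable are illustrative assumptions.

```python
from collections import Counter

import matplotlib.pyplot as plt
import networkx as nx
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))  # requires nltk.download("stopwords") and ("punkt")

def bigram_network(texts, top_n=25):
    """Build a graph whose edges are the top_n most frequent adjacent-word bigrams."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in STOP]
        counts.update(zip(tokens, tokens[1:]))
    graph = nx.Graph()
    for (w1, w2), freq in counts.most_common(top_n):
        graph.add_edge(w1, w2, weight=freq)
    return graph

g = bigram_network(india_tweets)   # india_tweets: list of tweet strings (hypothetical)
nx.draw_networkx(g, node_size=300, font_size=8)
plt.show()
```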

 

 


Figure 17. Collocate Clouds-India

 

 


Figure 18. Co-occurrence Network-India (First 25 Bigrams)

 

 


Figure 19. Collocate Clouds-Nigeria

 

 


Figure 20. Co-occurrence Network-Nigeria (First 25 Bigrams)

 

 


Figure 21. Collocate Clouds-UK

 

 


Figure 22. Co-occurrence Network-UK (First 25 Bigrams)

 

 


Figure 23. Collocate Clouds-USA

 

 


Figure 24. Co-occurrence Network-USA (First 25 Bigrams)

 

 

 

 

 

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.
