Why are some NGOs more successful in their efforts to use artificial intelligence (AI) efficiently than others?
By Oliver Norkauer
Working with 30+ NGOs
In the last 18 months, Omdena has worked with 30+ NGOs worldwide, creating numerous real-world AI solutions for them. While the projects were very diverse in nature, we found important similarities that went beyond merely delivering an AI-based solution.
In this article, we will look at the differences and similarities of our AI projects for NGOs. We will show the success factors on a journey to become an “AI-enabled NGO” and illustrate them with real-life examples.
Omdena’s projects were very diverse in nature: Some had a clear problem statement, others did not, and in others, the problem description changed during the project. Some projects came with their own data set, others started without any data. In some projects, we had to distill meaningful information out of literally thousands of pieces of text or social media posts, in others we analyzed satellite data, and in others, we had to extract and combine data from multiple sources in various formats.
Omdena’s clients were diverse, too. Some of them wanted to run only one project, while others ran multiple projects with us. Some clients wanted to gain insights from their data, others wanted a prototype, and still others needed a solid working solution for everyday use.
Omdena’s clients wanted to use a data-driven approach either to improve inefficient processes or to deliver at least one of their services at a much larger scale. They were willing to enrich their expertise and qualitative decision-making with insights and predictions generated by data and AI models.
When looking back, we realized two important take-aways:
Our most successful clients used their first project to start a journey towards becoming an “AI-enabled NGO”, although most of them weren’t aware of it. (In the beginning, we weren’t aware of it either.)
All clients went through the same phases in their efforts.
The AI-enabled NGO
An AI-enabled NGO is an organization that recognizes the value of data and uses AI algorithms to deliver services and programs highly efficiently and at a large scale.
The journey starts when an organization realizes that it can use data to improve current processes and takes the first step to discover concrete options. Once the NGO has seen the potential of data-based solutions, it almost naturally continues on this path, moving step by step towards becoming an AI-enabled NGO.
On the way to becoming an AI-enabled NGO, all organizations went through the same process:
Phase 1: Discovery – “What can AI do for us”?
Phase 2: Rapid Prototype – “Demonstrate what AI can do for us!”
Phase 3: Productionize – “Have AI deliver concrete value to stakeholders”
Phase 4: Capacity Building – “Use AI-based solutions to build in-house capacity”
We will now look into each of these phases in more detail and illustrate them with examples.
Phase 1: Discovery
In the first phase, organizations need to clarify two main questions:
Which of the problems we face are suited for an AI-based solution?
Which data sources are available within our organization?
During the Discovery phase (which takes between two weeks and two months), we usually find new information that refines the original answers.
Example 1: Change the problem description
Our project with “Impact Hub Istanbul” started with the following problem description:
Istanbul lies within a region of seismic activity. Almost inevitably, an earthquake will strike the city again. If this happens during the daytime, families will be spread across different quarters of the city. A lot of roads will be blocked or unsafe. How can families be reunited after an earthquake? Which roads will be safe?
The first challenge was to define what “safe” and “unsafe” mean in the context of an anticipated earthquake. When we started looking into secure pathways, we found that reuniting families means finding safe routes between schools, hospitals, workplaces, and homes, where all of these locations can be essentially anywhere in the city.
So the problem was not limited to families; it applied to everybody who needs to find a secure path between two locations, including other NGOs that need to deliver, for example, medical supplies. Thus, the new problem description was: “Calculate the shortest and safest path between two locations in Istanbul after an earthquake.”
Example 2: Identify data sources
In the same project, the NGO itself had no data with which to start. So, we searched for open data, starting with city maps based on satellite images provided by the OpenStreetMap (OSM) Foundation. When looking at these maps, we realized that they showed roads within the city but did not allow us to predict which roads would be safe after an earthquake.
Which pathways would be safe? We decided that broad roads and green areas, i.e., areas without houses, would be safe, as smaller roads might be blocked by debris. As the OSM data does not provide road width information, we identified rooftops from these satellite images and calculated the space between these rooftops as a measure of road width.
With these new data sets, we could develop a prototype of an algorithm to find the safest route between two locations in Istanbul after an earthquake.
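The routing step behind such a prototype can be sketched as a standard shortest-path search over a weighted road graph. The sketch below is illustrative rather than the project’s actual code: the toy graph, the `safety_penalty` values, and the weighting formula are all assumptions.

```python
import heapq

def safest_path(graph, start, goal):
    """Dijkstra search over a road graph whose edge weights combine
    physical length with an (assumed) safety penalty, so narrow
    roads that may be blocked by debris cost more to traverse."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    visited = set()
    while pq:
        d, node = heapq.heappop(pq)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for nbr, length, safety_penalty in graph.get(node, []):
            # Hypothetical weighting: penalize unsafe segments
            # proportionally to their length.
            nd = d + length * (1.0 + safety_penalty)
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(pq, (nd, nbr))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return list(reversed(path))

# Toy graph: node -> [(neighbour, length_m, safety_penalty)]
roads = {
    "home":   [("narrow", 100, 2.0), ("broad", 150, 0.0)],
    "narrow": [("school", 100, 2.0)],
    "broad":  [("school", 150, 0.0)],
}
```

With these weights, the search prefers the longer but broader route, which is exactly the trade-off the rooftop-spacing data was meant to enable.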
In phase 1, we clarified the most important questions. We next need to find potential answers and show them to stakeholders.
Phase 2: Rapid Prototype
After identifying the problem statement and data sources, phase 2 starts. In this phase, the NGO needs to verify the value of an AI-solution both internally and externally, so they get the desired stakeholder support, which in turn can lead to additional funding.
This phase should not take longer than two months.
Example 3: Rapid Prototype
The goal of a project with TrashOut was to “build machine learning models on illegal dumping(s) to see if there are any patterns that can help to understand what causes illegal dumping(s), predict potential dumpsites, and eventually how to avoid them”.
In two months, we built a prototype based on data from TrashOut combined with two other data sets. The prototype not only showed existing dumpsites (image 1) but also predicted the probability of illegal dumpsites in the form of a heatmap (image 2).
Image 1: Show existing locations
Using Machine Learning to predict illegal dumpsites
Image 2: Predicted probability
Customer quote from Lucia Kelnarová, Project Leader, TrashOut
“Amazing work done in a super short time. We hope to implement the work and make an impact on the world.”
While prototyping is an essential step, it does not provide a final solution. So, as the next step, we need to create a reliable, proven product for everyday use.
Phase 3: Productionize
While phases one and two can be regarded as exploratory steps, in phase three the prototype is developed into a solution that delivers value for stakeholders. In this phase, the prototype’s algorithms are made more robust and their reliability typically improves. To accomplish this, new data sources might be identified, and the model will run through new training cycles and need adjustment.
This is a major step in implementing a real-world AI solution, and a lot of organizations underestimate the effort required and experience difficulties in deployment. The “2020 state of enterprise machine learning” report by Algorithmia finds that about half of deployments take between eight and ninety days, while 18-36% take up to a year or longer.
In the Productionize phase, the NGO needs to have some technical skills in-house, but usually also uses external consultants. Ideally, these consultants were already involved in building the prototype.
In a two-month project for the World Resources Institute (WRI), the problem statement was to “create a machine learning algorithm that can be used as a proxy for socio-economic well-being in India”. We first built a prototype, using both census and satellite data, which had a reliability of 60-75%. In the productionizing phase, we added more satellite data sources, improving reliability to 85-90%.
Phase 4: Capacity Building
The first two phases were exploratory, the third phase puts the first AI-based solution into production, starting to provide value to the NGO’s stakeholders.
In phase four, the NGO starts a larger project, not only to create another AI-based solution but also to build up internal skills, thereby building capacity at a larger scale. This is a longer part of the journey, taking 3-6 months.
Quote from Saurav Suman, United Nations World Food Program
“The collaborative approach of Omdena is taking innovation to a whole new level with the idea of leveraging technology to bring in people with different capacities and work on a problem. The driving force behind this approach is the accelerated learning through collaborative spirit, mentoring and spot-on guidance. On top of all that are the humanitarian problems that Omdena is working on. WFP Nepal is proud to have worked together with Omdena on one of the projects addressing zero hunger “crop area identification project”. We believe this is the start of a long journey together.”
In the earlier projects, only a few NGO staff were involved as domain experts. In this phase, more NGO staff take part, and they contribute more broadly throughout the project.
Omdena’s people-, tool-, and process-based approach not only leads to a technical AI-based solution, but also to more collaboration among the NGO’s workforce. Our proven process ensures that everybody is actively involved, communicates, and contributes towards the common goal. In detail:
Collaboration establishes a trusted, reliable, and non-hierarchical communication structure that can be expanded throughout the organization;
Collaboration breaks down organizational silos, one of the biggest impediments to innovation and efficiency in complex environments;
Agile techniques speed up collaboration and yield more creative solutions;
Diverse teams not only lead to better results, they also lead to more openness;
Working on projects that achieve goals at a large scale will improve loyalty and dedication among your workforce.
Example 5: An AI-based tool providing value
A project with the WRI aimed to provide a machine-learning-based methodology that identifies land conflict events in several regions in India and matches those events to relevant government policies. The overall objective was to offer a platform where policymakers could be made aware of land conflicts as they unfold and identify existing policies that are relevant to the resolution of those conflicts.
We developed a visualization app as a prototype within two months. WRI showed this app to sponsors and donors, and secured funding for follow-up projects, which expanded the scope of the project from India to other countries. This, in turn, leveraged the visibility of the project on a wider scale.
Quote from John Brandt, WRI:
“We’re really excited about the results of this project. My team currently uses the code and infrastructure on an almost weekly basis. […] We’re very excited that the results from this partnership were very accurate and very useful to us, and we’re currently scaling up the results to develop sub-national indices of environmental conflict for both Brazil and Indonesia, as well as validating the results in India with data collected in the field by our partner organizations. This data can help supply chain professionals mitigate risk in regards to product-sourcing.”
With AI-based solutions, NGOs can provide services highly efficiently and at a much larger scale. Omdena’s experience shows that an organization’s journey to becoming an AI-enabled NGO goes through four phases if they want to fully realize the potential of AI:
In phase 1 (“Discovery”), they find out what AI can do for them;
In phase 2 (“Rapid Prototyping”), a prototype product demonstrates what a real solution can do for the stakeholders;
In phase 3 (“Productionize”), the prototype is expanded and made more robust. At the end of that phase, the AI solution is available for everyday use;
In phase 4 (“Capacity Building”), the organization builds up internal skills to fully use the potential of its AI-based solutions.
Omdena has worked with 30+ NGOs, and it has supported NGOs on their journey to become an AI-enabled NGO. Omdena offers a project-, process-, and tools-based approach, which may be complemented with consulting and training services when needed.
How can AI and Natural Language Processing (NLP) help alleviate social workers’ administrative burden in case management?
By Shrey Grover and Jianna Park
The social service sector is showing increasing interest in data-driven practices, which until now have been predominantly utilized by its commercial counterpart.
Some of the key reforms that social organizations expect from leveraging data include:
Harness the potential of the underlying gold mine of expert knowledge
Relieve the limited staff of repetitive administrative and operational tasks
Address missions on a shorter timeline with enhanced efficiency
Yet, according to IBM’s 2017 study, most of the sector is still in the early stages of the data journey, as shown in the visual below. Constrained budgets and limited access to technology and talent are cited as the major hurdles to utilizing analytical services.
We at Omdena had an unparalleled opportunity to work on one such nascent-stage project for International Social Service (ISS), a 96-year-old NGO that contributes massively to resolving child protection cases.
Why This Project?
ISS has a global network of expertise in providing essential social services to children across borders — mainly in the domains of child protection and migration. However, with over 70,000 open cases per year, ISS caseworkers were facing challenges in two aspects: managing time, and managing data. The challenges were often exacerbated by administrative backlogs and a high turnover rate not uncommon in the nonprofit sector.
If we could find a way to significantly reduce the percentage of time lost on repetitive administrative work, we could focus on the more direct, high-impact tasks, helping more children and families access better quality services. ISS saw an urgent need for a technological transformation in the way they managed cases — and this is where Omdena came into the picture.
The overarching goal of this project was to improve the quality of case management and avoid unnecessary delays in service. Our main question remained, how can we help save caseworkers’ time and leverage their data in a meaningful way?
It was time to break down the problem into more specific targets. We identified factors that hinder ISS caseworkers from focusing on the client-facing activities, seen in the following picture.
We saw that these subproblems could each be solved with different tools, which would help organizations like ISS understand the various ways that machine learning can be integrated with their existing system.
We also concluded that if we could manage data in a more streamlined manner, we could manage time more efficiently. Therefore, we decided that introducing a sample database system would prove to be beneficial.
The biggest challenge we faced as a team was the data shortage. ISS had a strict confidentiality agreement with their clients, which meant they couldn’t simply give us raw case files and expose private information.
Initially, ISS gave us five completed case files with names either masked or altered. As manually editing cases would have taken up caseworkers’ client-facing time, our team at Omdena had to find another way to augment data.
Our team collectively tackled the main problem from various angles, as follows:
As we had only five data points to work with and were not authorized to access ISS’s data pool, we clarified that our final product would be a proof-of-concept rather than a production-ready system.
Additionally, keeping in mind that our product was going to be used by caseworkers who may not have a technical background, we consolidated the final deliverables into a web application with a simple user interface.
We relied on a supervised learning approach for our risk score prediction model, manually labeling a risk score for each of our cases. A risk score, a float value ranging from 0 to 1, is intended to highlight priority cases (i.e., cases that involve a threat to a child’s wellbeing or have tight time constraints) that require the immediate attention of caseworkers.
The scores took into account various factors, such as the presence or history of abuse within the child’s network, access to education, the caretaker’s willingness to care for the child, and so on. To reduce bias, three collaborators scored each case independently, and the average of the three was taken as the final risk score.
Finally, we demarcated the risk scores into three categories, using the following threshold.
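In code, the demarcation reduces to a simple mapping. The cut-off values below (0.33 and 0.66) are illustrative assumptions, not the project’s actual thresholds:

```python
def risk_category(score, low=0.33, high=0.66):
    """Map a 0-1 risk score to one of three categories.
    The cut-offs here are illustrative assumptions; the
    project's actual thresholds were set by the team."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("risk score must be between 0 and 1")
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"
```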
Additionally, we augmented our data — which originally only contained case text — by adding extra information such as case open and close dates, type of service requested, and country where the service is requested. Using this, we created a seed file to populate our sample database. These parameters would later help caseworkers see how applying a simple database search and filter system can enable dynamic data retrieval.
Next, we moved to data preprocessing which is crucial in any data project pipeline. To generate clean, formatted data, we implemented the following steps:
Text Cleaning: Since the case texts were pulled from different sources, we had different kinds of noise to remove, including special characters, unnecessary numbers, and section titles.
Lowercasing: We converted the text to lower case to avoid multiple copies of the same words.
Tokenization: Case text was further converted into tokens of sentences and words to access them individually.
Stop Word Removal: As stop words did not contribute to certain solutions that we worked on, we considered it wise to remove them.
Lemmatization: For certain tasks like keyword and risk factor extraction, it was necessary to reduce each word to its lemma (e.g., “crying” to “cry,” “abused” to “abuse”), so that words with the same root are not counted multiple times.
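The steps above can be sketched with the standard library alone. This is a minimal illustration: the stop-word list is abridged, and lemmatization is approximated with a naive suffix rule, whereas the project used proper NLP tooling:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "was", "to", "of", "and"}  # abridged

def preprocess(text, stop_words=STOP_WORDS):
    """Minimal sketch of the cleaning pipeline: strip noise,
    lowercase, tokenize, drop stop words, and apply a crude
    suffix rule as a stand-in for real lemmatization."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove special chars/numbers
    tokens = text.lower().split()               # lowercase + tokenize
    tokens = [t for t in tokens if t not in stop_words]
    # Naive lemma rule: "crying" -> "cry"; a real pipeline would
    # use a lemmatizer from spaCy or NLTK here.
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t
            for t in tokens]
```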
We had to convert the case texts into compact numerical representations of fixed lengths to make them machine-readable. We considered four different types of embedding methods — Term Frequency Inverse Document Frequency (TFIDF), Doc2Vec, Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT).
To choose the one that works best for our case, we embedded all cases using each embedding method. Next, we reduced the embedding vector size to 100 dimensions using Principal Component Analysis (PCA). Then, we used a hierarchical clustering method to group similar cases as clusters. To find the optimal number of clusters for our problem, we referred to the dendrogram plot. We finally evaluated the quality of the clusters using Silhouette scores.
After performing these steps for all four algorithms, we observed the highest Silhouette score for USE embeddings, which was then selected as our embedding model.
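The per-method evaluation procedure can be sketched as follows. The synthetic vectors below stand in for the real case embeddings, and the function mirrors the PCA, hierarchical-clustering, and Silhouette steps described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_quality(embeddings, n_clusters, n_components=100):
    """Score one embedding method: reduce dimensionality with PCA,
    group cases with hierarchical (agglomerative) clustering, and
    return the Silhouette score (higher means tighter clusters)."""
    n_components = min(n_components, *embeddings.shape)
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(reduced)
    return silhouette_score(reduced, labels)

# Synthetic stand-in: two well-separated blobs of "case embeddings".
rng = np.random.default_rng(0)
fake = np.vstack([rng.normal(0, 0.1, (20, 16)),
                  rng.normal(5, 0.1, (20, 16))])
score = cluster_quality(fake, n_clusters=2, n_components=8)
```

Running this for each candidate embedding and keeping the method with the highest score reproduces the selection logic that favored USE.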
Models & Algorithms
Multiple pre-trained extractive summarizers were tried, including BART, XLNet, BERT-SUM, and GPT-2, all made available through the Hugging Face Transformers library. As evaluation metrics such as ROUGE-N and BLEU require far more reference summaries than we had, we opted for a relative performance comparison, checking the quality and noise level of each model’s output. Inference speed then played a major role in determining the final model for our use case: XLNet.
Keyword & Entity Relation Extraction
Keywords were obtained from each case file using RAKE, a keyword extraction algorithm that determines high-importance phrases based on their frequencies in relation to other words in the text.
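A tiny RAKE-style scorer illustrates the idea behind the algorithm. This is a simplified sketch (abridged stop-word list, no parameter tuning), not the library implementation the project used:

```python
import re

STOP = {"the", "a", "an", "of", "to", "and", "in", "was", "is"}  # abridged

def rake_phrases(text):
    """Minimal RAKE sketch: split text on stop words into candidate
    phrases, score each word as degree/frequency (degree counts
    co-occurring words within phrases), and score a phrase as the
    sum of its word scores. Returns phrases, highest score first."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOP:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = {}, {}
    for ph in phrases:
        for w in ph:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(ph)
    scores = {" ".join(ph): sum(degree[w] / freq[w] for w in ph)
              for ph in phrases}
    return sorted(scores, key=scores.get, reverse=True)
```

Longer multi-word phrases accumulate higher degree scores, which is why RAKE tends to surface descriptive noun phrases rather than isolated frequent words.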
For entity relations, several techniques using OpenIE and AllenNLP were tried, but each had its own drawbacks, such as producing instances of repetitive information. So we implemented our own custom relation extractor utilizing spaCy, which better identified subject and object nodes as well as their relationships based on root dependency.
The pairwise similarity was computed between a given case and the rest of the data based on USE embeddings. Among Euclidean distance, Manhattan distance, and cosine similarity, we chose cosine similarity as our distance metric for two reasons.
First, it works well with unnormalized data. Second, it takes into account the orientation (i.e. angle between the embedding vectors) rather than the magnitude of the distance between the vectors. This was favorable for our task as we had cases of various lengths, and needed to avoid missing out on cases with diluted yet similar embeddings.
After getting similarity scores for all cases in our database, we fetched top five cases that had the highest similarity values to the input case.
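The retrieval step reduces to a few lines of linear algebra. The toy 2-D vectors below stand in for the real USE embeddings:

```python
import numpy as np

def top_similar(query_vec, case_vecs, k=5):
    """Rank stored case embeddings by cosine similarity to the
    query case and return the indices of the k most similar.
    Normalizing both sides makes the dot product equal the cosine."""
    q = query_vec / np.linalg.norm(query_vec)
    m = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k]

# Four toy 2-D "case embeddings" and a query case.
cases = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
nearest = top_similar(np.array([1.0, 0.0]), cases, k=2)
```

Because every row is normalized first, a long case and a short case pointing in the same direction score identically, which is the length-invariance property discussed above.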
Risk Score Prediction
A number of regression models were trained using document embeddings as input and manually labeled risk scores as output, built with libraries such as AutoKeras, Keras (on TensorFlow), and XGBoost. The best-performing model, our custom Keras sequential neural network, was selected based on root mean square error (RMSE).
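The selection criterion itself is straightforward to sketch; the model names and predictions below are illustrative, not the project’s results:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between labeled and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def best_model(predictions, y_true):
    """Pick the candidate model with the lowest RMSE.
    `predictions` maps a model name to its predicted risk scores."""
    return min(predictions, key=lambda name: rmse(y_true, predictions[name]))
```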
Abuse Type & Risk Factor Extraction
We created more domain-specific tools to generate another source of insight via algorithms to find primary abuse types and risk factors.
For the abuse type extraction, we defined eight abuse-related verb categories such as “beat,” “molest,” and “neglect.” spaCy’s pre-trained English model en_core_web_lg and part-of-speech (POS) tagging were used to extract verbs and transform them into word vectors. Using cosine similarity, we compared each verb against the eight categories to find which abuse types most accurately capture the topic of the case.
Risk factor extraction works in a similar way, in that text was also tokenized and preprocessed using spaCy. This algorithm, however, further extended the previous abuse verbs by including additional risk words, such as “trauma,” “sick,” “war,” and “lack.” Instead of only looking at verbs, we compared each word in the case text (excluding custom stop words) against the risk words. Words that had a similarity score of over 0.65 with any of the risk factors were presented in their original form. This addition aimed to provide more transparency over what words may have affected the risk score.
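The matching logic can be sketched with toy word vectors standing in for spaCy’s `en_core_web_lg` vectors. The category words and vector values are illustrative; the 0.65 threshold follows the text above:

```python
import numpy as np

# Toy word vectors standing in for spaCy's en_core_web_lg vectors.
VECS = {
    "beat":    np.array([1.0, 0.1, 0.0]),
    "hit":     np.array([0.9, 0.2, 0.1]),
    "neglect": np.array([0.0, 1.0, 0.1]),
    "school":  np.array([0.0, 0.1, 1.0]),
}
CATEGORIES = ["beat", "neglect"]  # abridged; the project defined eight

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_categories(tokens, threshold=0.65):
    """For each token with a vector, report the abuse category it
    resembles most, but only if the similarity clears the 0.65
    threshold used by the risk-factor extractor."""
    hits = {}
    for tok in tokens:
        if tok not in VECS:
            continue
        best = max(CATEGORIES, key=lambda c: cos(VECS[tok], VECS[c]))
        if cos(VECS[tok], VECS[best]) >= threshold:
            hits[tok] = best
    return hits
```

Here “hit” maps to the “beat” category while “school” matches nothing, mirroring how only abuse-related words surface in the output.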
To put these models all together in a way that ISS caseworkers could easily understand and use, a simple user interface was developed using Flask, a lightweight Python web application framework. We also created forms via WTForms and graphs via Plotly, and let Bootstrap handle the overall stylization of the UI.
For the database, we used PostgreSQL, a relational database management system (RDBMS), along with SQLAlchemy, an object-relational mapper (ORM) that allows us to communicate with our database in a programmatic way. Our dataset, excluding the five confidential case files initially provided by ISS, was seeded into the database, which was then hosted on Amazon RDS.
A public Tableau dashboard to visualize the case files was also added, should caseworkers wish to refer to external resources and gain further insight on case outcomes.
The aim of this project was to assist ISS in offering services in a timely manner, bearing in mind the organizational history of knowledge available. Within eight weeks, we achieved this goal by providing an application prototype that would help caseworkers understand some of the various ways to leverage the power of data.
This tool, upon continued development, will be the first step toward ISS’s AI journey. And with enhanced capabilities, both experienced and less experienced caseworkers will be able to make better-informed decisions.
Our models do come with some limitations, mainly stemming from limited data due to privacy reasons and time constraints.
As our dataset only accounted for two types of services (child abuse and migration), and came from a small number of geographical sources, the risk score prediction model may contain biases. Bias could also have been induced by the manual labeling of risk scores, which was done based on our assumptions.
The solutions provided are not meant to replace human involvement. As 100% accuracy of a machine learning model can be difficult to achieve, the tool works best in combination with the judgment of a caseworker.
Moving Forward: AI and case management
To bring this tool closer to a production level, a few improvements can be made.
Risk scores can be validated by ISS caseworkers, who can provide further risk scores as they use our prediction model, enabling continuous learning.
The database fields can be more granular and include additional attributes as caseworkers see fit. For example, the current field “case_text” can be divided into “background_info,” “outcome,” and so on. This will create more flexibility in document search and flagging missing information.
Finally, once the app is productionized and deployed on a platform like AWS, ISS caseworkers across the globe will have access to these tools, as well as access to the entire network’s pool of resources — truly empowering caseworkers to do the work that matters.
Nature-based solutions (NbS) can help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes.
By Bala Priya, Simone Perazzoli, Nishrin Kachwala, Anju Mercian, Priya Krishnamoorthy, and Rosana de Oliveira Gomes
NbS harness nature to tackle environmental challenges that affect human society, such as climate change, water insecurity, pollution, and declining food production. NbS can also help societies and ecosystems adapt to drastic changes in the climate and mitigate the adverse impacts of such changes through, for example, growing trees in rural areas to boost crop yields and lock water in the soil. Although many organizations have investigated the effectiveness of NbS, these solutions have not yet been analyzed to the full extent of their potential.
In order to analyze such NbS approaches in greater detail, World Resources Institute (WRI) partnered with Omdena to better understand how regional and global NbS, such as forest and landscape restoration, can be leveraged to address and reduce climate change impacts across the globe.
In an attempt to identify and analyze such approaches, we investigated three main platforms that bring organizations together to promote initiatives that restore forests, farms, and other landscapes and enhance tree cover to improve human well-being:
Initiative 20×20: goal to begin restoring or protecting 50 million hectares by 2030 across Latin America and the Caribbean
Cities4Forests: partner with leading cities around the world to protect and restore forests
Given the above, the project’s goal was to assess the network of these three coalition websites through a systematic approach and to identify the climate adaptation measures covered by these platforms and their partners.
The integral parts of the project’s workflow included building a scalable data collection pipeline to scrape data from the platforms, their partner organizations, and several useful PDFs, and leveraging several Natural Language Processing techniques:
Building a Neural Machine Translation pipeline to translate non-English text into English
Performing sentiment analysis to identify potential gaps
Experimenting with language models best suited to the given use cases
Exploring supervised and unsupervised topic modeling techniques to surface meaningful insights and latent topics in the voluminous text data collected
Leveraging the novel Zero-Shot Classification (ZSC) approach to identify impacts and interventions
Building a Knowledge-Based Question Answering (KBQA) system and a recommender system
The platforms engaged in climate-risk mitigation were studied for several factors, including the climate risks in each region, the initiatives taken by the platforms and their partners, the NbS employed for mitigating climate risks, the effectiveness of adaptations, and each platform’s goals and road map, among others. This information was gathered through:
a) Heavy scraping of platform websites: This involved scraping data from all of the platforms’ website pages using Python scripts, with manual effort to customize the scraping for each page; accordingly, extending this approach to new pages would involve some effort. Approximately 10 MB of data was generated through this technique.
b) Light scraping of platform websites and partner organizations: This involved obtaining each platform’s sitemap; the organization websites were then crawled to obtain their text. This method can be extended to other platforms with minimal effort and generated around 21 MB of data.
c) PDF text data scraping of platform and other sites: The platform websites offered several informative PDF documents (including reports and case studies), which were helpful for the downstream models, including the Q&A and recommendation systems. This process was completely automated by the PDF text-scraping pipeline, which prepares a CSV of the PDF data and then generates a consolidated CSV file containing paragraph text from all the PDFs listed in the input CSV file. The pipeline can be used incrementally to generate PDF text in batches. The NLP models utilized all of the PDF documents from the three platform websites, as well as some documents with general information on NbS referred by WRI. Approximately 4 MB of data was generated from the available PDFs.
a) Data Cleaning: an initial step comprising the removal of unnecessary text, texts shorter than 50 characters, and duplicates.
b) Language Detection and Translation: This step involved developing a pipeline for language detection and translation, applied to the text data gathered from the three main sources described above.
Language detection was performed using libraries such as langdetect and pycld3. Once the language is detected, the result is used as an input parameter for the translation pipeline, in which pre-trained multilingual models are downloaded from the Helsinki-NLP repository on the Hugging Face hub. Text is tokenized and organized into batches that are fed sequentially into the pre-trained model. To improve performance, the pipeline was developed with GPU support where available. Also, once a model is downloaded, it is cached in memory so it does not need to be downloaded again.
The translation performed well with the majority of texts to which it was applied (most were in Spanish or Portuguese), being able to generate well-structured and coherent results, especially considering the scientific vocabulary of the original texts.
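The caching-and-batching design can be sketched with stub functions. The loader below returns a model name rather than a real model, and the language detector is a stand-in for a library like langdetect; only the structure (group by language, load each model once, cache it) reflects the pipeline described above:

```python
_model_cache = {}

def load_translation_model(lang):
    """Stub for downloading a Helsinki-NLP translation model from
    the Hugging Face hub; the cache ensures each language pair is
    fetched only once per run."""
    if lang not in _model_cache:
        _model_cache[lang] = f"Helsinki-NLP/opus-mt-{lang}-en"  # placeholder
    return _model_cache[lang]

def translate_batch(texts, detect=lambda t: "es"):
    """Group texts by detected source language so each model is
    loaded once and applied to its whole batch. `detect` is a
    stand-in for a real language-detection call."""
    batches = {}
    for t in texts:
        batches.setdefault(detect(t), []).append(t)
    return {lang: (load_translation_model(lang), batch)
            for lang, batch in batches.items()}

batches = translate_batch(["hola", "adios"])
```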
c) NLP preparation: This step was applied to the CSV files generated by the scraping and translation pipelines and comprised punctuation removal, stemming, lemmatization, stop-word removal, part-of-speech (POS) tagging, and chunking.
Statistical Analysis was performed on the preprocessed data by exploring the role of climate change impacts, interventions, and ecosystems involved in the three platforms’ portfolios using two different approaches: zero-shot classification and cosine similarity.
1.) Zero-Shot Classification: This model assigns probabilities indicating which user-defined labels best fit a text. We applied a zero-shot classification model from Hugging Face to classify descriptions against a given set of keywords covering climate-change impacts, interventions, and ecosystems for each of the three platforms. For ZSC, we combined the heavy-scraped datasets into one CSV per website. The scores computed by ZSC can be interpreted as probabilities that a label applies to a particular description. As a rule, we considered only scores at or above 0.85 as relevant.
Let’s consider the following example:
Description: The Greater Amman Municipality has developed a strategy called Green Amman to ban the destruction of forests. The strategy focuses on the sustainable consumption of legally sourced wood products that come from sustainably managed forests. The Municipality sponsors the development of sustainable forest management to provide long-term social, economic, and environmental benefits. Additional benefits include improving the environmental credentials of the municipality and consolidating GAM’s environmental leadership nationally, as well as improving the quality of life and ecosystem services for future generations.
Model Predictions: The model assigned the following probabilities based upon the foregoing description:
For the description above, we see that the Climate-Change-Impact prediction is ‘loss of vegetation’, the Types-of-Intervention prediction is ‘management’ or ‘protection’, and the Ecosystems prediction is empty.
2.) Cosine Similarity: Cosine similarity compares keyword vectors (generated through Hugging Face models) with description vectors, scoring how closely the vectors point in the same direction. We then plot the scores with respect to technical and financial partners and a set of keywords. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations. This approach was useful for validating the results of the ZSC approach.
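The underlying measure is just the cosine of the angle between two embedding vectors. A minimal NumPy sketch, using toy 3-dimensional vectors in place of real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; real ones come from a sentence-embedding model.
keyword_vec = np.array([1.0, 2.0, 0.0])
description_vec = np.array([2.0, 4.0, 0.0])   # same direction -> 1.0
unrelated_vec = np.array([0.0, 0.0, 3.0])     # orthogonal -> 0.0

print(cosine_similarity(keyword_vec, description_vec))  # 1.0
print(cosine_similarity(keyword_vec, unrelated_vec))    # 0.0
```

Because the measure depends only on direction, not magnitude, it is robust to differences in document length.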
Aligning these results, it was possible to answer the following questions:
What are the climate hazards and climate impacts most frequently mentioned by the NbS platforms’ portfolios?
What percentage of interventions/initiatives take place in highly climate-vulnerable countries or areas?
What ecosystem/system features most prominently in the platforms when referencing climate impacts?
This model was applied to descriptions from all three heavy-scraped websites, and we cross-referenced the results (such as Climate Change Impact vs. Intervention, Climate Change Impact vs. Ecosystems, and Ecosystems vs. Intervention) for all three websites. Further, we created plots by country and by partner (technical and financial) for all three websites.
Sentiment Analysis (SA) is the automatic extraction of sentiment from text, utilizing both data mining and NLP. Here, SA is applied to identify potential gaps and solutions in the corpus text extracted from the three main platforms. For this task, we implemented the following well-consolidated unsupervised approaches: VADER, TextBlob, AFINN, FlairNLP, and AdaptNLP Easy Sequence Classification. A new approach, BERT-clustering, was proposed by the Omdena team; it is based on BERT embeddings of positive/negative keyword lists and computes the distance of each embedded description to the corresponding cluster, where:
negative reference: words related to challenges and hazards, which give us a negative sentiment
positive reference: words related to NBS solutions, strategies, interventions, and adaptations outcomes, which give us a positive sentiment
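The BERT-clustering idea reduces to a nearest-centroid decision in embedding space. A toy sketch with hand-picked 2-dimensional vectors standing in for BERT embeddings (the real approach embeds the keyword lists and descriptions with BERT):

```python
import numpy as np

# Toy "embeddings" for the two keyword lists.
negative_keywords = np.array([[0.9, 0.1], [0.8, 0.2]])  # challenges, hazards
positive_keywords = np.array([[0.1, 0.9], [0.2, 0.8]])  # solutions, interventions

neg_centroid = negative_keywords.mean(axis=0)
pos_centroid = positive_keywords.mean(axis=0)

def sentiment(description_embedding: np.ndarray) -> str:
    """Label a description by its closer keyword-cluster centroid."""
    d_neg = np.linalg.norm(description_embedding - neg_centroid)
    d_pos = np.linalg.norm(description_embedding - pos_centroid)
    return "positive" if d_pos < d_neg else "negative"

print(sentiment(np.array([0.15, 0.85])))  # near the solution cluster -> 'positive'
print(sentiment(np.array([0.85, 0.15])))  # near the hazard cluster -> 'negative'
```

In practice the distances (or a margin between them) can also be thresholded to leave a neutral band, which is what the threshold values in Table 1 control.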
For modeling purposes, the threshold values adopted are presented in Table 1.
According to the scoring of the models presented in Table 2, AdaptNLP, Flair, and BERT/clustering approaches exhibited better performance compared to the lexicon-based models. Putting the limitations of unsupervised learning aside, BERT/clustering is a promising approach that could be improved for further scaling. SA can be a challenging task, since most algorithms for SA are trained on ordinary-language comments (such as from reviews and social media posts), while the corpus text from the platforms has a more specialized, technical, and formal vocabulary, which raises the need to develop a more personalized analysis, such as the BERT/clustering approach.
Across all organizations, it was observed that the content focuses on solutions rather than gaps. Overall, potential solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.
Topic Modeling is a method for automatically finding the topics that best represent the information within a collection of documents. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together. The algorithms/techniques explored for the project include Top2Vec, S-BERT, and Latent Dirichlet Allocation (LDA) with Gensim and spaCy.
Top2Vec: Word clouds of weighted sets of words best represented the information in the documents. The word cloud example from Topic Modeling shows that the topic is about deforestation in the Amazon and other countries in South America.
S-BERT: Identifies the top topics in the project texts from the three platforms. The top keywords that emerged from each dominant topic were manually categorized, as shown in the table. The project texts refer to forestry, restoration, reservation, grasslands, rural agriculture, farm owners, agroforestry, conservation, and infrastructure in rural South America.
LDA: In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution. A t-SNE visualization of keywords/topics in the 10k+ unique URLs inside 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available in the app deployed via Streamlit and Heroku. The distance between points in the 3D space represents the closeness of keywords/topics in the URLs. Each colored dot represents an organization; hovering over a point provides more information about the topics referred to in the URL. One can further group the URLs by color and analyze the data in greater depth. A t-SNE plot represents the dominant keywords from the three platforms’ partner organization documents, with each dot color representing a partner organization.
Other NLP/ML techniques
Besides the aforementioned, other techniques were also exploited in this project and will be presented in further articles, such as:
Network Analysis presents interconnections among the platform, partners, and connected websites. A custom network crawler was created, along with heuristics such as prioritizing NbS organizations over commercial linkages (this can be tuned) and parsing approx. 700 organization links per site (this is another tunable parameter). We then ran the script with different combinations of source nodes (usually the bigger organizations like AFR100, INITIATIVE20x20 were selected as sources to achieve the required depth in the network). Based on these experiments, we derived a master set of irrelevant sites (such as social media, advertisements, site-protection providers, etc.) that are not crawled by our software.
Knowledge Graphs represent the information extracted from the text on the websites based on the relationships within it. A pipeline was built to extract triplets based upon subject/object relationships, using StanfordNLP’s OpenIE on each paragraph. Subjects and objects are represented by nodes, and relations by the paths (or “edges”) between them.
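Turning extracted triplets into a graph structure is straightforward. The triples below are hypothetical stand-ins for OpenIE output; the sketch builds plain node and edge collections, which a library such as NetworkX could consume directly.

```python
# Hypothetical OpenIE-style triples: (subject, relation, object).
triples = [
    ("Municipality", "sponsors", "forest management"),
    ("forest management", "provides", "economic benefits"),
    ("Municipality", "developed", "Green Amman strategy"),
]

# Build a simple graph representation: a node set plus labeled edges.
nodes = set()
edges = []
for subj, rel, obj in triples:
    nodes.update([subj, obj])
    edges.append((subj, obj, {"relation": rel}))

print(sorted(nodes))
print(edges[0])  # ('Municipality', 'forest management', {'relation': 'sponsors'})
```

The edge-attribute tuples use the `(u, v, data)` shape NetworkX accepts, so `networkx.DiGraph(); G.add_edges_from(edges)` would render the same knowledge graph.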
Recommendation Systems: The recommender application is built upon the information extracted from the partners’ websites, with the goal of recommending possible solutions already available and implemented within WRI’s network of partners. The application allows a user to search for similarities across organizations (collaborative filtering) as well as similarities in the content of the solutions (content-based filtering).
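The content-based side can be sketched with TF-IDF vectors and cosine similarity. The solution descriptions here are invented for illustration; the real system works over text scraped from the partner websites.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical partner solution descriptions.
solutions = [
    "reforestation of degraded farmland with native trees",
    "urban tree planting to reduce city heat",
    "mangrove restoration for coastal protection",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(solutions)

def recommend(query: str) -> str:
    """Return the solution whose content is most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix)[0]
    return solutions[scores.argmax()]

print(recommend("planting trees on degraded farmland"))
# 'reforestation of degraded farmland with native trees'
```

Collaborative filtering would instead compare organizations by the overlap in solutions they implement, rather than by the text of the solutions themselves.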
Question & Answer System: Our knowledge-based Question & Answer (KQnA) system answers questions in the domain context of text scraped from the PDF documents on the main platform websites, a few domain-related PDF documents containing climate-risk and NbS information, and the light-scraped data obtained from the platforms and their partner websites.
The KQnA system is based on Facebook’s Dense Passage Retrieval (DPR) method, which provides better context by generating vector embeddings. A Retrieval-Augmented Generation (RAG) neural network generates a specific answer for a given question, conditioned on the retrieved documents; RAG derives most of the answer from the shortlisted documents. The KQnA system is built on the open-source Deepset.ai Haystack framework and hosted on a virtual machine, accessible via a REST API from the Streamlit UI.
The platform websites have many PDF documents containing extensive significant information that would take a lot of time for humans to process. The Q&A system is not a replacement for human study or analysis but helps ease such efforts by obtaining the preliminary information, linking the reader to the specific documents which have the most relevant answers. The same method was extended to light-scraped data, broadly covering the platform websites and their partner websites.
The PDF and light-scraped documents are stored in two different indices on Elasticsearch so that queries can run on the two streams separately. Dense Passage Retrieval is layered on the Elasticsearch retriever for contextual search, providing better answers. Elasticsearch filters can be applied to the platform/URL for a focused search on a particular platform or website. Elasticsearch 7.6.2, which is compatible with Deepset.ai Haystack, is installed on the VM. RAG is applied to the retrieved passages to produce a specific answer. Climate risks, NbS solutions, local factors, and investment opportunities are queried on the PDF data and platform data. Questions can be restricted by platform for PDF data and by URL for light-scraped data to perform a localized search.
By developing decision-support models and tools, we hope to make the NbS platforms’ climate change-related knowledge useful and accessible for partners of the initiative, including governments, civil society organizations, and investors at the local, regional, and national levels.
Any of these resources can be augmented with additional platform data, which would require customizing the data gathering effort per website. WRI could extend the keywords used in statistical analysis for hazards, the types of interventions, the types of ecosystems, and create guided models to gain further insights.
Data gathering pipeline
We have provided very useful utilities to collect and aggregate data and PDF content from websites. WRI can extend the web-scraping utility from the leading platforms and their partners to other platforms with some customization and minimal effort. Using the PDF utility, WRI can retrieve text from any PDF file. The pre-trained multilingual models in the translation utility can translate texts from various source languages into English or other supported languages.
Using zero-shot classification, predictions were made for the keywords that highlight Climate Hazards, Types of Interventions, and Ecosystems, based upon a selected threshold. Cosine similarity predicts the similarity of a document with regard to the keywords. Heat maps visualize both of these approaches. A higher similarity score means the organization is more associated with that hazard or ecosystem than other organizations.
SA identifies potential gaps from negative connotations derived from words related to challenges and hazards. A tree diagram visualizes the sentiment analysis for publications/partners/projects documents from each platform. Across all organizations, the content focuses on solutions rather than gaps. Overall, solutions and possible solutions make up 80% of the content, excluding neutral sentiment. Only 20% of the content references potential gaps. Websites typically focus more on potential gaps, while projects and partners typically focus on finding solutions.
Topic models are useful for identifying the main topics in documents. This provides high-level summaries of an extensive collection of documents, allows for a search for records of interest, and groups similar documents together.
With semantic search using Top2Vec, word clouds of weighted sets of words best represented the information in the documents. The word cloud example from Topic Modeling shows that the topic is about deforestation in the Amazon and other countries in South America.
S-BERT: Identifies the top topics in the project texts from the three platforms. The top keywords that emerged from each dominant topic were manually categorized, as shown in the table. The project texts refer to forestry, restoration, reservation, grasslands, rural agriculture, farm owners, agroforestry, conservation, and infrastructure in rural South America.
In LDA topic modeling, once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.
A t-SNE visualization of keywords/topics in the 10k+ unique URLs inside 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests) is available in the app deployed via Streamlit and Heroku.
The distance between points in the 3D space represents the closeness of keywords/topics in the URLs.
Each colored dot represents an organization; hovering over a point provides more information about the topics referred to in the URL.
One can further group the URLs by color and analyze the data in greater depth.
A t-SNE plot represents the dominant keywords from the three platforms’ partner organization documents, with each dot color representing a partner organization.
This work has been part of a project with World Resources Institute.
Although many organizations have investigated the effectiveness of nature-based solutions (NbS) in helping people build thriving urban and rural landscapes, such solutions have not yet been analyzed to the full extent of their potential. With this in mind, the World Resources Institute (WRI) partnered with Omdena to understand how regional and global NbS can be leveraged to address and reduce the impact of climate change.
Our objective was to understand how three major coalitions, all of which embrace the key NbS of forest and landscape restoration, use their websites to build networks. We used a systematic approach to identify the climate adaptation measures that these platforms and their partners feature on their websites.
As the first step, information from the three NbS platforms and their partners, and relevant documents were collected using a scalable data collection pipeline that the team built.
Why Topic Modeling?
Collecting all texts, documents, and reports by web scraping the three platforms resulted in hundreds of documents and thousands of chunks of text. Given the huge volume of text data thus obtained, and the infeasibility of manually analyzing such a large text dataset to understand it and gain meaningful insights, we leveraged Topic Modeling, a powerful NLP technique, to understand the impacts, the NbS approaches involved, and the various initiatives in this direction.
A topic is a collection of words that are representative of specific information in text form. In the context of Natural Language Processing, extracting latent topics that best describe the content of the text is described as Topic modeling.
Topic Modeling is effective for:
Discovering hidden patterns that are present across the collection of topics.
Annotating documents according to these topics.
Using these annotations to organize, search, and summarize texts.
It can also be thought of as a form of text mining to obtain recurring patterns of words in a corpus of text data.
The team experimented with topic modeling approaches that fall under unsupervised learning, semi-supervised learning, deep unsupervised learning, and matrix factorization. The team analyzed the effectiveness of the following algorithms in the context of the problem.
Topic Modeling using Sentence BERT (S-BERT)
Latent Dirichlet Allocation (LDA)
Non-negative Matrix Factorization (NMF)
Correlation Explanation (CorEx)
Top2Vec is an unsupervised algorithm for topic modeling and semantic search. It automatically detects topics present in the text and generates jointly embedded topic, document, and word vectors.
Data sources used in this modeling approach were the data obtained from heavy scraping of the platforms Initiative 20×20 and Cities4Forests, data from the light-scraping pipeline, and combined data from all websites. Top2Vec performs well on reasonably large datasets.
There are three key steps taken by Top2Vec:
Transform documents into numeric representations (jointly embedded document and word vectors).
Reduce the dimensionality of the vectors and cluster the documents to find topics.
Find the words closest to each topic’s centroid; these become the topic’s representative words.
The topics present in the text are visualized using word clouds. Here’s one such word cloud that talks about deforestation and loss of green cover in certain habitats and geographical regions.
The algorithm can output from which platform and which line in the document the topics were found. This can help identify the organizations working towards similar causes.
In this approach, we aim to derive topics from clustered documents using a class-based variant of the Term Frequency-Inverse Document Frequency score (c-TF-IDF), which allows extracting the words that make each set of documents, or class, stand out compared to the others.
The intuition behind the method is as follows. When one applies TF-IDF as usual on a set of documents, one compares the importance of words between documents. For c-TF-IDF, one treats all documents in a single category (e.g., a cluster) as a single document and then applies TF-IDF. The result is a very long document per category and the resulting TF-IDF score would indicate the important words in a topic.
The S-BERT package extracts different embeddings based on the context of the word. Moreover, many pre-trained models are available, ready to be used. The number of top words per topic, with their scores, is shown below.
Latent Dirichlet Allocation (LDA)
Collecting all texts, documents, and reports by web scraping the three platforms resulted in hundreds of documents and thousands of chunks of text. Using Latent Dirichlet Allocation (LDA), a popular algorithm for extracting hidden topics from large volumes of text, we discovered topics covering the NbS and climate hazards underway at the NbS platforms.
LDA’s approach to topic modeling is to consider each document as a collection of various topics, and each topic as a collection of words with certain probability scores.
In practice, the topic structure, per-document topic distributions, and the per-document per-word topic assignments are latent and have to be inferred from observed documents.
Once the number of topics is fed to the algorithm, it will rearrange the topic distribution in documents and word distribution in topics until there is an optimal composition of the topic-word distribution.
LDA with Gensim and Spacy
As every algorithm has its pros and cons, Gensim is no exception.
Pros of using Gensim LDA are:
Provision to use N-grams for language modeling instead of only considering unigrams.
pyLDAvis for visualization
Gensim LDA is a relatively more stable implementation of LDA
Two metrics for evaluating the quality of our results are the perplexity and the coherence score.
Topic coherence scores a single topic by measuring how semantically close that topic’s high-scoring words are.
Perplexity is a measure of surprise: it measures how well the topics in a model match a set of held-out documents. If the held-out documents have a high probability of occurring, the perplexity score will be lower. The statistic makes most sense when comparing models with varying numbers of topics; the model with the lowest perplexity is generally considered the “best”.
We choose the optimal number of topics by plotting the number of topics against the coherence scores they yield and picking the one that maximizes the coherence score. On the other hand, if words repeat often across topics in the final results, we should choose a lower number of topics, even at the cost of a lower coherence score.
One addition that can markedly improve Gensim’s final results is the MALLET library, an efficient implementation of LDA that runs faster and gives better topic separation.
Visualizing the topics using pyLDAvis gives a global view of the topics and how they differ in terms of inter-topic distance, while at the same time allowing a more in-depth inspection of the most relevant words in individual topics. The size of a bubble is proportional to the prevalence of its topic. Better models have relatively large, well-separated bubbles spread out amongst the quadrants. When hovering over a topic bubble, its most dominant words appear on the right in a histogram.
Using the 10k+ unique URLs inside 34 partner organization websites (partners of AFR100, Initiative 20×20, and Cities4Forests), documents were scraped and topics were extracted with Python’s Gensim LDA package. For visualizing complex data in three dimensions, we used scikit-learn’s t-SNE with Plotly.
Below is a visual of the partner organizations’ 3D projection, for which the topic distributions were grouped manually. The distance between points in the 3D space represents the closeness of keywords/topics in the URLs. Each colored dot represents an organization. Hovering over a point provides more information about the topics referred to in the URL. One can further group the URLs by color and analyze the data in greater depth.
Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization is an unsupervised learning algorithm.
It takes in the term-document matrix of the text corpus and decomposes it into a document-topic matrix and a topic-term matrix, which quantify how relevant the topics are in each document of the corpus and how vital each term is to a particular topic.
We use the rows of the resulting topic-term matrix to get a specified number of topics. NMF is known to capture diverse topics in a text corpus and is especially useful for identifying latent topics that are not explicitly discernible from the text documents. Here’s an example of the topic word clouds generated on the light-scraped data. When we would like the topics to stay within a specific subset of interest, or to be contextually more informative, we may use semi-supervised topic modeling techniques such as Guided LDA (or Seeded LDA) and CorEx (Correlation Explanation) models.
Guided Latent Dirichlet Allocation (Guided LDA)
Guided LDA is a semi-supervised topic modeling technique that takes in certain seed words per topic, and guides the topics to converge in the specified direction.
When we would like to get contextually relevant topics such as climate change impacts, mitigation strategies, and initiatives, setting a few prominent seed keywords per topic enables us to obtain topics that help understand the content of the text in the directions of interest.
For example, in the data from the Cities4Forests platform, the following are some of the seed words that were used to obtain topics containing the most relevant keywords.
CorEx is a discriminative topic model. It estimates the probability that a document belongs to a topic given that document’s words, and can be used for discovering themes from a collection of documents; further analysis, such as clustering, searching, or organizing the collection of themes, can then yield insights.
Total Correlation (TC) is the measure that CorEx maximizes when constructing the topic model. CorEx starts its algorithm with a random initialization, so different runs can result in different topic models. One way of finding the best topic model is to run the CorEx algorithm several times and take the run with the highest TC value (i.e., the run that produces topics most informative about the documents). A topic’s underlying meaning is often interpreted by the individuals building the models, who give it a name or category reflecting their understanding of the topic; this interpretation is a subjective exercise. Using anchor keywords, domain-specific topics (NbS and climate change in our case) can be integrated into the CorEx model, alleviating some of these interpretability concerns. The TC measure for the model with and without anchor words is shown below, with the anchored models performing better.
After tuning the anchored model’s hyperparameters (anchor strength, anchor words, and number of topics), making several runs to find the best model, and cleaning up duplicates, the top topic is shown below.
plants animals, socio bosque, sierra del divisor, national parks, restoration degraded, water nutrients, provide economic, restoration project
An Interpretation => National parks in Peru and Ecuador, which were significantly losing hectares to deforestation, are in restoration by an Initiative 20×20-affiliated project. This project also protects the local economy and endangered animal species.
From the analysis of various topic modeling approaches, we summarize the following.
Compared to other topic modeling algorithms, Top2Vec is easy to use: it leverages joint document and word semantic embeddings to find topic vectors and does not require the text pre-processing steps of stemming, lemmatization, or stop-word removal.
For Latent Dirichlet Allocation, the usual text pre-processing steps are needed to obtain optimal results. As the algorithm does not use contextual embeddings, it cannot fully account for semantic relationships, even when considering n-gram models.
Topic Modeling is thus effective in gaining insights about latent topics in a collection of documents, which in our case was domain-specific, concerning documents from platforms addressing climate change impacts.
Limitations of topic modeling include the requirement of a lot of relevant data and consistent structure to be able to form clusters and the need for domain expertise to interpret the relevance of the results. Discriminative models with domain-specific anchor keywords such as CorEx can help in topic interpretability.
Applying various data science tools and methods to visualize climate change impacts.
By Nishrin Kachwala, Debaditya Shome, and Oscar Chan
Day by day, as we generate exponentially more data, we also sift through more of its complexity and consume more of it. Filtering for relevance is essential to get to the gist of the data in front of us. It is often claimed that the human brain absorbs a picture 60,000 times faster than text, and that about 65% of people are visually inclined.
To tell a climate-change data story that goes beyond analysis and investigation, we needed to surface trends and support decision-making. Visualizing information is essential to practical data science: to explore the data, preprocess it, tune the model to the data, and ultimately gain insights to act on.
No data story is complete without the inclusion of great visuals.
Understanding the impact of Nature-based solutions on climate change
The World Resources Institute (WRI) sought to understand the regional and global landscape of Nature-based Solutions (NbS).
How are some NbS platforms addressing climate hazards?
What type of NbS solutions are adapted?
What barriers and opportunities exist, etc.
The focus was initially on three platforms, AFR100, Cities4Forests, and Initiative 20×20, with plans to later scale the work to more platforms.
More than 30 Omdena AI engineers worked on this NLP problem to derive several actionable insights, develop a recommendation system and a knowledge-based Q&A system to query the data from the NbS platforms, and extract sentiments from the data to find potential gaps. Topic Modeling was applied to derive dominant topics from the data, and website network analysis of organizations and statistical analysis helped to explore the involvement of ‘climate change impacts’, ‘interventions’, and ‘ecosystems’ for the three platforms.
Using Streamlit, we built a highly interactive, shareable web application (dashboard) to zoom into the NLP results for actionable insights on Nature-based Solutions. The Streamlit app was deployed to the web using Heroku. A major advantage of using Streamlit is that it allows developers to build a sophisticated dashboard with multiple elements, such as Plotly graph objects, tables, and interactive controls, from Python scripts, without additional HTML code for layout definition. This allows multiple project outputs to be incorporated into the same dashboard swiftly, with minimal code.
Overview of the Dashboard
The dashboard consists of five major sections of the results, where users can navigate across each section using the navigation pull-down menu on the left side-bar, and use other functionalities on the side-bar to select the content they would like to see. The following will describe the components in each of the sections.
Choropleth Map View
Choropleth maps use colors on a diverging scale to represent change: here, a diverging color scale across countries represents the magnitude of climate change over time.
The analysis considers yearly data of country-level climate and landscape parameters, such as land type cover, temperature, and soil moisture, across the major platforms’ participating countries. Deforestation evaluation used the Hansen and MODIS Land Cover Type datasets. The temperature change analysis used the MODIS Land Surface Temperature dataset. And the NASA-USDA SMAP Global Soil Moisture dataset was used to assess land degradation. Each year’s changes in the climate parameters are computed compared to the earliest year available in the data. The calculated changes each year are plotted on the choropleth maps based on the predefined diverging color scale, and users can select the year to view using the slider above the map on the dashboard.
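The change-versus-baseline computation can be sketched with pandas. The country values below are invented for illustration; the real figures are country-level aggregates of the MODIS and SMAP datasets.

```python
import pandas as pd

# Hypothetical country-level yearly temperature averages.
df = pd.DataFrame({
    "country": ["Peru", "Peru", "Peru", "Kenya", "Kenya", "Kenya"],
    "year":    [2015, 2017, 2019, 2015, 2017, 2019],
    "temp_c":  [22.0, 22.4, 21.6, 25.0, 25.3, 25.9],
})

# Baseline = each country's value in its earliest available year;
# the plotted quantity is the change relative to that baseline.
baseline = df.sort_values("year").groupby("country")["temp_c"].transform("first")
df["change_vs_baseline"] = df["temp_c"] - baseline

print(df[df["year"] == 2019][["country", "change_vs_baseline"]])
```

Each year's `change_vs_baseline` column is what gets mapped onto the diverging color scale of the choropleth, with the slider selecting the year.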
Take the change in temperature across participating countries as an example. The graph shows that the average yearly temperature in most South American countries and Central-Eastern African countries decreased by around 0.25 to 1.3 °C in 2019 compared to 2015. In contrast, participating countries in northern Africa and Mexico show an increase in heat, with temperatures rising over the same period. Such differences in temperature change are easily represented by the diverging color scale, where red represents an increase in heat and blue a decline.
Heat Map View
Heat maps represent the intensity of attention from the nature-based solution platforms and how each climate risk matches with NbS interventions across platforms. The two heat maps illustrate measurements of attention intensity from each NbS platform: the first shows document frequencies, and the second a calculation of hazard-to-ecosystem match scores. Users can filter the data visualization of interest using the checkbox on the sidebar and the pull-down menu in the top-left corner, selecting the corresponding NbS platform.
As an example, the heatmap above shows the number of documents and websites related to climate impacts and the corresponding climate intervention strategies from the Initiative 20×20 platform. Users can see that the land degradation problem has received the most attention from the platform, with restoration, reforestation, restorative farming, and agroforestry being the major climate intervention strategies correlated with the land degradation problem. The heatmap also shows that attention to solutions for some climate risks, such as wildfires, air and water pollution, disaster risk, bushfires, and coastal erosion, is relatively limited on the Initiative 20×20 platform compared to other risks.
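The document-frequency matrix behind such a heat map is a simple cross-tabulation. The document tags below are hypothetical; in the project each scraped document is labeled with a climate risk and an intervention by the NLP pipeline.

```python
import pandas as pd

# Hypothetical per-document labels produced by the classification step.
docs = pd.DataFrame({
    "risk":         ["land degradation", "land degradation", "wildfires",
                     "land degradation", "coastal erosion"],
    "intervention": ["restoration", "agroforestry", "restoration",
                     "restoration", "restoration"],
})

# Document-frequency matrix: how many documents pair each risk with
# each intervention. Plotting this matrix yields the heat map.
heat = pd.crosstab(docs["risk"], docs["intervention"])
print(heat)
```

Feeding `heat` to a Plotly heatmap (with risks on one axis and interventions on the other) reproduces the view described above, where the largest cell counts mark the risk-intervention pairs receiving the most attention.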
Apart from the heatmap itself, the dashboard design leaves room for linking to external resources based on the information presented in the heatmap. Similar to the interactive tool in the Nature-based Solutions Evidence Platform by the University of Oxford, where users can access external cases by clicking on the heatmaps, users can use the pull-down menus below the heatmap to browse the list of links and documents behind each document count shown. For example, the attached figure shows the results when a user selects the restoration effort in response to land degradation on Initiative 20×20: the user can read a brief description of each page and its keywords, and access the external site via the hyperlink.
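The document-frequency matrix behind such a heatmap can be sketched as follows. The hazard/intervention tags below are toy examples standing in for the scraped corpus:

```python
from collections import Counter

# Toy (hazard, intervention) tags extracted per document; in the project,
# these tags came from the scraped Initiative 20x20 corpus.
doc_tags = [
    ("land degradation", "restoration"),
    ("land degradation", "reforestation"),
    ("land degradation", "restoration"),
    ("wildfires", "reforestation"),
]

counts = Counter(doc_tags)
hazards = sorted({h for h, _ in doc_tags})
interventions = sorted({i for _, i in doc_tags})

# Document-frequency matrix: rows = hazards, columns = interventions.
matrix = [[counts[(h, i)] for i in interventions] for h in hazards]
print(hazards)        # ['land degradation', 'wildfires']
print(interventions)  # ['reforestation', 'restoration']
print(matrix)         # [[1, 2], [1, 0]]
```

A plotting library then renders `matrix` as the heatmap, with the zero cells exposing the under-attended risk/intervention pairs discussed above.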
Potential gap/solution identification
This section presents the results of our sentiment analysis models. The goal was to identify which projects, publications, and partners of the major NbS platforms were addressing potential gaps or solutions for climate change. A gap corresponds to a negative sentiment, meaning some negative impact on climate change; similarly, a solution corresponds to a positive sentiment, implying a positive impact. The output of this sentiment analysis subtask was three hierarchical data frames, covering the projects, publications, and partners of AFR100, Initiative 20×20, and Cities4Forests. To present these large data frames compactly, we used treemap and sunburst plots. Treemap charts visualize hierarchical data using nested rectangles; sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. The hierarchy is grouped first by platform, then by the countries within each platform, then by the projects associated with each country; clicking deeper reveals the description and keywords for each project. The size of a rectangle or sector represents how certain the model is that there is a potential gap or solution.
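Flattening such a hierarchy into the labels/parents/values triples that treemap and sunburst plots expect can be sketched as below. Apart from the platform name AFR100, the countries, project names, and confidence scores are illustrative, not project data:

```python
# Toy hierarchy: platform -> country -> project, with a sentiment-confidence
# score used as the rectangle/sector size.
hierarchy = {
    "AFR100": {
        "Kenya": {"Project A": 0.9, "Project B": 0.4},
        "Rwanda": {"Project C": 0.7},
    },
}

def to_treemap(tree):
    """Flatten a nested dict into the labels/parents/values triples
    that plotting libraries such as Plotly's Treemap expect."""
    labels, parents, values = [], [], []
    for platform, countries in tree.items():
        labels.append(platform); parents.append(""); values.append(0)
        for country, projects in countries.items():
            labels.append(country); parents.append(platform); values.append(0)
            for project, confidence in projects.items():
                labels.append(project); parents.append(country)
                values.append(confidence)
    return labels, parents, values

labels, parents, values = to_treemap(hierarchy)
print(labels)
# ['AFR100', 'Kenya', 'Project A', 'Project B', 'Rwanda', 'Project C']
```

Internal nodes get a value of 0 here so that leaf sizes determine the layout; the same triples feed a sunburst plot unchanged.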
This pull-down tab contains the network analysis and knowledge graphs. Knowledge Graphs (KGs) represent raw information (in our case, texts from the NbS platforms) in a structured form, capturing relationships between entities.
In network analysis, concepts (nodes) are identified from the words in the text, and the edges between nodes represent relations between the concepts. The network helps one visualize the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible that are not explicit in the text. Visualizing texts as networks allows one to focus on the important aspects of a text without reading large amounts of it. Visuals for the knowledge graphs and network analysis can be seen in the GIF above.
Knowledge-based Question-Answer System
The knowledge-based Question & Answering NLP system aims to answer questions in the context of text scraped from the NbS platforms and the PDF documents available on their websites. The system is built on the open-source deepset.ai Haystack framework and hosted on a virtual machine, accessible via a REST API and the Streamlit dashboard.
The recommendation system uses content-based filtering and collaborative filtering. Collaborative filtering uses the “wisdom of the crowd” to recommend items. Our collaborative recommendations are based on indicators from World Bank data and keyword similarity using the StarSpace model by Facebook. In the dashboard, one can select multiple indicators for a platform and see the platforms related to the selected one.
A content-based filtering recommendation is based on the description of an item and a profile of the user’s preferences.
Content-based filtering suggests similar organizations, projects, news articles, blog articles, publications, etc. for a selected organization. The StarSpace model was used to obtain word embeddings, and a similarity analysis was then performed comparing the description of the selected organization with those of all the other organizations. Different projects, publications, news articles, etc. can be selected as options, based on which related organizations are recommended.
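The similarity step can be sketched with a plain cosine similarity over description embeddings. The organization names and vectors below are toy stand-ins for the StarSpace embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy description embeddings; the project derived these from StarSpace.
embeddings = {
    "Org A": [0.9, 0.1, 0.0],
    "Org B": [0.8, 0.2, 0.1],
    "Org C": [0.0, 0.1, 0.9],
}

def recommend(selected, k=2):
    """Rank all other organizations by embedding similarity to the selected one."""
    query = embeddings[selected]
    scores = [(name, cosine(query, vec))
              for name, vec in embeddings.items() if name != selected]
    return [name for name, _ in sorted(scores, key=lambda s: -s[1])[:k]]

print(recommend("Org A"))  # ['Org B', 'Org C']
```

In the real dashboard, the vectors come from StarSpace embeddings of the organizations’ descriptions rather than hand-written lists.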
Keyword Analysis of Partner Organizations
This section includes an intuitive 3D t-SNE visualization of all keywords/topics in the 12,801 unique URLs across 34 partner organization websites. The goal of each organization, as displayed in the hover label, was the output of topic modeling with Latent Dirichlet Allocation (LDA).
What is a t-SNE plot?
t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality-reduction algorithm that is well-suited for visualizing high-dimensional data. The idea is to embed high-dimensional points in low dimensions in a way that respects the similarities between points.
We obtained embeddings for every URL’s full text using the widely known Sentence Transformers library on Hugging Face. These high-dimensional embeddings were used as input to the t-SNE model, which produced projections in three dimensions. These projections can be seen below in the interactive 3D visualization.
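The projection step can be sketched with scikit-learn’s TSNE. Random vectors below stand in for the real sentence embeddings, and 384 is a typical sentence-transformer dimension, not a project-confirmed value:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for sentence-transformer embeddings of each URL's text:
# 50 documents embedded in 384 dimensions (random toy data here).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 384))

# Project to 3-D for the interactive plot. Perplexity must stay below
# the number of samples; Barnes-Hut supports output dimensions up to 3.
tsne = TSNE(n_components=3, perplexity=10, random_state=0)
projections = tsne.fit_transform(embeddings)
print(projections.shape)  # (50, 3)
```

The `(n, 3)` array is then fed to an interactive 3D scatter plot, with the LDA topic of each URL shown in the hover label.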
Advantages of this visual?
There were 12,801 URLs under these 34 organizations; going through all of them and figuring out what each URL talks about would take a huge amount of time, as some websites had nearly 1M words in their About section alone. This visual helps anyone who wants to know what each organization is discussing without having to go through those URLs’ descriptions manually.
Today, data visualization has become an essential part of the story: no longer a pleasant enhancement, but something that adds depth and perspective to a narrative. In our case, geo-plots, heatmaps, network diagrams, treemaps, drop-down and filter elements, and 3D interactive plots guide the reader step by step through the narrative.
We have only explored a few visuals from the multitude available, developed by the Omdena data science enthusiasts. With the visual dashboard, we hope to provide the viewer with a more robust connection between critical insights about Nature-based Solutions and their adaptation. The dashboard is portable and can be shared within the climate change community, driving engagement and sparking new ideas.
How to create a powerful NLP Q&A system in 8 weeks that resolves queries on a domain-specific knowledge base?
Authors: Aruna Sri T., Tanmay Laud
The above snapshot is a classic example of how our Q&A system works. Users can ask questions pertaining to climate hazards (or their solutions) and the model will carve out relevant answers from the collected Knowledge Base.
The model suggests ‘Mangrove forests’, ‘improved river-flow regulation’, and ‘stone bunds’ to tackle floods.
What’s more? We also pinpoint the context that was used to generate the answer, along with a hyperlink for those who want to go into more depth (panaroma.solutions in this case).
Let’s explore why we built this system in the first place.
Why Q&A Systems for Nature-based Solutions?
World Resources Institute (WRI) seeks to understand how nature-based solutions (NbS) like forest and landscape restoration can minimize the impacts of climate change on local communities.
Nature-based Solutions (NbS) are a powerful ally to address societal challenges, such as climate change, biodiversity loss, and food security. As the world strives to emerge from the current pandemic and move towards the UN Sustainable Development Goals, it is imperative that future investments in nature reach their potential by contributing to the health and well-being of people and the planet.
There is a growing interest from governments, business, and civil society in the use of nature for simultaneous benefits to biodiversity and human well-being.
The websites of the initiatives that advocate for NbS are full of information: annual reports, the climate risks of a region, impact in the region, how the adaptations help mitigate climate effects, challenges faced, local factors, socio-economic conditions, employment, investment opportunities, government policies, community involvement, inter-platform efforts, etc. Studying all of this takes a lot of manual effort. The QnA system eases this effort by surfacing preliminary information on the focus areas, so the reader can then go to the specific documents that contain the relevant answers.
Thus, our system is not a replacement for the human study or analysis, but a tool that highlights the focal points of key information by giving quick and relevant answers from the curated Knowledge Base.
The tool provides climate adaptation researchers with the ability to search a massive quantity of textual information and analyze the impact of NbS on society and ecosystems. Broadly, it answers questions such as:
How are regional/global platforms addressing climate change impacts?
What is the current state of landscapes, barriers, and opportunities?
When and how are the NbS being implemented and in which regions?
But how did we do it? Let us deep-dive into the details of our project.
Data Collection and Preprocessing
The focus of this project was to study the following Coalitions that bring together organizations to promote forest and landscape restoration, enhancing human well-being:
• AFR100 (the African Forest Landscape Restoration Initiative to place 100 million hectares of land into restoration by 2030)
• Initiative 20×20 (the Latin American and Caribbean initiative to protect and restore 50 million hectares of land by 2030)
• Cities4Forests (leading cities partnering to combat climate change, protect watersheds and biodiversity, and improve human well-being)
In this article, we shall focus on the PDF text scraping pipeline and SOTA Knowledge-Based Question Answers system.
Deep-dive into PDF text scraping
The websites on our radar host a good number of PDF documents containing a plethora of information, such as annual reports, case studies, and research insights, which is useful to policymakers and domain experts for understanding the latest trends across the world. This text extraction process is automated using GROBID.
GROBID (or GeneRation Of BIbliographic Data) is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications. TEI XML is the industry standard for representing document content separately from its presentation. We downloaded and ran the open-source server on our local system. In a complete PDF processing run, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author names, affiliation types, detailed address, journal, volume, issue, etc.) to full-text structures (section titles, paragraphs, reference markers, head/foot notes, figure headers, etc.).
The steps are as follows:
The pipeline takes a list of PDF URLs and generates a consolidated CSV file containing the paragraph text from all the PDF documents, along with metadata such as the source system, the download link to the PDF, paragraph titles, etc.
All the PDF documents on the three platform websites, plus some documents with generic information on nature-based solutions, are included.
The documents are then converted to TEI XML format using the GROBID service.
These TEI XML documents are further scraped for headers and paragraphs using Python’s Beautiful Soup parser.
This utility replaces the manual effort of extracting paragraph text from PDF documents, producing text that can be used for further analysis by downstream NLP models.
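The TEI-scraping step can be sketched as follows. The project used Beautiful Soup, but to keep this sketch dependency-free it uses the standard library’s ElementTree on a toy TEI fragment with made-up content:

```python
import xml.etree.ElementTree as ET

# A minimal TEI fragment of the kind GROBID produces (toy content).
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Restoration outcomes</head>
      <p>Reforestation improved soil moisture.</p>
      <p>Local communities were involved.</p>
    </div>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

# Collect (section title, paragraph text) rows for the consolidated CSV.
rows = []
for div in root.iter("{http://www.tei-c.org/ns/1.0}div"):
    head = div.find("tei:head", NS)
    title = head.text if head is not None else ""
    for p in div.findall("tei:p", NS):
        rows.append({"title": title, "text": p.text})

print(rows[0])
# {'title': 'Restoration outcomes', 'text': 'Reforestation improved soil moisture.'}
```

In the actual pipeline, each row would also carry the source platform and the PDF download link before being written to the CSV.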
A Knowledge-Based Question Answering System
The knowledge-based Question & Answering (KQnA) NLP system aims to answer domain-specific questions over the text scraped from PDF publications and partner websites.
The KQnA system is based on Facebook’s Dense Passage Retrieval (DPR) method. Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector-space models, such as TF-IDF or BM25, are the de facto method. DPR embeddings are learned from a small number of questions and passages with a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, the dense retriever outperforms a strong Lucene-BM25 system by a large margin of 9%-19% absolute in terms of top-20 passage retrieval accuracy.
The Retrieval-Augmented Generation (RAG) model combines the strengths of pre-trained dense retrieval (DPR) and sequence-to-sequence models. A RAG model retrieves documents, passes them to a seq2seq model, and then marginalizes over them to generate outputs. The retriever and seq2seq modules are initialized from pre-trained models and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks. It is based on the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela.
Building such a system from scratch has its own set of challenges, as the process is time-consuming and prone to bugs. Hence, we utilized Haystack, an open-source framework provided by deepset.ai, for our QnA pipeline. Let us go through our custom model pipeline in detail.
The model is powered by a Retriever-Reader pipeline in order to optimize for both speed and accuracy.
Elasticsearch, running on a VM, was used as the document store. The PDF and website texts are stored in two different Elasticsearch indices so that on-demand queries can be run on the two streams separately. Elasticsearch filters can be applied on the platform/URL for a focused search on a particular platform or website. Elasticsearch 7.6.2 was installed on the VM, which is compatible with Haystack (as of December 2020).
Readers are powerful models that do a close analysis of documents and perform the core task of question answering.
We used the FARM Reader, which makes transfer learning with BERT & Co. simple, fast, and enterprise-ready. It is built upon Transformers and provides additional features to simplify developers’ lives: parallelized preprocessing, a highly modular design, multi-task learning, experiment tracking, easy debugging, and close integration with AWS SageMaker. With FARM, you can build fast proofs of concept for tasks like text classification, NER, or question answering and transfer them easily into production.
The Retriever assists the Reader by acting as a lightweight filter that reduces the number of documents that the Reader has to process. It does this by:
Scanning through all documents in the database
Quickly identifying the relevant and dismissing the irrelevant
Passing on only a small candidate set of documents to the Reader
Finally, RAG is applied on top of these results to generate a specific answer.
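The Retriever-Reader flow can be illustrated with a deliberately tiny stand-in. The real pipeline used Haystack’s DPR retriever and FARM reader over Elasticsearch; here the retriever is a simple term-overlap scorer and the “reader” a stub that picks the best-matching sentence, with toy documents in place of the knowledge base:

```python
# Toy knowledge base standing in for the Elasticsearch document store.
documents = [
    "Mangrove forests act as coastal defenses against floods.",
    "Agroforestry improves soil fertility in degraded land.",
    "Stone bunds slow water runoff on farmland slopes.",
]

def tokenize(text):
    return set(text.lower().replace(".", "").replace(",", "").split())

def retrieve(question, docs, top_k=2):
    """Retriever: a lightweight filter that ranks documents by term
    overlap with the question and passes only the top-k to the reader."""
    q = tokenize(question)
    return sorted(docs, key=lambda d: -len(q & tokenize(d)))[:top_k]

def read(question, candidates):
    """Stand-in reader: return the candidate sharing the most terms.
    The real reader runs a BERT-style QA model over each candidate."""
    q = tokenize(question)
    return max(candidates, key=lambda d: len(q & tokenize(d)))

question = "how to tackle floods"
answer = read(question, retrieve(question, documents))
print(answer)  # Mangrove forests act as coastal defenses against floods.
```

The shape is the same as the production system’s: the cheap retrieval stage narrows the candidate set so the expensive reading stage only runs on a handful of documents.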
The complete pipeline is as follows:
The model is hosted on Omdena AWS server and exposed via a REST API. We used Streamlit to produce a fast and scalable dashboard.
We started this article with an example question about floods, where the target search was a website (panaroma.solutions). The above animation presents a similar question, but this time the search is performed on PDFs. We get the answers “moving flood protection infrastructures” and “levees and dams”.
A more refined search such as “What is the best way to tackle floods?” returns the following:
We observe that the system does not answer “levees and dams”, since those are not nature-based solutions. Instead, we get documents that talk about “ecosystem-based coastal defenses”.
Let us look at a few more examples:
“What is the impact of NbS ?” — search all websites
“What is the impact of NbS ?” — search on climatefocus.com
Here, we can understand the general notion around NbS and compare it to content from specific websites such as climatefocus.com. This lets us contrast the work being done by different websites and how they approach NbS. (In comparison to the earlier query across all websites, this one is run for a single website.)
NOTE: The questions posted above are just samples and do not directly indicate an outcome from the Omdena project. The work is a Proof-Of-Concept and requires fine-tuning and analysis by domain experts before productionizing the system.
The KQnA system is found to give pertinent answers to the questions based on context. Its performance is dependent on two factors:
The quality of the data ingested
The framing of the question (it may take a few attempts to phrase the question in the right way to get the answer you are looking for)
The system can be enhanced by adding more data from PDF documents and websites to the database. Further, if a labeled dataset for climate change becomes available in the future, the model can be fine-tuned to better rank documents and extract the right answers from the database in the context of climate change and/or Nature-based Solutions.