Topic Analysis to Identify and Classify Environmental Policies in LATAM

By Gijs van den Dool, Galina Naydenova, and Ann Chia


In an 8-week project, 50 technology changemakers from Omdena embarked on a mission to find needles in an online haystack. The project showed that Natural Language Processing (NLP) can efficiently point to where these needles are hiding, especially when there are (legal) language barriers and different interpretations between countries, regions, and governmental institutes.




The World Resources Institute (WRI) identified the problem and asked Omdena to help solve it. The project was hosted on Omdena's platform to create a better understanding of the current situation regarding enabling policies through NLP techniques like topic analysis. Policies are one of the tools decision-makers can use to improve the environment, but often it is not known which policies and incentives are in place, and which department is responsible for the implementation.

Understanding the effect of the policies involves reading and topic analysis of thousands of pages of documentation (legislation) across multiple sectors. It is precisely in this area where Natural Language Processing (NLP) can help, and assist, in the processing of policy documents, highlighting the essential documents and parts of documents, and identifying which areas are under/over-represented. A process like this will also promote the knowledge sharing between stakeholders, and enable rapid identification of incentives, disincentives, perverse incentives, and misalignment between policies.


Problem Statement

This project aimed to identify economic incentives for forest and landscape restoration using an automated approach, helping (for a start) policymakers in Mexico, Peru, Chile, Guatemala, and El Salvador to make data-driven choices that positively shape their environment.

The project focused on three objectives:

  • Identifying which policies relate to forest and landscape restoration using topic analysis
  • Detecting the financial and economic incentives in the policies via topic analysis
  • Creating a visualization that clearly shows the relevance of policies to forest and landscape restoration

This was achieved through the following pipeline, demonstrated through Figure 1 below:


Figure 1: NLP Pipeline


The Natural Language Processing (NLP) Pipeline

The web scraping process consisted of two approaches: the scraping of official policy databases, and Google Scraping. This allowed the retrieval of virtually all official policy documents from the five listed countries roughly between 2016 and 2020. The scraping results were then filtered further by relevance to landscape restoration, and the final text metadata of each entry was then stored on PySQL. Thus, we were able to build a comprehensive database of policy documents for use further down the pipeline.

Text preprocessing converted the retrieved documents from a human-readable form to a computer-readable form. Namely, policy documents were converted from pdf to txt, with text contents tokenized, lemmatized, and further processed for use in the subsequent NLP models.

NLP modeling involved the use of Sentence-BERT (SBERT) and LDA topic analysis. SBERT was used to build a search engine that parses policy documents and highlights relevant text segments that match the given input search query. The LDA model was used for topic analysis, which will be the focus of this economic policies analysis article.

Finally, the web scraping results, the SBERT search engine, and, in the future, the LDA model outputs would be combined and presented in an interactive web app, making the results more accessible to a non-technical audience.



Applications for Natural Language Processing

All countries create policies, plans, or incentives to manage land use and the environment as part of the decision-making process. Governments are responsible for controlling the effects of human activities on the environment, particularly through measures that are designed to prevent or reduce harmful effects of human activities on ecosystems without having an unacceptable impact on humans. This policy-making can result in the creation of thousands of documents. The idea is to extract the economic incentives for forest and landscape restoration from the available (online) policy documents, and to use topic analysis to get a better understanding of what kind of topics are addressed in these policies.

We developed a two-step approach to solving this problem: the first step selects the documents that are most closely related to reforestation in a general sense, and the second step points out the segments of those documents stating economic incentives. To mark which policies relate to forest and landscape restoration, we use a scoring technique (SBERT) to find the similarity between the search statement and sentences in a document, and a topic modeling technique (LDA) to pick out the parts of a document and create a better understanding of what kind of topics are addressed in these policies.
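The similarity scoring step can be illustrated with a minimal sketch. It assumes every sentence has already been turned into an embedding vector by a sentence encoder such as SBERT; the three-dimensional vectors, the example sentences, and the function names are invented for illustration and are not the project's data or code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_sentences(query_vec, sentence_vecs):
    """Return (sentence, score) pairs sorted from most to least similar."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in sentence_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; a real encoder outputs hundreds of dimensions.
query = [0.9, 0.1, 0.0]  # stands in for a query about reforestation incentives
sentences = {
    "Subsidies are granted for planting native trees.": [0.8, 0.2, 0.1],
    "The ministry publishes its annual budget in March.": [0.1, 0.9, 0.2],
    "Fishing permits are renewed every two years.": [0.0, 0.2, 0.9],
}

ranking = rank_sentences(query, sentences)
for sentence, score in ranking:
    print(f"{score:.3f}  {sentence}")
```

Sentences whose vectors point in nearly the same direction as the query score close to 1, so the top of the ranking is where a reader would look first.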



Analyzing the Policy Fragments with Sentence-BERT (SBERT)

To analyze all the available documents, and to identify which policies relate to Forest and Landscape Restoration, the documents are broken down into manageable parts and translated to one common language.

How can we compare different documents written in different languages and using specific words in each language?

The Multilingual Universal Sentence Encoder (MUSE) is one of the few algorithms specially designed to solve this problem. The model is simultaneously trained on a question answering task, a translation ranking task, and a natural language inference task (determining the logical relationship between two sentences). The translation task allows the model to map 16 languages (including Spanish and English) into a common space; this is a key feature that allowed us to apply it to our Spanish corpus.

The modules in this project are trained on the Spanish language. Due to the modular nature of the infrastructure, this language can easily be switched back to SBERT's native language (English). As a result, although this project works with a database of policy documents in Spanish, the approach will work with any language base (Figure 2).



Figure 2: Visualisation of SBERT model, in Spanish.



Analyzing the Policy Landscape

Collecting all available online policies, by web scraping, in a country can result in a database of thousands of documents, and millions of text fragments, all contributing to the policy landscape in the country or region.

When we are faced with thousands of potentially important documents, where do we start from?

We have several options to solve this problem, for example, we can select a couple of documents and start from there. Of course, we can read the abstract if one such exists, but in real life, we may not be that lucky.

Another approach is the bag-of-words algorithm, a simple technique that counts the frequency of the words in a text, allowing us to deduce the content of the text from the highest-ranking words. In this project we used CountVectorizer from sklearn to get the document-term matrix, which can then be displayed in a word cloud (using Wordcloud) for an easy, one-look summary of the document, like the one below.

This way we can get a quick answer to the question “What is the document about?”.
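As a rough illustration of the counting idea, the sketch below ranks the most frequent words of a short text using only the Python standard library. It is a stand-in for the CountVectorizer step, not the project's code; the sample text and the tiny stop-word list are invented.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "for", "is", "are"}

def top_words(text, n=3):
    """Count lowercase word frequencies, skip stop words, return the n most common."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample text standing in for a policy document.
sample = (
    "The program pays farmers to plant trees. "
    "Farmers receive payments for trees planted in degraded forest land, "
    "and the forest program is funded for six years."
)

print(top_words(sample, n=4))
```

The words and counts returned here are what a word cloud would draw, with the font size proportional to the count.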

However, faced with thousands of documents, it is impractical to do word clouds for them individually. This is where topic modeling comes in handy. Topic Modeling is a technique to extract the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling.

The LDA model is a topic model introduced by David Blei, Andrew Ng, and Michael Jordan in 2003. It is a generative model for text and other forms of discrete data that generalizes and improves upon earlier models, such as Bayes, unigram, and N-gram models.

Here’s how it works: Consider a corpus that comprises a collection of M documents, with each document formed by a selection of words (w1, w2, …, wi, …, wn). Additionally, each word is assigned to one of the topics in the collection of topics (z1, z2, …, zi, …, zk). By estimating the model parameters, namely the per-document topic distributions and the per-topic word distributions, we can calculate the probabilities with which certain words are associated with certain topics. Then, we can generate a distribution of words for each topic.
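In the standard formulation of LDA, this generative story corresponds to the following joint distribution for a single document of n words. The symbols θ (the document's topic mixture), α (the Dirichlet prior on θ), and β (the per-topic word probabilities) do not appear in the text above and are introduced here as conventional notation; in this formula z_i denotes the topic assigned to the i-th word w_i.

```latex
p(\theta, z, w \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{i=1}^{n} p(z_i \mid \theta)\, p(w_i \mid z_i, \beta)
```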

The LDA package outputs models for different numbers of topics (k), each with a topic coherence score, a rough guide to how good a given topic model is.


Figure 4: Coherence score vs. Number of topics


In this case, we picked the model that gives the highest coherence value without producing too few topics, which would not be granular enough, or too many, which would be difficult to interpret. K = 12 marks the point of a rapid increase in topic coherence, a usual sign of meaningful and interpretable topics.
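One way to read this selection rule is "take the highest coherence among the values of k that are still practical to interpret". The sketch below assumes that reading; the coherence scores and the allowed range are hypothetical and are not the values behind Figure 4.

```python
def pick_num_topics(coherence_by_k, min_k, max_k):
    """Return the k in [min_k, max_k] with the highest coherence score."""
    candidates = {k: c for k, c in coherence_by_k.items() if min_k <= k <= max_k}
    if not candidates:
        raise ValueError("no candidate k inside the allowed range")
    return max(candidates, key=candidates.get)

# Hypothetical coherence scores keyed by number of topics.
coherence_by_k = {2: 0.31, 4: 0.35, 8: 0.38, 12: 0.47, 16: 0.46, 20: 0.45, 40: 0.49}

best_k = pick_num_topics(coherence_by_k, min_k=5, max_k=20)
print(best_k)
```

Bounding k keeps a marginally higher score at a very large k from winning when the resulting topics would be too fragmented to label.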

For each topic, we have a list of the highest-frequency words constructing the topic, and we can see some overarching themes appearing. Naming the topic is the next step, with the explicit caveat that setting the topic name is massively subjective, and the assistance of a subject matter expert is advisable. The knowledge of topics, and the keywords, is necessary because the topic should reflect the different aspects of the issues within the study or problem. For example, forest restoration can be seen as operating in the intersection of the following themes, defined by the LDA. Below is an example of a model with 12 topics, which happened to be the one with the most coherence, and the subjectively determined Topic Labels (Table 1).


Table 1. Topic labels (12) and their respective keywords in the selected LDA model


We can see that one of the topics, “Forestry and Resources”, reflects closely the topics we are interested in, so the documents within it may be of particular relevance. The example document we saw before, “Sembrando Vida”, was assigned topic 8: “Development”, which is what is expected from a document outlining the details of a broad incentive program. Some of the topics (e.g. Environmental, Agriculture) are related to the narrow topic of interest, whereas others (e.g. Food Production) are more on the periphery, and documents with this topic can be put aside for the time being. Thus topic modeling allows sifting the wheat from the chaff and zooming straight into the more relevant documents.

The challenge of LDA is how to extract good quality topics that are clear, segregated, and meaningful. This depends heavily on the quality of text preprocessing and the strategy of finding the optimal number of topics, as well as on subject knowledge. Being familiar with the context and themes, as well as with different types of documents, is essential for this. The topic model can then be followed up with data visualizations and further processing, such as comparison, identifying conflicts between ministries, tracking changes of theme over time, and zooming into individual documents.




The LDA process results in a table of topics defined by user-generated tags, and this table can be used to create a heat map (showing the frequency of the mentioning of a topic by country) and used for further evaluation of how, for example, the policies are differentiating between topics and regions; this process is illustrated in Figure 5.
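The step from a table of tagged documents to heat-map values is essentially a count per country and topic. The sketch below shows that aggregation with invented (country, topic) pairs; the function name and the data are hypothetical, and the plotting itself is left out.

```python
from collections import Counter

def topic_frequency_table(records, countries, topics):
    """Build a country-by-topic matrix of document counts from (country, topic) pairs."""
    counts = Counter(records)
    return [[counts[(country, topic)] for topic in topics] for country in countries]

# Invented (country, dominant topic) pairs standing in for the LDA output.
records = [
    ("Mexico", "Territorial"), ("Mexico", "Territorial"), ("Mexico", "Development"),
    ("Peru", "Agriculture"), ("Peru", "Development"),
    ("Chile", "Agriculture"), ("Chile", "Agriculture"),
]
countries = ["Mexico", "Peru", "Chile"]
topics = ["Agriculture", "Development", "Territorial"]

table = topic_frequency_table(records, countries, topics)
for country, row in zip(countries, table):
    print(country, row)
```

Each row of the matrix becomes one row of the heat map, with the cell color driven by the count.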



Figure 5: LDA model visualization



Heat maps

Based on this, the following visualization is generated (Figure 6). The horizontal axis contains the different topic labels in Table 1, while the vertical axis lists three countries: Mexico, Peru, and Chile. The heat map gives us insights into the different levels of categorical policy present in the three countries; for instance, territorial-related policy is widely prevalent in Mexico, but not adopted widely in Chile or Peru.

This allows policymakers to observe the decisions made by other countries and how they compare to their local administration, enabling them to make better-informed choices in domestic policy that are supported by data-driven evidence.


Figure 6: Heatmap displaying the frequency of appearance of LDA-defined policy topics by country



Next Steps

A valuable further development of topic analysis is to display policy topics by originator (ministries, etc.) to identify possible overlaps and conflicts, and to display changes of topic in legislation and shifting focus over time. Going further into the documents, LDA can also be used to map out the topics in the different paragraphs, sifting the specific from the generic information and identifying paragraphs of particular relevance. By zooming into specific documents, and then into specific document paragraphs, LDA is an efficient and flexible solution when faced with a huge volume of unclassified documents.



Conclusion: Topic Analysis for Policies

Finding needles in an online haystack is possible, especially with the help of the tools discussed, starting from a collection of web scraped documents, going through a data engineering process to clean up the found documents, and using the Latent Dirichlet Allocation (LDA) method to structure the documents, and fragments, by topics.

The data view by topic is a powerful way to see directly where what kind of policy is most dominant, and this information can be used to refine the search further or assist the policy-makers in defining the most efficient use of policies to create an environment where new policies are contributing to Forest and Landscape Restoration.

In the visualization space, possible enhancements include identifying overlaps and conflicts between government entities, highlighting the active policy areas, and displaying financial incentive information and projections.

In summary, the use of LDA is a promising way to navigate through complex environmental legislation systems and to retrieve relevant information from a vast compilation of legal text, from different sources, in multiple languages, and of varying quality.

Machine Learning For Rooftop Detection and Solar Panel Installment

By Harshita Chopra


Solar energy is a promising and freely available resource for managing the forthcoming energy crisis, without hurting the environment. Unlike conventional fossil fuels, it won’t run out anytime soon.

Fact — There’s enough solar energy hitting the Earth every hour to meet all of humanity’s power needs for an entire year.

Let’s face it, what’s cooler than the sun powering your home? And that is quite literally true.

Fun fact — Solar panels also act as “roof shades” to keep buildings cool. They absorb the sun’s rays, directing them away from the roof, whereas a roof without panels would allow heat to penetrate into the building.

As people around the world look for ways to “go green” and protect the earth, solar panels provide an excellent option. But the utility industry needs smart systems that can help improve the integration of renewables in an effective way.

Solar AI, a Singapore-based startup incubated as part of ENGIE Factory, collaborated with Omdena on a mission to hyper-scale the deployment of distributed solar and the transition towards 100% renewables by modernizing the way rooftop solar is sold.


The problem statement

The rooftop solar assessment process can be time-consuming and expensive, taking anywhere from 1 hour to 2 full days to calculate the solar potential of each rooftop. In the solar industry, this has resulted in the cost of sales taking up to 30–40% of total project costs, significantly worsening the unit economics of solar projects.

By automating these evaluations with Artificial Intelligence, Solar AI aims to drastically reduce the cost of this process and make this information easily available for both building owners as well as solar energy companies.

So we had a mission to accomplish within eight weeks:

Combine multiple machine learning models that automatically identify rooftops and detect rooftop features, such as obstacles, material, slopes, and area, from high-resolution satellite imagery.


The solution

Solar AI provided us with high-resolution satellite imagery in Singapore. With these huge and detailed images in hand, we had a list of tasks to perform.

The 2 GB size of one image fascinated me enough to begin with pre-processing: creating thousands of smaller tiles out of it, using just a few lines of code bundled up in a function.
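The original tiling function is not shown, so the following is only a sketch of the idea: computing the pixel windows that cut a large image into fixed-size tiles. The function name and the image and tile dimensions are hypothetical; reading and writing the actual raster data is omitted.

```python
def tile_windows(width, height, tile_size):
    """Yield (x, y, w, h) windows that cover a width x height image without overlap.

    Tiles on the right and bottom edges are smaller when the image size
    is not a multiple of tile_size.
    """
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            yield (x, y, min(tile_size, width - x), min(tile_size, height - y))

# Hypothetical image and tile dimensions in pixels.
windows = list(tile_windows(width=2500, height=1200, tile_size=1000))
print(len(windows))
print(windows[-1])
```

Each window can then be passed to whichever raster library is used to read just that part of the image, so the full 2 GB file never has to sit in memory at once.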


Snapshot of a few tiles created from the huge image



The power of annotations

Even the most technically advanced algorithms cannot address or solve a problem without the right data. We know having access to data is quite valuable, but having access to data with a learnable structure is the biggest competitive advantage nowadays. That’s the power of data annotation.


A quirky image with hundreds of rooftops


Our wonderful team of collaborators volunteered to annotate thousands of rooftops in 500+ tiles. We pulled off a smarter method of annotating the buildings, by mapping the OSM data on the raster layer (TIF format tile) in the QGIS software.

The consistent determination of the annotators resulted in a perfectly labeled dataset for Supervised Machine Learning algorithms.

The food for models was ready!


Scanning images of rooftops via machine learning 

The major task was to detect rooftops in a given image using machine learning & computer vision models.

Not just this, we also had to determine their type/structure such as Flat-roof, Hip-roof, Shed-roof, or any other. Hence, this became an instance segmentation problem.

We tried out a number of models such as Mask R-CNN, YOLACT (You Only Look At CoefficienTs), Detectron2, and more. After training on different batches of annotations as they were delivered, we kept seeing improvement in results. Eventually, the best performing model was selected to go ahead with other tasks.




Zooming in on your rooftops 

Now that we had the bounding boxes and mask contours of various rooftops, trapped properly in a data frame, we were ready to start the analysis of individual rooftops. After extracting and zooming into masks of each detected roof, we needed the following attributes:

  • Obstacle Detection
  • Area of the roof (excluding obstacles)
  • The material of the roof
  • Detecting faces of Hip/Shed roof
  • The orientation of individual slopes


Calculating “Area Available” for panels

For the calculation of a rooftop’s effective area, the area occupied by obstacles has to be subtracted from the whole. So that gives rise to the task of identifying obstacles.

Due to the lack of labeled data for obstacle detection, our genius team shifted their thought process towards an unsupervised approach of edge detection and creating contours. By setting a threshold on contour colors, obstacles were distinguished from plain area to a great extent.

An effective area was therefore mathematically calculated as the difference between total area and obstacle area in terms of pixels, which was then converted into meters squared.
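The arithmetic behind this step can be sketched as follows. The pixel counts and the ground resolution are hypothetical, and the function is an illustration rather than the team's implementation.

```python
def effective_area_m2(roof_pixels, obstacle_pixels, meters_per_pixel):
    """Usable roof area in square meters from pixel counts and ground resolution."""
    if obstacle_pixels > roof_pixels:
        raise ValueError("obstacle area cannot exceed roof area")
    usable_pixels = roof_pixels - obstacle_pixels
    # Each pixel covers meters_per_pixel x meters_per_pixel on the ground.
    return usable_pixels * meters_per_pixel ** 2

# Hypothetical numbers: a 12,000-pixel roof mask, 1,500 obstacle pixels, 0.5 m resolution.
area = effective_area_m2(roof_pixels=12_000, obstacle_pixels=1_500, meters_per_pixel=0.5)
print(area)
```

Note that this measures the roof's footprint as seen from above; a sloped roof has a larger true surface than its footprint.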


Roof Materials


Quality of the roof

Because solar panels are installed on your home’s rooftop, it is important to understand how different roof materials may influence this process.

Generally, they range from concrete, metal, roof tiles, eternit to composite shingles.

This task also required a labeled dataset, so I decided to jump in to find a solution where we could skip annotations per se. Using OpenStreetMap, we created a small but fruitful dataset labeled with roofs and their materials. A deep learning-based image classification model was then created, which identifies the material of the roof and gives the probability scores for each class.


Which way do solar panels face?




Orientation, or the direction your roof faces, may have a large impact on how productive roof-mounted solar panels will be. Your system will generate the most energy when it gets as many hours of light exposure per day as possible. In most places, the ideal power generation angle is 30–40 degrees.




The task of identifying many faces of a hip roof was a challenging one. After multiple attempts with different approaches, the task team managed to create an appreciable mathematical model that could identify the facets as well as the angles they’re inclined on, using some constructive utility functions. The output was the orientation of different roof facets.


Conclusion: Putting it all together

The outputs of all the tasks were captured systematically in a data frame. Keeping in mind that we computed various attributes based on pixel values, we converted them back to geographic coordinates at the end. This allowed us to project the data on satellite images of a particular CRS (Coordinate Reference System).
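For a north-up raster without rotation, the conversion from pixel positions back to map coordinates is a simple affine mapping. The sketch below assumes that case; the origin and pixel size values are hypothetical, and real georeferencing parameters would come from the image metadata.

```python
def pixel_to_geo(col, row, origin_x, origin_y, pixel_width, pixel_height):
    """Map a pixel (col, row) to map coordinates for a north-up raster without rotation.

    origin_x and origin_y are the coordinates of the raster's top-left corner.
    pixel_height is negative because row numbers grow downward while
    northing grows upward.
    """
    x = origin_x + col * pixel_width
    y = origin_y + row * pixel_height
    return x, y

# Hypothetical georeferencing values in a projected CRS with meter units.
corner = pixel_to_geo(col=100, row=200, origin_x=370000.0, origin_y=150000.0,
                      pixel_width=0.5, pixel_height=-0.5)
print(corner)
```

Applying the same mapping to every vertex of a roof mask yields a polygon that can be drawn on top of the satellite image in the chosen CRS.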

After merging everything into an automated pipeline and many rounds of reviews, evaluation, fixing bugs, and testing — our software was ready to be delivered.

Solar AI is extremely happy with the final deliverables, and this is something that makes the experience even more worthwhile. As CEO Bolong Chew puts it:

“This work went beyond our wildest expectations and we’re extremely happy. We set the bar really high and the team delivered. It was an amazing experience.”

Augmenting Public Safety Through AI and Machine Learning

In this demo day, we took a close look at the tremendous potential AI offers for making communities safer, by helping to reduce, prevent, and respond to crimes. When it comes to public safety, it is often critical to act quickly. AI technologies can supplement the work of people, taking on monotonous and time-consuming tasks that would be impossible for humans to do effectively. Natural language processing can read and analyze public communications and news reports to detect potential problem areas and get ahead of violence. Of course, this work must be done responsibly and ethically.

Sharing her perspective on the impact that AI can have in keeping people safe was an expert in the field, ElsaMarie D’Silva, the Founder & CEO of the Red Dot Foundation. The Red Dot Foundation’s award-winning platform Safecity crowdsources personal experiences of sexual violence and abuse in public spaces. ElsaMarie is listed as one of BBC Hindi’s 100 Women, and her work has been recognized by numerous UN organizations and the SDG Action Festival.

To go a little deeper into the application of AI for public safety, we shared Omdena projects that took innovative approaches to make communities safer.


Case Study 1: Preventing sexual harassment through a safe-path finder algorithm

“UN Women states that 1 in 3 women face some kind of sexual assault at least once in their lifetime.”

With the first case study, the Omdena team drew upon Safecity’s crowdsourced data about sexual harassment in public spaces and leveraged open-source data to build heatmaps and calculate safe routes through major cities in India. Part of the solution is a sexual harassment category classifier with 93 percent accuracy and several models that predict places with a high risk of sexual harassment incidents to suggest safe routes.




Case Study 2: Understanding gang violence patterns and actors through Twitter analysis

Our team worked in partnership with Voice 4 Impact, an award-winning NGO whose solution to violence in our communities addresses the questions people worldwide are asking: “How do we keep missing the signs?”

The Omdena team made use of natural language processing techniques, which are AI techniques that analyze text to understand what is being communicated. Machine learning algorithms were used to understand gang language, and AI models were built to detect violent messages on Twitter, without profiling. The aim is to predict, and ultimately prevent, gang violence.




Case Study 3: Analyzing Domestic Violence through Natural Language Processing (NLP)

Finally, we presented Omdena’s work to uncover domestic violence in India hidden due to COVID lockdowns. This work is part of a project with the award-winning Red Dot Foundation and Omdena’s collaborative platform to build solutions to better understand domestic violence and online harassment patterns during COVID-19. The project used natural language processing techniques with social media, government reports, and other text content to create a dataset with which Safecity could mobilize local efforts to protect and support domestic violence victims.




Matching Land Conflict Events to Government Policies via Machine Learning | World Resources Institute

By Laura Clark Murray, Nikhel Gupta, Joanne Burke, Rishika Rupam, Zaheeda Tshankie


Project Overview

This project aimed to provide a proof-of-concept machine-learning-based methodology to identify land conflict events in a geography and match those events to relevant government policies. The overall objective is to offer a platform where policymakers can be made aware of land conflicts as they unfold and identify existing policies that are relevant to the resolution of those conflicts.

Several Natural Language Processing (NLP) models were built to identify and categorize land conflict events in news articles and to match those land conflict events to relevant policies. A web-based tool that houses the models allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The geographic scope of the project was limited to India, which has the most environmental (land) conflicts of all countries on Earth.



Degraded land is “land that has lost some degree of its productivity due to human-caused process”, according to the World Resources Institute. Land degradation affects 3.2 billion people and costs the global economy about 10 percent of the global gross product each year. While dozens of countries have committed to restore 350 million hectares of degraded land, land disputes are a major barrier to effective implementation. Without streamlined access to land use rights, landowners are not able to implement sustainable land-use practices. In India, where 21 million hectares of land have been committed to restoration, land conflicts affect more than 3 million people each year.

AI and machine learning offer tremendous potential to not only identify land-use conflict events but also match suitable policies for their resolution.


Data Collection

All data used in this project is in the public domain.

  • News Article Corpus: Contained 65,000 candidate news articles from Indian and international newspapers from the years 2008, 2017, and 2018. The articles were obtained from the Global Database of Events, Language, and Tone (GDELT) Project, “a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages.” All the text was either originally in English or translated to English by GDELT.
  • Annotated Corpus: Approximately 1,600 news articles from the full News Article Corpus were manually labeled and double-checked as Negative (no conflict news) and Positive (conflict news).
  • Gold Standard Corpus: An additional 200 annotated positive conflict news articles, provided by WRI.
  • Policy Database: Collection of 19 public policy documents related to land conflicts, provided by WRI.




Text Preparation


In this phase, the articles of the News Article Corpus and policy documents of the Policy Database were prepared for the natural language processing models.

The articles and policy documents were processed using SpaCy, an open-source library for natural language processing, to achieve the following:

  • Tokenization: Segmenting text into words, punctuation marks, and other elements.
  • Part-of-speech (POS) tagging: Assigning word types to tokens, such as “verb” or “noun”
  • Dependency parsing: Assigning syntactic dependency labels to describe the relations between individual tokens, such as “subject” or “object”
  • Lemmatization: Assigning the base forms of words, regardless of tense or plurality
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
  • Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies, or locations.


Coreference resolution was applied to the processed text data using Neuralcoref, which is based on an underlying neural net scoring model. With coreference resolution, all common expressions that refer to the same entity were located within the text. All pronominal words in the text, such as her, she, he, his, them, their, and us, were replaced with the nouns to which they referred.


For example, consider this sample text:

“Farmers were caught in a flood. They were tending to their field when a dam burst and swept them away.”

Neuralcoref recognizes “Farmers”, “they”, “their” and “them” as referring to the same entity. The processed sentence becomes:

“Farmers were caught in a flood. Farmers were tending to their field when a dam burst and swept farmers away.”


Coreference resolution of sample sentences



Document Classification


The objective of this phase was to build a model to categorize the articles in the News Article Corpus as either “Negative”, meaning they were not about conflict events, or “Positive”, meaning they were about conflict events.

After preparation of the articles in the News Article Corpus, as described in the previous section, the texts were then prepared for classification.

First, an Annotated Corpus was formed to train the classification model. A 1,600-article subset of the News Article Corpus was manually labeled as “Negative” or “Positive”.

To prepare the articles in both the News Article Corpus and Annotated Corpus for classification, the previously pre-processed text data of the articles was represented as vectors using the Bag of Words approach. With this approach, the text is represented as a collection, or “bag”, of the words it contains along with the frequency with which each word appears. The order of words is ignored.

For example, consider a text article comprised of these two sentences:

Sentence 1: “Zahra is sick with a fever.”

Sentence 2: “Arun is happy he is not sick with a fever.”

This text contains a total of ten distinct words: “Zahra”, “is”, “sick”, “happy”, “with”, “a”, “fever”, “not”, “Arun”, “he”. Each sentence in the text is represented as a vector, where each index in the vector indicates how often one particular word appears in that sentence, as illustrated below.




With this technique, each sentence is represented by a vector, as follows:

“Zahra is sick with a fever.”

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“Arun is happy he is not sick with a fever.”

[0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
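The two vectors above can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the counting step only, with a hypothetical helper name; it is not the vectorizer used in the project.

```python
import re

def bag_of_words(sentence, vocabulary):
    """Return word counts for the sentence, in vocabulary order."""
    tokens = re.findall(r"\w+", sentence)
    return [tokens.count(word) for word in vocabulary]

# Vocabulary in the same order as listed in the text.
vocabulary = ["Zahra", "is", "sick", "happy", "with", "a", "fever", "not", "Arun", "he"]

vec1 = bag_of_words("Zahra is sick with a fever.", vocabulary)
vec2 = bag_of_words("Arun is happy he is not sick with a fever.", vocabulary)
print(vec1)
print(vec2)
```

The second vector has a 2 in the position for “is” because the word occurs twice in that sentence; word order plays no role in either vector.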

With the Annotated Corpus vectorized with this technique, the data was used to train a logistic regression classifier model. The trained model was then used with the vectorized data of the News Article Corpus, to classify each article into Positive and Negative conflict categories.

The accuracy of the classification model was measured by looking at the percentage of the following:

  • True Positive: Articles correctly classified as relating to land conflicts
  • False Positive: Articles incorrectly classified as relating to land conflicts
  • True Negative: Articles correctly classified as not being related to land conflicts
  • False Negative: Articles incorrectly classified as not being related to land conflicts


The “precision” of the model indicates how many of the articles classified as being about land conflict were actually about land conflict. The “recall” of the model indicates how many of the articles that were actually about land conflict were classified as such. An F1-score, the harmonic mean of precision and recall, was calculated from the two.
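The relationship between the four counts and the three scores can be sketched as follows. The `classification_scores` helper and the counts passed to it are hypothetical and chosen only to illustrate the arithmetic; they are not the project's results.

```python
def classification_scores(tp, fp, tn, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    `tn` is accepted for completeness but does not enter any of the
    three scores.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts, for illustration only.
precision, recall, f1 = classification_scores(tp=90, fp=10, tn=95, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```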

The trained logistic regression model successfully classified the news articles with precision, recall, and f1-score of 98% or greater. This indicates that the model produced a low number of false positives and false negatives.


Classification report using a test dataset and logistic regression model



Categorize by Land Conflicts Events

The objective of this phase was to build a model to identify the set of conflict events referred to in the collection of positive conflict articles and then to classify each positive conflict article accordingly.

A word cloud of the articles in the Gold Standard Corpus gives a sense of the content covered in the articles.

A topic model was built to discover the set of conflict topics that occur in the Positive conflict articles. We chose a semi-supervised approach to topic modeling to maximize the accuracy of the classification process. We chose to use CorEx (Correlation Explanation), a semi-supervised topic model that allows domain knowledge, as specified by relevant keywords acting as “anchors”, to guide the topic analysis.

To align with the Land Conflicts Policies provided by WRI, seven relevant core land conflicts topics were specified. For each topic, correlated keywords were specified as “anchors” for the topic.




The trained topic model provided 3 words for each of the seven topics:

  • Topic #1: land, resettlement, degradation
  • Topic #2: crops, farm, agriculture
  • Topic #3: mining, coal, sand
  • Topic #4: forest, trees, deforestation
  • Topic #5: animal, attacked, tiger
  • Topic #6: drought, climate change, rain
  • Topic #7: water, drinking, dams

The resulting topic model is 93% accurate. This scatter plot uses word representations to provide a visualization of the model’s classification of the Gold Standard Corpus and hand-labeled positive conflict articles.


Visualization of the topic classification of the Gold Standard Corpus and Positive Conflict Articles



Identify the Actors, Actions, Scale, Locations, and Dates

The objective of this phase was to build a model to identify the actors, actions, scale, locations, and dates in each positive conflict article.

Typically, names, places, and famous landmarks are identified through Named Entity Recognition (NER). Recognition of such standard entities is built into spaCy’s NER component, which our model used to detect the locations and dates in the positive conflict articles. The specialized content of the news articles required further training with “custom entities”, those particular to this context of land conflicts.

All the positive conflict articles in the Annotated Corpus were manually labeled for “custom entities”:

  • Actors: Such as “Government”, “Farmer”, “Police”, “Rains”, “Lion”
  • Actions: Such as “protest”, “attack”, “killed”
  • Numbers: Number of people affected by a conflict

This example shows how this labeling looks for some text in one article:



These labeled positive conflict articles were used to train our custom entity recognizer model. That model was then used to find and label the custom entities in the news articles in the News Article Corpus.
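As an illustration of what such labels can look like in code, the sketch below stores annotations as character-offset spans in a (text, {"entities": [...]}) pair, a layout that spaCy's training utilities accept. The sentence, the label names ACTOR, ACTION, and NUMBER, and the `make_example` helper are all hypothetical; the project's actual annotation files are not shown in this article.

```python
def make_example(text, labeled_phrases):
    """Build a (text, {"entities": [...]}) pair from (phrase, label) tuples.

    Offsets are looked up with str.find, so each phrase is matched at its
    first occurrence in the text.
    """
    entities = []
    for phrase, label in labeled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": entities}

# Hypothetical sentence and labels, for illustration only.
text = "Farmers staged a protest after 200 families were displaced."
example = make_example(
    text,
    [("Farmers", "ACTOR"), ("protest", "ACTION"), ("200", "NUMBER")],
)

for start, end, label in example[1]["entities"]:
    print(label, repr(text[start:end]))
```

Computing the offsets from the phrases, rather than typing them by hand, avoids off-by-one errors that would otherwise misalign the entity spans.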


Match Conflicts to Relevant Policies

The objective of this phase was to build a model to match each processed positive conflict article to any relevant policies.

The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.


Excerpt of a 2001 policy document related to agriculture



A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors: the closer the value is to 1, the more similar the two documents. Using the vectorized text of the articles and the policy documents that had been generated in the previous phases as described above, the model generated a collection of matches between articles and policies.
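A minimal sketch of the metric itself, using only the standard library. The vectors are hypothetical word counts over a shared vocabulary, not data from the project, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an empty document matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over a shared vocabulary.
article = [2, 1, 0, 3]
policy_a = [4, 2, 0, 6]  # same word proportions as the article, twice as long
policy_b = [0, 0, 5, 0]  # no words in common with the article

score_a = cosine_similarity(article, policy_a)
score_b = cosine_similarity(article, policy_b)
print(score_a, score_b)
```

Because only the direction of the vectors matters, a long policy document and a short news article with the same word proportions score close to 1, which is why the metric works irrespective of document size.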


Visualization of Conflict Event and Policy Matching

The objective of this phase was to build a web-based tool for the visualization of the conflict event and policy matches.

An application was created using the Plotly Python Open Source Graphing Library. The web-based tool houses the models and allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The map displays land conflict events detected in the News Article Corpus for the selected years and regions of India.

Conflict events are displayed as color-coded dots on a map. The colors correspond to specific conflict categories, such as “Agriculture” and “Environmental”, and actors, such as “Government”, “Rebels”, and “Civilian”.

In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.




By selecting a particular category from the right column, only those conflicts related to that category are displayed on the map. Here only the Agriculture-related subset of the events shown in the previous example is displayed.



News articles from the selected years and regions are displayed below the map. When a particular article is selected, the location of the event is shown on the map. The text of the article is displayed along with policies matched to the event by the underlying models, as seen in the example below of a 2018 agriculture-related conflict in the Andhra Pradesh region.



Here is a closer look at the article and matched policies in the example above.




Next Steps

This overview describes the results of a pilot project to use natural language processing techniques to identify land conflict events described in news articles and match them to relevant government policies. The project demonstrated that NLP techniques can be successfully deployed to meet this objective.

Potential improvements include refinement of the models and further development of the visualization tool. Opportunities to scale the project include building the library of news articles with those published from additional years and sources, adding to the database of policies, and expanding the geographic focus beyond India.

Opportunities to improve and scale the pilot project


  • Refine models
  • Further development of visualization tool


  • Expand library of articles with content from additional years and sources
  • Expand the database of policies
  • Expand the geographic focus beyond India



About the Authors

  • Laura Clark Murray is the Chief Partnership & Strategy Officer at Omdena. Contact:
  • Nikhel Gupta is a physicist, a Postdoctoral Fellow at the University of Melbourne, and a machine learning engineer with Omdena.
  • Joanne Burke is a data scientist with MUFG and a machine learning engineer with Omdena.
  • Rishika Rupam is a Data and AI Researcher with Tilkal and a machine learning engineer with Omdena.
  • Zaheeda Tshankie is a Junior Data Scientist with Telkom and a machine learning engineer with Omdena.


Omdena Project Team

Kulsoom Abdullah, Joanne Burke, Antonia Calvi, Dennis Dondergoor, Tomasz Grzegorzek, Nikhel Gupta, Sai Tanya Kumbharageri, Michael Lerner, Irene Nanduttu, Kali Prasad, Jose Manuel Ramirez R., Rishika Rupam, Saurav Suresh, Shivam Swarnkar, Jyothsna sai Tagirisa, Elizabeth Tischenko, Carlos Arturo Pimentel Trujillo, Zaheeda Tshankie, Gabriela Urquieta



This project was done in collaboration with Kathleen Buckingham and John Brandt, our partners with the World Resources Institute (WRI).



About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through global bottom-up collaboration. Omdena is a partner of the United Nations AI for Good Global Summit 2020.

Building a Risk Classifier for a PTSD Assessment Chatbot


Using MLflow to structure a machine learning project and support the backend of a PTSD risk classifier chatbot.



The Problem: Classification of Text for a PTSD Assessment Chatbot

The input

A text transcript similar to:


therapist and client conversation snapshot


The output

Low Risk -> 0 , High Risk -> 1

One of the requirements of this project was a productionized text classification model for PTSD risk that could communicate with a frontend, such as the chatbot app.

As part of the solution to this problem, we decided to explore the MLFlow framework.



MLflow is an open-source platform to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. It currently offers three components:

MLflow Tracking: Allows you to track experiments and projects.

MLflow Models: Provides a model format and framework to persist, version, and serialize models in multiple platform formats.

MLflow Projects: Provides a convention-based approach to setting up your ML project, so that it benefits as much as possible from the work the developer community puts into the platform.

The main benefits identified from my initial research were the following:

  • Works with any ML library and language
  • Runs the same way anywhere
  • Designed for small and large organizations
  • Provides a best-practices approach for your ML project
  • Serving layers (REST + batch) come almost for free if you follow the conventions



The Solution


The focus of this article is to show the baseline ML models and how MLflow was used for experiment tracking during training and for productionization of the text classification model.


Installing MLFlow

pip install mlflow


Model development tracking

The snippet below represents our cleaned and pretty data, after data munging:


snapshot of a table containing transcript_id, text, and label as the column headings


The gist below describes our baseline (dummy) logistic regression pipeline:


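from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

train, test = train_test_split(final_dataset,
                               random_state=42, test_size=0.33, shuffle=True)
X_train = train.text
X_test = test.text

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5,
                              norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')),
    # classifier step is cut off in the original listing; a default LogisticRegression is assumed
    ('clf', LogisticRegression()),
])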
The link to this code is given here.

One of the first useful things you can do with MLflow during model development is to log a model training run. You would log, for instance, an accuracy metric, and the generated model is also associated with this run.


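import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    # fit call reconstructed from the truncated first line of the original listing
    LogReg_pipeline.fit(X_train, train["label"])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    accuracy = accuracy_score(test["label"], prediction)
    mlflow.log_metric("model_accuracy", accuracy)
    mlflow.sklearn.log_model(LogReg_pipeline, "LogisticRegressionPipeline")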


The link to the code above is given here.


At this point, the model above is saved and reproducible if needed at any point in time.

You can spin up the MLFlow tracker UI so you can look at the different experiments:


$ mlflow ui -p 60000
[2019-09-01 16:02:19 +0200] [5491] [INFO] Starting gunicorn 19.7.1
[2019-09-01 16:02:19 +0200] [5491] [INFO] Listening at: (5491)
[2019-09-01 16:02:19 +0200] [5491] [INFO] Using worker: sync
[2019-09-01 16:02:19 +0200] [5494] [INFO] Booting worker with pid: 5494


The backend of the tracker can be either the local file system or a cloud storage service (S3, Google Cloud Storage, etc.). It can be used locally by one team member or shared across a team so that runs stay reproducible for everyone.

The image below shows a couple of model training runs together with the metrics and model artifacts collected:


Experiment Tracker in MLFlow screenshot

Sample of experiment tracker in MLFlow for Text Classification


Once your models are stored, you can always go back to a previous version of the model and re-run it based on the id of the artifact. The logs and metrics can also be committed to GitHub so that they are shared within a team and everyone has access to the different experiments and the resulting metrics.


MLFlow experiment tracker


Now that our initial model is stored and versioned, we can assess the artifact and the project at any point in the future. The integration with Sklearn is particularly good because the model is automatically pickled in a Sklearn-compatible format and a Conda file is generated. You could also have logged a reference (a URI and a checksum) to the data used to generate the model, or the data itself if it is of reasonable size (preferably when the information is stored in the cloud).


Setting up a training job

Once you are done with model development, you need to organize your project in a productionizable way.

The most basic component is the MLproject file. There are multiple options to package your project: Docker, Conda, or bespoke. We will use Conda for its simplicity in this context.


name: OmdenaPTSD

conda_env: conda.yaml

entry_points:
  main:
    command: "python"


The entry point specifies the command to run when the project is executed, in this case a training file.

The conda file contains a name and the dependencies to be used in the project:


name: omdenaptsd-backend
channels:
  - defaults
  - anaconda
dependencies:
  - python==3.6
  - scikit-learn=0.19.1
  - pip:
    - mlflow>=1.1


At this point you just need to run the project with the mlflow run command, pointing it at the project directory.


Setting up the REST API classifier backend

To set up a REST classifier backend you don’t need any job setup. You can use a persisted model from a Jupyter notebook.

To run a model you just need to run the models serve command with the URI of the saved artifact:


mlflow models serve -m runs://0/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline
2019/09/01 18:16:49 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2019/09/01 18:16:52 INFO mlflow.pyfunc.backend: === Running command 'source activate 
mlflow-483ff163345a1c89dcd10599b1396df919493fb2 1>&2 && gunicorn --timeout 60 -b -w 1 mlflow.pyfunc.scoring_server.wsgi:app'
[2019-09-01 18:16:52 +0200] [7460] [INFO] Starting gunicorn 19.9.0
[2019-09-01 18:16:52 +0200] [7460] [INFO] Listening at: (7460)
[2019-09-01 18:16:52 +0200] [7460] [INFO] Using worker: sync
[2019-09-01 18:16:52 +0200] [7466] [INFO] Booting worker with pid: 7466


A scalable backend server (running gunicorn) is now ready, with no code beyond your model training and the logging of the artifact following the MLflow packaging conventions. This frees machine learning engineering teams that want to iterate fast from the cumbersome initial infrastructure work of setting up a repetitive, boilerplate prediction API.

You can immediately start sending prediction requests to your server:


curl -H 'Content-Type: application/json' -d 
'{"columns":["text"],"data":[[" concatenated text of the transcript"]]}'


The smart thing here is that the MLflow scoring module uses the Sklearn model input (a pandas schema) as the spec for the REST API. Sklearn was the example used here, but MLflow also has bindings for H2O, Spark, Keras, TensorFlow, ONNX, PyTorch, and others. It infers the input from the model packaging format and hands the data to the scoring function. It is a very neat software engineering approach to a problem machine learning teams face every day, freeing engineers and scientists to innovate instead of writing repetitive boilerplate code.
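For a frontend written in Python, the same request can be sketched with the standard library. The `build_payload` and `score` helpers are hypothetical names, and the `url` argument is an assumption: use the address your model server reports at startup. The payload mirrors the columns/data layout of the curl call above; the exact format and endpoint path the scoring server accepts can differ between MLflow versions, so check the documentation for the version you run.

```python
import json
import urllib.request

def build_payload(transcript_text):
    """Wrap one transcript in the columns/data layout used in the curl call."""
    return {"columns": ["text"], "data": [[transcript_text]]}

def score(url, transcript_text):
    """POST the payload to a running model server and return the parsed reply.

    `url` is an assumption: pass the scoring address of your own server.
    """
    body = json.dumps(build_payload(transcript_text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the payload needs no server; score() is only defined, not called here.
payload = build_payload("concatenated text of the transcript")
print(json.dumps(payload))
```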

Going back to the Omdena challenge, this backend is available for the frontend team to connect the chatbot app to the risk classifier at the most convenient point in the conversation (most likely after a critical mass of open-ended questions).



More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.
