Applying Machine Learning to Predict Illegal Dumpsites

By Ramansh Sharma, Rosana de Oliveira Gomes, Simone Vaccari, Emma Roscow, and Prejith Premkumar


Just like any other day, we start our morning with a coffee and a snack to go from our favorite bakery. Later the same day, we check our mail, where we find letters, newspapers, magazines, and possibly a package that has just arrived. Finally, at night, after a rough week, we decide to go out for drinks with friends. Sounds like a pretty uneventful day, right?

Except that we produced lots of trash in the form of plastic, glass, paper, and more.

According to Eurostat, the average person in Europe produces more than 1.3 kg of waste per day (in Canada and the USA, it can exceed 2 kg). This adds up to nearly 500 kg of trash per person per year. Now imagine millions, even billions, of people doing the same. Every day!

To give you an even clearer perspective: less than 40% of all the waste produced in Europe is recycled, and the share is even lower across the other continents. Furthermore, it is estimated that 20% of all generated waste ends up in illegal dumpsites in Europe, and 50% in Africa.

TrashOut is an environmental project which aims to map and monitor illegal dumpsites around the world and to reduce waste generation by helping citizens recycle more. This is done through a mobile and web application that helps users locate and monitor illegal dumpsites, find the nearest recycling center or bin, join local green organizations, read sustainability-related news, and get notified about updates on their reports.

In this article, we discuss our analysis of illegal dumpsites across the world, at both local and global scales.


The problem


Photo by Ocean Cleanup Group on Unsplash


The problem statement for this project was to “build machine learning models on illegal dumpsites to see if there are any patterns that can help to understand what causes illegal dumping, predict potential dumpsites, and eventually how to avoid them”. We decided to tackle this wordy problem statement by dividing it into three manageable sub-tasks to be worked on throughout the duration of the project:

  • Sub-task 1.1: Spatial patterns of existing TrashOut dumpsites
  • Sub-task 1.2: Predict potential dumpsites using Machine Learning
  • Sub-task 1.3: Understanding patterns of existing dumpsites to prevent future illegal dumping



  • TrashOut: Reports on illegal dumpsites provided by users through the TrashOut mobile app. For each report, a number of features are recorded; the most relevant for this analysis were: location (latitude and longitude, city, country, and continent), date, picture, size, and type of waste.
  • OpenStreetMap (OSM): Geospatial dataset and information on the city road networks, including the type of roads (e.g. motorway, primary, residential, etc.)
  • Socioeconomic Data and Applications Center (SEDAC): Population density at 1km grid, from which we also calculated the population density gradient to account for population density in the neighboring cells
  • FourSquare: Information about nearby venues
  • World Bank Indicators, World Bank’s “What a Waste 2.0”, Eurostat, European Commission Directorate-General for Environment: Datasets for socio-economic indicators.
  • Non-dumpsites Control Dataset: we generated our own Control Dataset, which was required to train the model on where dumpsites do not occur. For every TrashOut dumpsite location, we selected a pseudo-random location 1 km away and assigned this as a potential non-dumpsite location.



The first challenge was to identify and extract meaningful information for the spatial analysis from the available datasets. Our assumption was that illegal dumpsites are more likely to occur in highly populated places, in proximity to main roads, and in proximity to venues of interest such as sports venues, museums, restaurants, etc. Based on this assumption, we used the available datasets to extract, for every TrashOut dumpsite as well as for every location of our Control Non-dumpsites, the 17 features described in Table 1:


Table 1: Datasets and APIs used to acquire different features for dumpsites. * For the control dataset, the source for Continent was the pycountry-convert library.


Sub-task 1.1: Finding existing dumpsites

City-based Analysis of Illegal Dumpsites

We performed an in-depth analysis focused on six shortlisted cities, chosen to represent different social statuses and geographical locations (so all continents were included) and based on the availability of a considerable number of TrashOut dumpsite reports. The cities analyzed were:

  • Bratislava, Slovakia (Europe)
  • Campbell River, British Columbia, Canada (North America)
  • London, UK (Europe)
  • Mamuju, Indonesia (Asia)
  • Maputo, Mozambique (Africa)
  • Torreón, Mexico (North America)

For the city-based analysis, we accessed the road network information from the OSM dataset by using the Python package OSMnx. This API allows easy interaction with the OSM data without needing to download it, which makes it very accessible in any location around the world. We structured the analysis in a Colab Notebook for consistency and analyzed the following features for each city: distance to three types of roads (motorway, main and residential), distance to the city center, population density, size, and type of waste.
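The distance-to-road features boil down to the great-circle distance between a report location and the nearest point of a road geometry returned by OSMnx. The computation itself can be sketched with a plain haversine helper; the report and road coordinates below are hypothetical stand-ins for what the project pulled from the OSM road network.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    R = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

def distance_to_nearest_road(point, road_points):
    """Distance (m) from a report location to the closest road vertex."""
    return min(haversine_m(point[0], point[1], lat, lon)
               for lat, lon in road_points)

# Hypothetical report in Bratislava and two road vertices
report = (48.1486, 17.1077)
roads = [(48.1500, 17.1100), (48.1400, 17.1200)]
print(round(distance_to_nearest_road(report, roads)))
```

In the actual pipeline the road vertices would come from the OSMnx road network for each city rather than being hard-coded.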


Results for Bratislava

The proportion of TrashOut dumpsites vs. Control Non-dumpsites in proximity to the nearest roads within 1 km is shown in Figure 1; the statistical assessment, however, was undertaken within 100 m using the two-proportion Z-test. The three graphs are generated for each road type (motorways, main roads, and residential roads) to identify whether dumpsites are more likely to appear in proximity to a specific road type. In Bratislava, around one-fifth of dumpsites were found in proximity to a main road (within 100 m), and dumpsites were more likely than Control Non-dumpsites to be reported next to a main road. However, most dumpsites are not reported on roadsides, and in fact, being further away from a road was found to be a slightly better predictor of where a dumpsite might occur.


Figure 1: Proximity to a nearest major road for dumpsites and control datasets


The location of TrashOut dumpsites across Bratislava, colored by reported size, is shown in Figure 2. The majority (around three-quarters) of dumpsites are estimated by TrashOut users to be too big to be taken away in a bag. Dumpsites of all sizes are found throughout the city, but the largest dumpsites tend to be further away from motorways.


Figure 2: Size of dumpsites in the city of Bratislava


Several types of waste were reported together within the TrashOut dumpsites. The number of dumpsites containing each type of waste is shown in the bar chart in Figure 3.1, whereas Figure 3.2 shows, as a matrix, the percentage of dumpsites containing several types of waste. The majority of reported dumpsites in Bratislava contain what TrashOut users describe as domestic waste. Domestic waste often coincides with plastic waste, which itself is found in around half of the dumpsites. Around one-third of dumpsites are reported to contain construction waste.



Figure 3: Waste types in the TrashOut datasets for Bratislava



Visualizing the distribution of dumpsite reports throughout the city with the spatial analysis undertaken can be informative in preparation to clean up existing dumpsites, as well as for identifying potential new hotspots. The following observations were drawn from this city-level geospatial analysis.

Information about the type and size of dumpsites may be important for local authorities and decision-makers to consider how best to clean up dumpsites. Having a spatial visualization of the locations and characteristics of each dumpsite across each city area, not only helps to inform management efforts to clean up existing dumpsites, but also to try minimizing potential new dumpsites by introducing bins for specific types of waste, or holding events to increase recycling awareness.

Plastic waste is found alongside other types in many dumpsites, which is not surprising. Waste that could be separated occurs together in reports: domestic with plastic, glass, or metal. This might suggest that infrastructure (i.e. waste collection facilities) is lacking, or that the population is not aware of waste sorting and recycling.

The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This might suggest that residents find construction waste difficult or costly to dispose of legally, or that construction companies are neglecting their responsibility to clean up.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of whether there are more dumpsites in those areas.

The use of these tools and analysis will always need to be supported by local knowledge, as well as with the involvement of local municipalities and authorities.


Sub-task 1.2: Predicting potential dumpsites

Features to train the Machine Learning model

The second subtask focused on creating a Machine Learning model that could predict whether a location is at risk of becoming a dumpsite. Since we have already seen, in Table 1, the variables considered to have a strong influence on dumpsites, these same variables could be used to predict whether a new location could turn into a dumpsite.

When acquiring the venue categories, we set a radius parameter in the Foursquare API defining the distance up to which venue category information is fetched. Although we created datasets with radii of 500 m, 1 km, 2 km, and 3 km, we concluded that the 1 km radius dataset was the most appropriate one, with the best model performance. It was not too near to the location from which the data was being collected, so no vital information was lost, and at the same time not so far that irrelevant information was fetched.

The features Number of Venue Categories, Nearest Venue Categories, and Frequent Venue Categories were therefore only acquired up to 1 km from a given location. Moreover, the five nearest venue categories and five most frequent venue categories were acquired for each location as separate variables. If the Foursquare API returned fewer than all five categories (or even none in some cases) within the 1 km radius, a None string was placed in the empty variables instead.

A similar approach was taken with the OSM library for the distance-to-roads features. The value was only collected for roads up to a 1 km radius from a location, with the exception of a few cases where the API returned a distance slightly beyond 1 km.

For the population density feature, our team discussed different approaches and eventually decided that, instead of using a single value for the population density of the given location, we should account for the surrounding population, since the probability of a dumpsite occurring in (or very close to) a location is also affected by its neighborhood. Therefore, if the location is in the center of a 1×1 km cell, the population densities of the eight 1×1 km cells around the center cell are also considered. This is a rather good way to see whether the dumpsite is in the middle of a highly populated area, on the outskirts of a city, or in the middle of nowhere. Using these nine population densities (with more weight on the center cell’s density), a population gradient is calculated for the location, which is given to the model as a separate feature in addition to the population density.
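The exact weighting scheme is not spelled out in the article, so the sketch below makes an assumption: the gradient feature is a weighted average over the 3×3 neighborhood of 1 km² density cells, with the center cell weighted more heavily than its eight neighbors.

```python
import numpy as np

def population_gradient(cells: np.ndarray, center_weight: float = 4.0) -> float:
    """Weighted average density over a 3x3 grid of 1 km^2 cells.

    `cells` is a 3x3 array of population densities; the center cell
    (the candidate location) gets `center_weight`, neighbors get 1.
    The weighting scheme here is an assumption, not the project's
    exact formula.
    """
    assert cells.shape == (3, 3)
    weights = np.ones((3, 3))
    weights[1, 1] = center_weight
    return float((cells * weights).sum() / weights.sum())

# Example: a dense center cell surrounded by sparser neighbors
grid = np.array([[100, 120,  90],
                 [110, 800,  95],
                 [105, 115,  85]])
print(population_gradient(grid))
```

A location in a uniformly dense area gets a gradient equal to its own density, while an isolated dense cell (as above) is pulled down toward its sparse surroundings.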

These were the 17 features that would be used for the Machine Learning model part of this subtask. But how do we teach the model what a dumpsite is?


Control dataset

We wanted to train a Machine Learning model such that it learns what constitutes a dumpsite from the 17 variables we gathered. We fetched and calculated the features for every one of the approximately 56,000 dumpsites in our TrashOut dataset. However, it is not possible to train a model just by showing it what a dumpsite is, because when we show a location that is highly unlikely to become a dumpsite, we want our model to confidently tell us so. An analogous comparison would be to show 56,000 cats to a child and then expect them to recognize that a dog is not a cat.

The solution lies in creating a control dataset. In order to teach the Machine Learning model what a dumpsite is, we also need to teach it to understand what a dumpsite is not.

For simplicity, we can also call the control dataset non-dumpsites. So how do we go about finding non-dumpsites? Any location that is not a dumpsite is, in essence, a non-dumpsite. However, that alone will not help the model learn meaningful differences between the two classes: dumpsites and non-dumpsites. Instead, what we can do is find geographical points close to the dumpsites that we already have and use them as the control dataset. Once again, we experimented with multiple distances from the dumpsite and found that a distance of 1 km works best. The advantages of choosing these points are:

  • The points are close enough to the dumpsites that there are subtle changes in the features, which the model can learn and appropriately map to the two classes.
  • The points are not so close that the model fails to see key differences between the features of the two classes.
  • The points are not so far away that there is no correlation between the two classes, which would defeat the purpose of the control dataset.
  • When we choose a point near a reported dumpsite, we assume that a location near a known dumpsite has active TrashOut users in the area, so if there was no dumpsite report, we assumed there was no dumpsite at that location.


Figure 4: Illustration of determining the approach to creating a control dataset.


We also took measures to make sure that the non-dumpsite point generated for each dumpsite did not contain another dumpsite, was not in the vicinity of another dumpsite, and was not in a major water body.

The control dataset was made for every dumpsite so that the two classes were balanced for binary classification. Additionally, all the features that were used in the dumpsite dataset were used for the control dataset as well.
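Generating a control point 1 km from a dumpsite can be sketched with the standard spherical destination-point formula: pick a pseudo-random bearing and walk 1,000 m along it. The dumpsite coordinates below are hypothetical; the vicinity and water-body checks described above are omitted here.

```python
from math import radians, degrees, sin, cos, asin, atan2
import random

EARTH_R = 6_371_000  # mean Earth radius in meters

def destination(lat, lon, bearing_deg, dist_m):
    """Point reached from (lat, lon) after dist_m meters on a given bearing."""
    phi1, lmb1 = radians(lat), radians(lon)
    theta = radians(bearing_deg)
    delta = dist_m / EARTH_R  # angular distance
    phi2 = asin(sin(phi1) * cos(delta) + cos(phi1) * sin(delta) * cos(theta))
    lmb2 = lmb1 + atan2(sin(theta) * sin(delta) * cos(phi1),
                        cos(delta) - sin(phi1) * sin(phi2))
    return degrees(phi2), degrees(lmb2)

def control_point(lat, lon, dist_m=1_000, seed=None):
    """Pseudo-random candidate non-dumpsite location dist_m away."""
    rng = random.Random(seed)
    return destination(lat, lon, rng.uniform(0, 360), dist_m)

# Hypothetical dumpsite location in Bratislava
print(control_point(48.1486, 17.1077, seed=42))
```

In practice each candidate would then be rejected and regenerated if it fell within the vicinity of any known dumpsite or inside a major water body.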



Our team investigated three different Machine Learning models throughout the project:

  1. Random forest classifier — This approach did not work: the model failed to capture the structure of the data and yielded very low accuracy. ❌
  2. Neural networks (dense and ensemble) — This series of models and their iterations did not work either, because the models overfit badly, making them unsuitable for real-world purposes. ❌
  3. Light Gradient Boosting Machine (LightGBM) — This was our final model. It had good accuracy and the smallest generalization error of the three models. ✅



The final accuracy of our model was 80% on the test set. We employed k-fold cross-validation to maximize the accuracy our model could achieve on the test set. We also observed how important the individual features were when classifying whether a given location is prone to becoming a dumpsite. This analysis was done with the help of SHAP and PDP plots, as shown in Figure 5 below.


Figure 5: Importance of every individual variable on the model performance.


Figure 6: Probability output change in the model induced by a change in distance to roads feature


The SHAP plot in Figure 5 shows the contribution of each feature to the prediction, i.e. the importance each feature has. A high importance value signifies that the model considers it a very important factor when determining whether a given location is a dumpsite or not. The plot indicates that the most important feature for both classes is the distance-to-road variable.

The Partial Dependence Plot in Figure 6 helps one understand the effect of a specific variable on the model output: as the value of a feature changes, its effect on the model changes accordingly. We computed these plots for all the numerical features to understand their effect on the model’s prediction. The one shown above analyzes the effect of the distance-to-roads variable. As can be observed, as the distance from a given location to a major road increases (positive x-axis), the probability that the location is a dumpsite decreases. The soft spike from 1500 m to 2500 m is due to how we manually placed the value of 2500 m in examples where the API could not find a road within that distance. Regardless, this situation can be handled manually in the deployment implementation.
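Partial dependence itself is simple to compute by hand: sweep one feature over a grid of values while holding the rest of each row fixed, and average the predicted probabilities. A sketch on a toy model, with feature index 0 standing in for distance-to-road:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def partial_dependence(model, X, feature, grid):
    """Average predicted probability of class 1 as `feature` sweeps `grid`."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v       # force the feature to a fixed value
        pd_values.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(pd_values)

# Toy model standing in for the dumpsite classifier
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence(model, X, 0, grid)
print(pd_curve.round(2))
```

Plotting `pd_curve` against `grid` reproduces the kind of curve shown in Figure 6; scikit-learn's `PartialDependenceDisplay` wraps the same computation.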

One of the key achievements of the team was generating a full-city heat map of Bratislava (most of our tests were based there) by running the model on more than 700 locations in the city. In the heat map, the actual dumpsites are plotted as blue markers, while the roads are drawn as black lines (major roads and highways are visibly thicker than minor roads). The color spectrum goes from whitish-yellow to dark red, with yellow regions representing a low probability of becoming a dumpsite and red regions a high probability. The heat map has many beneficial uses. For example, municipalities and local authorities can make smaller heat maps of regional neighborhoods to determine which areas are at high risk of becoming a dumpsite.

Another important use of this heat map is to combine it with valuable insights about socio-economic factors, population density, distance to roads, etc. Even though the model has considerably good accuracy, the wisdom still lies in the intuition of local authorities and municipalities. These officials are better equipped to identify key neighborhoods and areas that have a major road or highway nearby, a high population density, and a certain set of venue categories in close proximity. A heat map can then be generated for such an area, and specific regions can be identified that require immediate attention to mitigate the possibility of becoming a dumpsite.


Figure 7: Heat map of the city of Bratislava using the ML model to predict dumpsites


Sub-task 1.3: Preventing future dumpsites

Global analysis of illegal dumpsites

In order to analyze illegal dumpsites on a global scale, we combined the data from TrashOut with two other datasets of socio-economic indicators (the World Bank sources listed above).

From this setup, it was possible to divide the countries analyzed into four clusters, using unsupervised learning:


Figure 8: Global analysis clustering summary. Source: Omdena


Small population developed countries (Blue cluster): countries with a small population and population growth, but high urban populations and access to electricity. These countries also have low urban population growth and GDP. The countries in this cluster present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste.

High population developed countries (Orange cluster): countries with the highest GDP, and also high access to electricity, urban population, and tourism, with low inflation and urban population growth. These countries also produce high amounts of glass and paper/cardboard and are responsible for the highest production of special waste and total municipal solid waste. They are also associated with the lowest production of organic waste.

High population developing countries (Green cluster): countries with the highest population and inflation; moderate access to electricity and GDP, and growing urban and total population. They generate high amounts of organic, rubber/leather, and wood waste, and low amounts of glass. This can be associated with the high population being less concentrated in cities, and also with the level of industrialization of such countries, which may be among those that most produce factory materials such as leather and rubber.

Small population developing/low-income countries (Red cluster): countries with a small population and GDP, and the lowest electricity access, but the highest GDP and population (including urban) growth. Most of the waste produced in these countries is from food and organic sources. These countries also produce the lowest amounts of glass, metal, and rubber/leather waste. This scenario can be associated with the low populations and possibly with waste being used in a sustainable way, as these countries also present low production of green/yard and wood waste.
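The article does not name the clustering algorithm used; a common choice for this kind of country-level segmentation is k-means on standardized indicators, sketched below with synthetic data standing in for the real socio-economic table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for country-level indicators (population, GDP,
# electricity access, urban population share, tourism, inflation, ...)
rng = np.random.default_rng(0)
indicators = rng.normal(size=(120, 6))

# Standardize so indicators on different scales contribute equally,
# then partition the countries into four clusters
X = StandardScaler().fit_transform(indicators)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_
print(np.bincount(labels))  # number of countries per cluster
```

The four clusters described above would then be characterized by inspecting the cluster centers back in the original indicator units.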


Figure 9: Some features in the global analysis in country clusters.


Combining this analysis with the illegal dumpsite data from TrashOut, our team obtained the following insights:

  • Plastic waste is highly produced across all clusters, indicating that this kind of trash needs to have a global awareness strategy.
  • Different types of illegal trash are associated with rural and urban areas, making this distinction important on a global scale. Countries with a higher rural population (low income) will produce more illegal organic waste, whereas developed countries present more reports of illegal dumping of plastic, cardboard, yard/green waste, rubber/leather, and special waste.
  • Socio-economic factors such as infrastructure, sanitation, inflation, and tourism play a moderate role in the different production of illegal waste worldwide.
  • We identified population as the most important factor in the production of special waste, municipal solid waste, and the total amount of waste per year. However, most of the special waste produced in developed countries (where the majority is generated) is reported.



In this project, we analyzed data on local and global scales to understand which factors contribute to illegal dumping, as well as to predict it and find possible ways to avoid it.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of the number of dumpsites in those areas.

Nevertheless, visualizing the distribution of dumpsite reports with the spatial analysis undertaken can be informative for identifying potential new hotspots. The following observations can be extracted from our analysis:


On a city level

  • The prediction of the ML model and the heatmap can be used as tools for targeted waste management interventions, but will always need to be supported by local knowledge as well as with the involvement of local municipalities and authorities.
  • Main road and motorway junctions are locations where illegal disposal of waste is prone to occur. We witnessed this in Bratislava and Torreón.
  • Many reports occur in natural areas, e.g. watercourses or natural parks, as seen in Bratislava, Mamuju, and Campbell River. This may be due to two factors: ease of disposal without being caught, and the fact that people walking in those areas may be more environmentally aware and keen to preserve the places where they go to enjoy nature, and consequently create reports more often.
  • Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting. This is especially clear in Maputo city.
  • The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This applies to companies and individuals. Construction waste seems to be a problem for many cities.
  • TrashOut reports seem to be created in pulses, not on a regular basis. For all the cities examined, there are certain months when the number of reports created far exceeds the average. Moreover, reports tend to be created in specific parts of a city (e.g. London).


On a global scale

  • Plastic waste production is high across the globe, making it an international problem;
  • Among all the socio-economic factors, population plays a strong role in the production of waste in the world (legal and illegal);
  • The level of development and socio-economic factors (infrastructure, sanitation, education, among others) play an important role in the kind of waste produced by countries.
  • In particular:
  1. Small developed countries present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste
  2. High population developed countries: high amounts of glass and paper/cardboard and the highest production of special waste and total municipal solid waste. These countries are also associated with the lowest production of organic waste
  3. High population developing countries: high amounts of organic, rubber/leather and wood waste, and low amounts of glass;
  4. Small developing countries/low income: most of the waste produced in these countries is from food and organic sources.


Possible Factors to Avoid Illegal Dumping

  1. Clean-up events can be organized in areas where many dumpsites already exist, to clean up the targeted areas in a fun, interactive, and educational way. Collaboration with local authorities should be put in place to improve existing waste infrastructure and build new infrastructure if necessary.
  2. Areas identified as at high risk of becoming new dumpsites could be targeted with waste infrastructure development/enhancement programs. Additionally, learning events could be organized to raise awareness about dumpsite risks and about how to minimize, or avoid altogether, dumpsites by properly using the existing waste infrastructure.
  3. Examples of learning events include sessions on how to use waste infrastructure and recycling bins according to local/national regulations, the benefits of a sustainable way of living through the 4R cycle (Refuse, Reduce, Reuse, Recycle), and how to avoid single-use items (or a specific type of waste, as can be highlighted by a city-level analysis).

| Demo Day Insights | Matching Land Conflict Events to Government Policies via Machine Learning

By Laura Clark Murray, Joanne Burke, and Rishika Rupam


A team of AI experts and data scientists from 12 countries on 4 continents worked collaboratively with the World Resources Institute (WRI) to support efforts to resolve land conflicts and prevent land degradation.

The Problem: Land conflicts get in the way of land restoration

Among its many initiatives, WRI, a global research organization, is leading the way on land restoration — restoring land that has lost its natural productivity and is considered degraded. According to WRI, land degradation reduces the productivity of land, threatening the economy and people’s livelihoods. This can lead to reduced availability of food, water, and energy, and contribute to climate change.

Restoration can return vitality to the land, making it safe for humans, wildlife, and plant communities. While significant restoration efforts are underway around the world, local conflicts get in the way. According to John Brandt of WRI, “Land conflict, especially conflict over land tenure, is a really large barrier to the work that we do around implementing a sustainable land use agenda. Without having clear tenure or ownership of land, long-term solutions, such as forest and landscape restoration, often are not economically viable.”


Photo credit: India’s Ministry of Environment, Forest and Climate Change



And though governments have instituted policies to deal with land conflicts, knowing where conflicts are underway and how each might be addressed is not a simple task. Says Brandt, “Getting data on where these land conflicts, land degradation, and land grabs occur is often very difficult because they tend to happen in remote areas with very strong language barriers and strong barriers around scale. Events occur in a very distributed manner.” WRI turned to Omdena to use AI and natural language processing techniques to tackle this problem.


The Project Goal: Identify news articles about land conflicts and match them to relevant government policies



“We’re very excited that the results from this partnership were very accurate and very useful to us. We’re currently scaling up the results to develop sub-national indices of environmental conflict for both Brazil and Indonesia, as well as validating the results in India with data collected in the field by our partner organizations. This data can help supply chain professionals mitigate risk in regards to product-sourcing. The data can also help policymakers who are engaged in active management to think about what works and where those things work.” — John Brandt, World Resources Institute.


The Use Case: Land Conflicts in India

In India, the government has committed 26 million hectares of land for restoration by the year 2030. India is home to a population of 1.35 billion people, has 28 states, 22 languages, and more than 1000 dialects. In a land as vast and varied as India, gathering and collating information about land conflicts is a monumental task.

The team looked to news stories, with a collection of 65,000 articles from India for the years 2017–2018, extracted by WRI from GDELT, the Global Database of Events Language and Tone Project.


Identifying news articles about land conflicts

Land conflicts around land ownership include those between the government and the public, as well as personal conflicts between landowners. Other types of conflicts include those between humans and animals, such as humans invading habitats of tigers, leopards, or elephants, and environmental conflicts, such as floods, droughts, and cyclones.



The team used natural language processing (NLP) techniques to classify each news article in the 65,000-article collection as pertaining to land conflict or not. While this problem could be tackled without any automation tools, it would take human beings years to go through and study each article, whereas with the right machine learning or deep learning model it takes mere seconds.

A subset of 1,600 newspaper articles from the collection was hand-labeled as “positive” or “negative” to serve as examples of proper classification. For example, an article about a tiger attack would be hand-labeled as “positive”, while an article about local elections would be labeled as “negative”.

To prepare the remaining 63,400 articles for an AI pipeline, each article was pre-processed to remove stop words, such as “the” and “in”, and to lemmatize words, returning them to their root form. Co-reference resolution was also applied during pre-processing to increase accuracy. A topic modeling approach was used to further categorize the “positive” articles by the type of conflict, such as Land, Forest, Wildlife, Drought, Farming, Mining, or Water. With refinement, the classification model achieved an accuracy of 97%.
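The exact model is not described here, so the sketch below uses a common baseline for this kind of binary text classification: a TF-IDF vectorizer (which handles stop-word removal) feeding a logistic regression. The toy articles and labels are hypothetical stand-ins for the hand-labeled data; lemmatization and co-reference resolution would need an extra library such as spaCy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for hand-labeled articles (1 = land conflict, 0 = not)
articles = [
    "villagers protest land acquisition for mining project",
    "tiger attack near forest village sparks conflict",
    "farmers dispute ownership of degraded farmland",
    "local elections held across the state this week",
    "cricket team wins the championship final",
    "new restaurant opens in the city center",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF drops English stop words and weights terms by rarity;
# logistic regression then learns the positive/negative boundary
clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression())
clf.fit(articles, labels)

print(clf.predict(["court hears land grab dispute between farmers"]))
```

With 63,400 real articles, the same pipeline shape applies; the 97% figure reported above came from the team's refined model, not from a baseline like this one.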



With the subset of land conflict articles successfully identified, NLP models were built to identify four key components within each article: actors, quantities, events, and locations. To train the model, the team hand-labeled 147 articles with these components. Using an approach called Named Entity Recognition, the model processed the database of “positive” articles to flag these four components.




Matching land conflict articles to government policies

Numerous government policies exist to deal with land conflicts in India. The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.



A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

The Omdena team built a visual dashboard to display the land conflict events and the matching government policies. In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.



Underlying this dashboard are the NLP models that classify news articles related to land conflict and land degradation, and match them to the appropriate government policies.



The results of this pilot project have been used by the World Resources Institute to inform their next stage of development.



Matching Land Conflict Events to Government Policies via Machine Learning | World Resources Institute


By Laura Clark Murray, Nikhel Gupta, Joanne Burke, Rishika Rupam, Zaheeda Tshankie



Project Overview

This project aimed to provide a proof-of-concept machine-learning-based methodology to identify land conflict events in a geography and match those events to relevant government policies. The overall objective is to offer a platform where policymakers can be made aware of land conflicts as they unfold and identify existing policies relevant to the resolution of those conflicts.

Several Natural Language Processing (NLP) models were built to identify and categorize land conflict events in news articles and to match those land conflict events to relevant policies. A web-based tool that houses the models allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The geographic scope of the project was limited to India, which has more environmental (land) conflicts than any other country.



Degraded land is “land that has lost some degree of its productivity due to human-caused processes”, according to the World Resources Institute. Land degradation affects 3.2 billion people and costs the global economy about 10 percent of the gross product each year. While dozens of countries have committed to restoring 350 million hectares of degraded land, land disputes are a major barrier to effective implementation. Without streamlined access to land-use rights, landowners are not able to implement sustainable land-use practices. In India, where 21 million hectares of land have been committed to restoration, land conflicts affect more than 3 million people each year.

AI and machine learning offer tremendous potential not only to identify land-use conflict events but also to match suitable policies for their resolution.


Data Collection

All data used in this project is in the public domain.

  • News Article Corpus: Contained 65,000 candidate news articles from Indian and international newspapers from the years 2008, 2017, and 2018. The articles were obtained from the Global Database of Events Language and Tone Project (GDELT), “a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages.” All the text was either originally in English or translated to English by GDELT.

  • Annotated Corpus: Approximately 1,600 news articles from the full News Article Corpus were manually labeled and double-checked as Negative (no conflict news) and Positive (conflict news).
  • Gold Standard Corpus: An additional 200 annotated positive conflict news articles, provided by WRI.
  • Policy Database: Collection of 19 public policy documents related to land conflicts, provided by WRI.




Text Preparation


In this phase, the articles of the News Article Corpus and policy documents of the Policy Database were prepared for the natural language processing models.

The articles and policy documents were processed using SpaCy, an open-source library for natural language processing, to achieve the following:

  • Tokenization: Segmenting text into words, punctuation marks, and other elements.
  • Part-of-speech (POS) tagging: Assigning word types to tokens, such as “verb” or “noun”
  • Dependency parsing: Assigning syntactic dependency labels to describe the relations between individual tokens, such as “subject” or “object”
  • Lemmatization: Assigning the base forms of words, regardless of tense or plurality
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
  • Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies, or locations.


Coreference resolution was applied to the processed text data using Neuralcoref, which is based on an underlying neural net scoring model. With coreference resolution, all common expressions that refer to the same entity were located within the text. All pronominal words in the text, such as her, she, he, his, them, their, and us, were replaced with the nouns to which they referred.


For example, consider this sample text:

“Farmers were caught in a flood. They were tending to their field when a dam burst and swept them away.”

Neuralcoref recognizes “Farmers”, “they”, “their” and “them” as referring to the same entity. The processed sentence becomes:

“Farmers were caught in a flood. Farmers were tending to their field when a dam burst and swept farmers away.”


Coreference resolution of sample sentences



Document Classification


The objective of this phase was to build a model to categorize the articles in the News Article Corpus as either “Negative”, meaning they were not about conflict events, or “Positive”, meaning they were about conflict events.

After preparation of the articles in the News Article Corpus, as described in the previous section, the texts were then prepared for classification.

First, an Annotated Corpus was formed to train the classification model. A 1,600 article subset of the News Article Corpus was manually labeled as “Negative” or “Positive”.

To prepare the articles in both the News Article Corpus and Annotated Corpus for classification, the previously pre-processed text data of the articles was represented as vectors using the Bag of Words approach. With this approach, the text is represented as a collection, or “bag”, of the words it contains along with the frequency with which each word appears. The order of words is ignored.

For example, consider a text article comprised of these two sentences:

Sentence 1: “Zahra is sick with a fever.”

Sentence 2: “Arun is happy he is not sick with a fever.”

This text contains a total of ten words: “Zahra”, “is”, “sick”, “happy”, “with”, “a”, “fever”, “not”, “Arun”, “he”. Each sentence in the text is represented as a vector, where each index in the vector indicates the frequency that one particular word appears in that sentence, as illustrated below.




With this technique, each sentence is represented by a vector, as follows:

“Zahra is sick with a fever.”

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“Arun is happy he is not sick with a fever.”

[0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
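The construction above can be reproduced in a few lines of Python. Note that the vocabulary is fixed here to match the article’s word order; in practice, a library such as scikit-learn builds the vocabulary automatically:

```python
# Vocabulary in the same order as the worked example above
VOCAB = ["Zahra", "is", "sick", "happy", "with", "a", "fever", "not", "Arun", "he"]

def vectorize(sentence, vocab=VOCAB):
    """Count vector: frequency of each vocabulary word in the sentence (word order ignored)."""
    words = sentence.rstrip(".").split()
    return [words.count(w) for w in vocab]

print(vectorize("Zahra is sick with a fever."))
# [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
print(vectorize("Arun is happy he is not sick with a fever."))
# [0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
```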

With the Annotated Corpus vectorized with this technique, the data was used to train a logistic regression classifier model. The trained model was then used with the vectorized data of the News Article Corpus, to classify each article into Positive and Negative conflict categories.

The accuracy of the classification model was measured by looking at the percentage of the following:

  • True Positive: Articles correctly classified as relating to land conflicts
  • False Positive: Articles incorrectly classified as relating to land conflicts
  • True Negative: Articles correctly classified as not being related to land conflicts
  • False Negative: Articles incorrectly classified as not being related to land conflicts


The “precision” of the model indicates how many of the articles classified as being about land conflict actually were about land conflict. The “recall” of the model indicates how many of the articles that were actually about land conflict were categorized correctly. An f1-score was calculated from the precision and recall scores.

The trained logistic regression model successfully classified the news articles with precision, recall, and f1-scores of 98% or greater. This indicates that the model produced a low number of false positives and false negatives.
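These three scores can be computed directly from the counts above. The counts in this sketch are hypothetical, not the project’s actual confusion matrix:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision: share of predicted positives that are correct.
    Recall: share of actual positives that are found.
    F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only
p, r, f = precision_recall_f1(tp=98, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.98 0.98 0.98
```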


Classification report using a test dataset and logistic regression model



Categorize by Land Conflicts Events

The objective of this phase was to build a model to identify the set of conflict events referred to in the collection of positive conflict articles and then to classify each positive conflict article accordingly.

A word cloud of the articles in the Gold Standard Corpus gives a sense of the content covered in the articles.

A topic model was built to discover the set of conflict topics occurring in the positive conflict articles. We took a semi-supervised approach to maximize the accuracy of the classification process, using CorEx (Correlation Explanation), a semi-supervised topic model that allows domain knowledge, specified as relevant keywords acting as “anchors”, to guide the topic analysis.

To align with the Land Conflicts Policies provided by WRI, seven relevant core land conflicts topics were specified. For each topic, correlated keywords were specified as “anchors” for the topic.




The trained topic model provided 3 words for each of the seven topics:

  • Topic #1: land, resettlement, degradation
  • Topic #2: crops, farm, agriculture
  • Topic #3: mining, coal, sand
  • Topic #4: forest, trees, deforestation
  • Topic #5: animal, attacked, tiger
  • Topic #6: drought, climate change, rain
  • Topic #7: water, drinking, dams
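As a much-simplified illustration of the anchoring idea (CorEx learns correlated word groups statistically; this sketch merely counts anchor-word hits, and the topic names and keyword lists are illustrative):

```python
# Anchor keywords per topic, loosely based on the topic words above
ANCHORS = {
    "Land":     ["land", "resettlement", "degradation"],
    "Farming":  ["crops", "farm", "agriculture"],
    "Mining":   ["mining", "coal", "sand"],
    "Forest":   ["forest", "trees", "deforestation"],
    "Wildlife": ["animal", "attacked", "tiger"],
    "Drought":  ["drought", "climate", "rain"],
    "Water":    ["water", "drinking", "dams"],
}

def assign_topic(text):
    """Assign the topic whose anchor words appear most often in the text."""
    words = text.lower().split()
    scores = {topic: sum(words.count(w) for w in kws) for topic, kws in ANCHORS.items()}
    return max(scores, key=scores.get)

print(assign_topic("a tiger attacked villagers near the forest edge"))  # Wildlife
```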

The resulting topic model is 93% accurate. This scatter plot uses word representations to provide a visualization of the model’s classification of the Gold Standard Corpus and hand-labeled positive conflict articles.


Visualization of the topic classification of the Gold Standard Corpus and Positive Conflict Articles



Identify the Actors, Actions, Scale, Locations, and Dates

The objective of this phase was to build a model to identify the actors, actions, scale, locations, and dates in each positive conflict article.

Typically, names, places, and famous landmarks are identified through Named Entity Recognition (NER). Recognition of such standard entities is built into SpaCy’s NER package, with which our model detected the locations and dates in the positive conflict articles. The specialized content of the news articles required further training with “custom entities”: those particular to the context of land conflicts.

All the positive conflict articles in the Annotated Corpus were manually labeled for “custom entities”:

  • Actors: Such as “Government”, “Farmer”, “Police”, “Rains”, “Lion”
  • Actions: Such as “protest”, “attack”, “killed”
  • Numbers: Number of people affected by a conflict

This example shows how this labeling looks for some text in one article:



These labeled positive conflict articles were used to train our custom entity recognizer model. That model was then used to find and label the custom entities in the news articles in the News Article Corpus.
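As a toy illustration of what flagging these custom entities looks like (the project trained a statistical custom NER model; this dictionary lookup is only a sketch, and the word lists are made up):

```python
# Hypothetical gazetteers; a trained NER model generalizes beyond fixed lists.
ACTORS = {"government", "farmer", "farmers", "police", "rains", "lion"}
ACTIONS = {"protest", "protested", "attack", "killed"}

def tag_entities(text):
    """Flag actors, actions, and numbers in a sentence by simple lookup."""
    entities = []
    for word in text.lower().replace(".", "").split():
        if word in ACTORS:
            entities.append((word, "ACTOR"))
        elif word in ACTIONS:
            entities.append((word, "ACTION"))
        elif word.isdigit():
            entities.append((word, "NUMBER"))
    return entities

print(tag_entities("Farmers protested and police killed 3 protesters."))
```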


Match Conflicts to Relevant Policies

The objective of this phase was to build a model to match each processed positive conflict article to any relevant policies.

The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.


Excerpt of a 2001 policy document related to agriculture



A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

Cosine similarity calculates similarity by measuring the cosine of an angle between two vectors. Using the vectorized text of the articles and the policy documents that had been generated in the previous phases as described above, the model generated a collection of matches between articles and policies.
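A minimal sketch of the cosine similarity computation, with small hypothetical count vectors standing in for a vectorized article and policy document:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|): 1.0 for identical direction, 0.0 for no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical count vectors for an article and a policy excerpt
article = [2, 1, 0, 3]
policy = [1, 1, 0, 2]
print(round(cosine_similarity(article, policy), 3))  # 0.982
```

Because the metric depends only on the angle between the vectors, a short article and a long policy document with similar word distributions still score highly.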


Visualization of Conflict Event and Policy Matching

The objective of this phase was to build a web-based tool for the visualization of the conflict event and policy matches.

An application was created using the Plotly Python Open Source Graphing Library. The web-based tool houses the models and allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The map displays land conflict events detected in the News Article Corpus for the selected years and regions of India.

Conflict events are displayed as color-coded dots on a map. The colors correspond to specific conflict categories, such as “Agriculture” and “Environmental”, and actors, such as “Government”, “Rebels”, and “Civilian”.

In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.




By selecting a particular category from the right column, only those conflicts related to that category are displayed on the map. Here only the Agriculture-related subset of the events shown in the previous example is displayed.



News articles from the select years and regions are displayed below the map. When a particular article is selected, the location of the event is shown on the map. The text of the article is displayed along with policies matched to the event by the underlying models, as seen in the example below of a 2018 agriculture-related conflict in the Andhra Pradesh region.



Here is a closer look at the article and matched policies in the example above.




Next Steps

This overview describes the results of a pilot project to use natural language processing techniques to identify land conflict events described in news articles and match them to relevant government policies. The project demonstrated that NLP techniques can be successfully deployed to meet this objective.

Potential improvements include refinement of the models and further development of the visualization tool. Opportunities to scale the project include building the library of news articles with those published from additional years and sources, adding to the database of policies, and expanding the geographic focus beyond India.

Opportunities to improve and scale the pilot project

Improve:

  • Refine the models
  • Further develop the visualization tool

Scale:

  • Expand the library of articles with content from additional years and sources
  • Expand the database of policies
  • Expand the geographic focus beyond India



About the Authors

  • Laura Clark Murray is the Chief Partnership & Strategy Officer at Omdena.
  • Nikhel Gupta is a physicist, a Postdoctoral Fellow at the University of Melbourne, and a machine learning engineer with Omdena.
  • Joanne Burke is a data scientist with MUFG and a machine learning engineer with Omdena.
  • Rishika Rupam is a Data and AI Researcher with Tilkal and a machine learning engineer with Omdena.
  • Zaheeda Tshankie is a Junior Data Scientist with Telkom and a machine learning engineer with Omdena.


Omdena Project Team

Kulsoom Abdullah, Joanne Burke, Antonia Calvi, Dennis Dondergoor, Tomasz Grzegorzek, Nikhel Gupta, Sai Tanya Kumbharageri, Michael Lerner, Irene Nanduttu, Kali Prasad, Jose Manuel Ramirez R., Rishika Rupam, Saurav Suresh, Shivam Swarnkar, Jyothsna sai Tagirisa, Elizabeth Tischenko, Carlos Arturo Pimentel Trujillo, Zaheeda Tshankie, Gabriela Urquieta



This project was done in collaboration with Kathleen Buckingham and John Brandt, our partners with the World Resources Institute (WRI).



About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through global bottom-up collaboration. Omdena is a partner of the United Nations AI for Good Global Summit 2020.

Generating Images with Just Noise using GANs



Using GAN networks for satellite image quality augmentation to identify trees next to power stations more accurately. The solution from this project helps to prevent power outages and fires sparked by falling trees and storms.


Using Generative Adversarial Network (GAN) for Data Augmentation


GAN stands for Generative Adversarial Network: essentially an application of game theory in which a pair of artificial neural networks compete with each other while being trained at the same time. One network tries to generate an image and the other tries to detect whether it is real or fake. It is conceptually simple, but effective too. This is clearer with an image:


But again, how can we use this to accomplish our goal? It turns out that there is a kind of GAN named pix2pix that is well suited to data augmentation. This kind of GAN can take as input a pre-defined sketch of the real image: take a doodle, and from there build a picture such as a landscape or anything you want. An example of this is the application Nvidia built to generate artificial landscapes. The link for the video is given here.

OK, so maybe this can work. At that moment the labeling team had already labeled some images, so if we used those labels to build doodles, we could use them to train a GAN to generate the images. It actually works!




So now we just need a way to generate random doodles to feed the pix2pix GAN. Here another GAN comes to the rescue, in this case a DCGAN. The idea was to generate a random doodle from random noise, getting something like this:



Finally, putting all the pieces together with the help of some Python and OpenCV code, we ended up with a script that generates a 100% random image from pure noise, along with the corresponding labels. At the moment we can generate thousands of synthetic images with their corresponding labels in a JSON file in COCO format. For the labels, we mask the colors in the doodle to derive the labels, and then build the synthetic images from the doodle.
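The color-masking step can be sketched as follows. The tree color and the tiny doodle here are hypothetical, and the project used OpenCV on real images rather than this pure-Python illustration:

```python
# Hypothetical color code: green pixels mark trees in the doodle
TREE_COLOR = (0, 255, 0)

def tree_bounding_box(image):
    """image: 2D list of (r, g, b) pixels. Return (x_min, y_min, x_max, y_max)
    of the tree-colored region, or None if no tree pixels are present."""
    coords = [(x, y) for y, row in enumerate(image)
                     for x, px in enumerate(row) if px == TREE_COLOR]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))

G, W = TREE_COLOR, (255, 255, 255)
doodle = [[W, G, G],
          [W, G, W],
          [W, W, W]]
print(tree_bounding_box(doodle))  # (1, 0, 2, 1)
```

Boxes like this can then be written out as annotations in the COCO JSON format alongside the synthetic image.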





For now the results look promising, but they are only preliminary and can be enhanced. For example, our labels only distinguished tree from non-tree; adding further labels, such as roads, fields, buildings, lakes, and rivers, would make the model more specific and accurate and let it generate those features as well.




Building An Elevation Map For Forest Cover using Deep Learning


The Problem: How to build an Elevation Map?

Now, how do we get to the elevation map? And what do we use it for?

Those were the main questions we asked ourselves in this project, which set out to improve the understanding of topographical maps using deep learning.

First, what is an Elevation Map anyway?



An elevation map shows the various elevations in a region. Elevation is shown using contour lines (level curves), bands of the same color (using imagery processing), or numerical values giving the exact elevation. Elevation maps are generally known as topographical maps.


The Solution: Generating Elevation Maps


Diagram of Elevation Map Process


What needs to be created?

  • A Digital Elevation Model (DEM).


Digital Elevation Model (example)


A Digital Elevation Model is a specialized database that represents the relief of a surface between points of known elevation. It is the digital representation of the land surface elevation.


  • Level Curves (contour line).



Contour lines are the most common method of showing relief and elevation on a standard topographical map. A contour line represents an imaginary line on the ground, above or below sea level. Contour lines form circles (or go off the map); the inside of the circle is the top of a hill.

We worked with the DEM to create the contour lines using open-source GIS software: QGIS with a plugin called “Contour”, which uses the elevations of the DEM to define the level curves and obtain a contour-line model of the study area. It is possible to define the distance between level curves, which in this case was two meters.


DEM converted into Contours.


  • Triangulated Irregular Network (TIN).




Next, we need a Triangular Irregular Network (TIN) with vector-based lines and three-dimensional coordinates (x,y,z). The TIN model represents a surface as a set of contiguous, non-overlapping triangles.

We applied computational geometry (Delaunay Triangulation) to create the TIN.

Delaunay triangulations are widely used in scientific computing in many diverse applications. While there are numerous algorithms for computing triangulations, it is the favorable geometric properties of the Delaunay triangulation that make it so useful.

For modeling terrain or other objects with a set of sample points, the Delaunay Triangulation gives a good set of triangles to use as polygons in the model. In particular, the Delaunay Triangulation avoids narrow triangles (as they have large circumcircles compared to their area).
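The “avoids narrow triangles” property can be illustrated with the local rule behind Delaunay edge flips: of the two diagonals of a convex quadrilateral, pick the one whose two triangles maximize the minimum angle. This is only a sketch of the principle, not the triangulation code we used (which came from GIS tooling); the quadrilateral is made up:

```python
import math

def angles(a, b, c):
    """Interior angles (degrees) of triangle abc via the law of cosines."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    A = math.degrees(math.acos((ab**2 + ca**2 - bc**2) / (2 * ab * ca)))
    B = math.degrees(math.acos((ab**2 + bc**2 - ca**2) / (2 * ab * bc)))
    return A, B, 180.0 - A - B

def best_diagonal(a, b, c, d):
    """For convex quad a,b,c,d, pick the diagonal whose two triangles
    maximize the minimum angle: the local rule behind Delaunay flips."""
    min_ac = min(angles(a, b, c) + angles(a, c, d))
    min_bd = min(angles(a, b, d) + angles(b, c, d))
    return "AC" if min_ac >= min_bd else "BD"

# Diagonal BD wins here: diagonal AC would create a narrow 18.4-degree triangle
print(best_diagonal((0, 0), (2, 0), (2, 2), (0, 1)))  # BD
```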


  • Digital Terrain Model (DTM).




A DTM is a vector dataset composed of regularly spaced points and natural features such as ridges and break lines. A DTM augments a DEM by including linear features of the bare earth terrain.

DTMs are typically created through stereophotogrammetry, but in this case we downloaded a point surface of the terrain.

The points are LiDAR points: a collection of points that represents a 3D shape or feature. Each point has its own X, Y, and Z coordinates and, in some cases, additional attributes. We can think of a point cloud as a collection of many such points, which can be converted into a DTM using open-source GIS software.


The Results

After applying previous techniques in the context of identifying trees, we got the following results.


  • Our Digital Elevation Model


Digital Elevation Model of the Study Area


  • The Contour Lines


Level Curves of the Study Area


  • Our Triangulated Irregular Network


TIN of the Study Area


  • The Digital Terrain Model


DTM of the Study Area



These results are part of an overall solution to identify trees close to power stations. They allow us to find and determine the elevations of the forest cover, as well as the aspect of the land, whether for geological purposes (determination of land use), forestry (control of protected natural areas), or the prevention of fires in high-risk areas (anthropic causes).

The elevation map can be of great help to public agencies to perform spatial analysis, manage large amounts of spatial data, and produce cartographically appealing maps to aid decision making. It improves responders’ efficiency and performance by giving them rapid access to critical data during an incident.




Using Neural Networks to Predict Droughts, Floods and Conflict Displacements in Somalia



The Problem


Millions of people are forced to leave their homes and communities due to resource shortages and natural disasters such as droughts and floods. Our project partner, UNHCR, provides assistance and protection for those who are forcibly displaced inside Somalia.

The goal of this challenge was to create a solution that quantifies the influence of climate change anomalies on forced displacement and/or violent conflict through satellite imaging analysis and neural networks for Somalia.


The Data 

The UNHCR Innovation team provided the displacement dataset, which contains:

Month End, Year Week, Current (Arrival) Region, Current (Arrival) District, Previous (Departure) Region, Previous (Departure) District, Reason, Current (Arrival) Priority Need, and Number of Individuals. These internal displacements have been recorded weekly since 2016.

While searching for ways to extract the data, we learned about NDVI (Normalized Difference Vegetation Index) and NDWI (Normalized Difference Water Index).

Our focus was on finding a way to apply NDVI and NDWI to satellite imagery and neural networks to help prevent climate change disasters.
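Both indices are simple band ratios. A minimal sketch, with hypothetical reflectance values (NDWI here is the McFeeters green/NIR variant; a NIR/SWIR variant also exists):

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return (nir - red) / (nir + red)

def ndwi(green, nir):
    """Normalized Difference Water Index (McFeeters): high for open water."""
    return (green - nir) / (green + nir)

# Hypothetical surface reflectances for a vegetated pixel
print(round(ndvi(nir=0.5, red=0.1), 2))    # 0.67: strong vegetation signal
print(round(ndwi(green=0.2, nir=0.5), 2))  # -0.43: not open water
```

In practice these formulas are applied per pixel to the relevant bands of Landsat or MODIS imagery.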

Imagery sources included Landsat (EarthExplorer) and MODIS, along with hydrology data (e.g. river levels and river discharge as indications of floods/drought) and settlement/shelter data from the GEO portal. These images have 13 bands and take up around 1 GB of storage space per image.

Also, the National Environmental Satellite, Data, and Information Service (NESDIS) and National Oceanic and Atmospheric Administration (NOAA) offer very interesting data like Somalia Vegetation Health print screens taken from STAR — Global Vegetation Health Products.




By looking at the points in the picture above, we figured that the Vegetation Health Index (VHI) could be correlated with people displacement.


We found an interesting chart that captured our attention. To retrieve the data:

  • Go to STAR’s web page.
  • Click on Data type and select which kind of data you want
  • Check the following image




  •  Click on the region of interest and follow the steps below





VHI indices, weekly since 1984



STAR’s web page provides SMN, SMT, VCI, TCI, and VHI indices weekly since 1984, split by province.

SMN = provincial mean NDVI with noise reduced
SMT = provincial mean brightness temperature with noise reduced
VCI = Vegetation Condition Index (VCI < 40 indicates moisture stress; VCI > 60: favorable condition)
TCI = Thermal Condition Index (TCI < 40 indicates thermal stress; TCI > 60: favorable condition)
VHI = Vegetation Health Index (VHI < 40 indicates vegetation stress; VHI > 60: favorable condition)

Drought vegetation

VHI<15 indicates drought from severe-to-exceptional intensity

VHI<35 indicates drought from moderate-to-exceptional intensity

VHI>65 indicates good vegetation condition

VHI>85 indicates very good vegetation condition
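These thresholds translate directly into a small classification function (a sketch of the categories above; the label for the 35 to 65 range is our own wording):

```python
def vhi_category(vhi):
    """Map a Vegetation Health Index value to the drought categories above."""
    if vhi < 15:
        return "severe-to-exceptional drought"
    if vhi < 35:
        return "moderate-to-exceptional drought"
    if vhi > 85:
        return "very good vegetation condition"
    if vhi > 65:
        return "good vegetation condition"
    return "fair condition"

print(vhi_category(12))  # severe-to-exceptional drought
print(vhi_category(72))  # good vegetation condition
```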

In order to derive insights from the findings, the following questions needed to be answered.

Does vegetation health correlate with displacements? And is there a lag between vegetation health and observed displacement? The visualizations below provide answers.
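As a minimal sketch of how such a lag could be tested numerically (the weekly series here are toy values, not the actual UNHCR or STAR data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(vhi, displaced, lag):
    """Correlate VHI with displacements recorded `lag` weeks later."""
    if lag == 0:
        return pearson(vhi, displaced)
    return pearson(vhi[:-lag], displaced[lag:])

# Toy weekly series: displacement rises after VHI drops
vhi = [60, 55, 40, 20, 15, 30, 50]
displaced = [100, 120, 150, 300, 800, 900, 500]
print(round(lagged_correlation(vhi, displaced, lag=2), 2))  # -0.69
```

A strongly negative coefficient at some lag would suggest that poor vegetation health precedes displacement by roughly that many weeks.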


Correlation between Vegetation Health Index values of Shabeellaha Hoose and the number of individuals registered due to Conflict/Insecurity.



Correlation between the Number of Individuals from Hiiraan Displacements caused by flood and VHI data.



Correlation between the number of individuals displaced from Sool due to drought and VHI data.



The Solution: Building the Neural Network

We developed a neural network that predicts the weekly VHI of Somalia using historical data as described above. You can find the model here.

The model achieves a validation loss of 0.030 and a training loss of 0.005. Below is the prediction of the neural network using test data.


Prediction versus the original value





