Demo Day Insights | Accelerating the Clean Energy Transition | World Energy Council


By Rosana de Oliveira Gomes

Two Omdena teams with a total of 50 AI experts and data scientists from 25 countries collaborated with the World Energy Council and the Nigerian NGO RA365 in carrying out data-driven analyses and providing AI solutions to address the Global Transition to Clean Energy.

At a recent Omdena Demo Day, team members Amardeep Singh, Julia Wabant, and Simon Mackenzie shared the results and insights gained from these two projects.

 

The Topic: Energy Transition

One of the Sustainable Development Goals adopted by all United Nations Member States in 2015 aims to ensure access to affordable, reliable, sustainable, and modern clean energy for all by 2030.

Transitioning into a society with cleaner energy is crucial for fighting climate change. Different parts of the world are currently at different stages of the energy transition. This shows both in how solutions are implemented in specific regions and in how societies perceive the transition. Both topics are addressed in the following two Omdena use cases.

 

1. Use Case: AI for Renewable Energy in Nigeria

 


 

Nigeria is one of the countries in the world facing the most severe energy challenges. Over half of the country’s population — 100 million people — lack access to electricity. Some of the problems faced by Nigerians include precarious electricity systems, unstable electricity supply, and electricity available only in certain locations.

An alternative to these problems is investing in local and renewable power solutions. Renewable Africa RA365 is an NGO with the mission to end energy poverty in Nigeria by leveraging innovative clean energy solutions and focusing on providing solar energy to vulnerable populations. In this project, the Omdena team partnered up with RA365 with the goal of identifying communities where solar panels would add the most value.

The first task in this challenge was to define what these areas should be: groups of about 4000 people living within a radius of about 500 m, and that are located more than 15 km away from a power grid. Regions close to schools, healthcare centers, and water locations were considered to have a higher ranking of priority, as they can benefit even more from renewable energy implementation.
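The selection criteria above can be sketched in a few lines of Python. This is only an illustration, not the team's code: the `Community` record, the one-point-per-amenity `priority` score, and the sample communities are all assumptions made up for this example. Only the thresholds (about 4000 people, more than 15 km from the grid) come from the text.

```python
from dataclasses import dataclass

@dataclass
class Community:
    name: str
    population: int      # people living within the ~500 m radius
    km_to_grid: float    # distance to the nearest power line
    near_school: bool
    near_clinic: bool
    near_water: bool

MIN_POPULATION = 4000    # "about 4000 people" in the text
MIN_KM_TO_GRID = 15.0    # "more than 15 km away from a power grid"

def is_candidate(c: Community) -> bool:
    """Apply the two hard criteria: enough people, far enough from the grid."""
    return c.population >= MIN_POPULATION and c.km_to_grid > MIN_KM_TO_GRID

def priority(c: Community) -> int:
    """Hypothetical score: one point per nearby school, clinic, or water point."""
    return int(c.near_school) + int(c.near_clinic) + int(c.near_water)

def rank_candidates(communities):
    """Keep only candidates and sort them from highest to lowest priority."""
    candidates = [c for c in communities if is_candidate(c)]
    return sorted(candidates, key=priority, reverse=True)

# Invented sample data, for illustration only.
communities = [
    Community("A", 4200, 22.0, True, True, False),
    Community("B", 5100, 8.0, True, True, True),     # too close to the grid
    Community("C", 4000, 30.5, False, False, False),
    Community("D", 1500, 40.0, True, False, True),   # too few people
    Community("E", 6300, 18.0, True, True, True),
]
ranked = [c.name for c in rank_candidates(communities)]
print(ranked)  # ['E', 'A', 'C']
```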

One of the biggest challenges in the project was the lack of population density data, which made it hard to identify where people need assistance. To find out how the population is distributed in Nigeria and determine who is without access to electricity, the team compared nighttime satellite imagery from NASA Black Marble VIIRS against the geographic location of the population using the Demographic and Health Surveys (DHS) program, ground surveys from WorldPop, and the GRID3 dataset. To locate the national grid, and therefore find where people live in relation to existing power lines, the team applied Machine Learning techniques to satellite images from the HV grid from Development Seed/World Bank.

 


Combined information from two satellite data sources, averaged over a large number of nights and seasons.

 


 

Finally, the team worked on finding which of these towns without electricity supply would meet the criteria established for a local solar energy system. This was done by clustering the population into groups of about 4000 people within a 500 m radius using the DBSCAN clustering technique, leading to the identification of over a thousand high-potential regions.
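To make the clustering step more concrete, here is a minimal pure-Python sketch of a population-weighted DBSCAN. It is not the team's implementation. It assumes each point is a small settlement given as planar (x, y) coordinates in metres with a population count, and it treats a point as a "core" point when at least 4000 people live within 500 m of it. The sample points are invented. In practice a library implementation such as scikit-learn's `DBSCAN` would likely be a better choice.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps metres of point i (including i)."""
    xi, yi, _ = points[i]
    return [j for j, (xj, yj, _) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps=500.0, min_people=4000):
    """Weighted DBSCAN over points given as (x_m, y_m, population).

    Returns one label per point: a cluster id (0, 1, ...) or -1 for noise.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if sum(points[j][2] for j in neighbours) < min_people:
            labels[i] = NOISE              # may later become a border point
            continue
        cluster += 1                       # point i is a core point
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nearby = region_query(points, j, eps)
            if sum(points[k][2] for k in nearby) >= min_people:
                queue.extend(nearby)       # j is also a core point: expand
    return labels

# Invented sample data: (x in metres, y in metres, population).
points = [
    (0, 0, 1500), (200, 100, 1800), (350, -150, 1200),   # dense settlement
    (10000, 10000, 2500), (10300, 10100, 2000),          # second settlement
    (50000, 0, 900),                                     # isolated hamlet
]
labels = dbscan(points)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Points labelled -1 do not reach the population threshold within the radius and are left out of every cluster.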

 


 

Clusters of towns with populations of 4,000 to 15,000 people that are suitable for potential off-grid solar installations in the North of Nigeria.

The Omdena team’s deliverable for this project: a prototype interactive map of the whole of Nigeria identifying the regions with a high demand for electricity and a high potential for solar.

The next steps for this project include a detailed survey for the top target areas in order to identify which locations are most suitable both in terms of infrastructure and cost for implementation of solar systems.

A detailed description of this project and its documentation are available in other Omdena publications. See more about the background of this project in this Omdena article.

 

The Impact

The initiative taken by Omdena and Renewable Africa RA365 has the potential of enabling data-driven investments and policy-making that can change the lives of many people in Nigeria and other African countries.

The data and prototype of this project have been shared with the Lagos State Government agency for solar systems, which is now willing to start the process of mass production already in 2020.

“In order to get this job done, it is not all about providing solutions to these people. We want to make sure that the solutions get to the right people at the right places, and Omdena has really helped us to achieve that.”

Joseph Itopa, Machine Learning Engineer at Renewable Africa RA365

 

2. Use Case: Sentiment Analysis on Energy Transition

The transition away from dependency on CO2 to a more sustainable society dominates the news headlines worldwide, exposing conflicting opinions and political measures driving towards a future with cleaner energy sources. Understanding the clean energy transition at a human-level is crucial to the effectiveness of whatever steps are taken in the direction of a carbon-free society.

Commissioned by the World Energy Council, the world’s leading member-based global energy network, Omdena explored applications of AI in understanding how people in different regions of the world perceive the energy transition and their role in it.

Using natural language processing (NLP) techniques, the team created tools to collect, scrape, and analyze text about the clean energy transition found on different social media sources (Twitter, YouTube, Facebook, Reddit, and famous newspapers). This text data was analyzed using varying methods, such as sentiment analysis, topic modeling, and clustering to reveal the challenges, reactions, and attitudes of citizens around the world.
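As a rough illustration of what sentiment analysis on short posts involves, here is a toy lexicon-based scorer. It is not the method the team used, which is not detailed in this article. The word lists and example posts are invented for this sketch, and real projects typically rely on trained models or established lexicons.

```python
import re

# Tiny made-up lexicons, for illustration only.
POSITIVE = {"clean", "affordable", "reliable", "good", "great", "cheaper"}
NEGATIVE = {"expensive", "unreliable", "bad", "outage", "blackout", "costly"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("Solar is clean and affordable"))             # positive
print(label("Another blackout, power is so unreliable"))  # negative
print(label("Energy transition debate continues"))        # neutral
```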

 


Topic “Energy transition” for the USA on Reddit.

 

Visualizations of the results allow for comparisons of sentiments across nations and societies. The analysis first focused on English-speaking countries, as this provides a common basis for comparing text. The countries chosen to represent different continents and development backgrounds were the USA (America), the UK (Europe), Nigeria (Africa), and India (Asia).

 


Data Analysis of Twitter data.

 

The word cloud representation of the results shows that among the 4 countries investigated, only Nigeria has prominent tweets about “electricity supply”. Similarly, “gas prices” are specific to the USA. However, “renewable energy” is present in all 4 countries.

Part of the analysis was also expanded to other countries and languages, gathering and analyzing tweets related to complaints about “renewable energy cost” in more than 20 countries. The results revealed how local conditions and culture can differ significantly from place to place. For example, “technology” was the most relevant concern in the complaint tweets in Brazil and France, whereas in Nigeria these tweets were focused solely on “policy”.

 


Complaints about Energy Transition

 

Other short and detailed discussions about this project can be found in Omdena publications.

 

The Impact

Though broad conclusions cannot be drawn from these isolated collections of data, the results point to models and data sets that are promising for further development. The analysis carried out by the Omdena team allowed for a better understanding of how natural language processing techniques can be used to capture the opinions and concerns of people worldwide about the clean energy transition.

“The Council has been interested in how public sentiment on energy issues might be tracked, or if this were even possible. That is where this project came in — the team at Omdena explored the broad brief and have proven that the conceptual idea is possible.”

Martin Young, Senior Director at the World Energy Council

 

The demo day recording

 

 

Collaborators from this project

We thank our partner organizations, Renewable Africa 365 and the World Energy Council, as well as all Omdena collaborators (listed below) who made the project a success.

 

Omdena team members, on the Renewable Energy Nigeria project:

  • Anastasis Stamatis, Greece
  • Daniil Khodosko, Canada
  • Peace Bakare, Nigeria
  • John Wu, Australia
  • Siddharth Srivastava, India
  • Simon Mackenzie, UK
  • Hoa Nguyen, Vietnam
  • Takashi Daido, Japan
  • Jessica Alecci, Netherlands/Italy
  • Jack David, UK
  • Shubham Bindal, India
  • Deborah David, France
  • Qi Han, Singapore
  • Stefan Hrouda-Rasmussen, Denmark
  • Varun G P, India
  • Ifeoma Okoh (Ify), Nigeria
  • Suraiya Khan, Canada
  • Ivan Tzompov, Bulgaria
  • Henrique Mendonca, Switzerland
  • Himadri Mishra, India
  • Sai Praveen, India
  • Jaikanth J, India
  • Krithiga Ramadass, India

 

Omdena team members, on the Energy Transition Social Sentiment project:

  • Syed Hassan, UAE
  • Julia Jakubczak, Poland
  • Marek Cichy, Poland
  • Krithiga Ramadass, India
  • Abhishek Deshpande, India
  • Julia Wabant, France
  • Simon Mackenzie, UK
  • Alejandro Bautista Ramos, Mexico
  • Irune Lansorena Sanchez, Spain
  • Vishal Ramesh, India
  • Elizabeth Tishenko, Poland
  • Shashank Agrawal, India
  • Ilias Papadopoulos, Greece
  • Aqueel Jivan, USA
  • Nicholas Musau, Kenya
  • Matteo Bustreo, Italy
  • Mahzad Khoshlessan, USA
  • Yamuna Dulanjani, Sri Lanka
  • Fiona, USA
  • Murindanyi Sudi, Rwanda
  • Raghhuveer Jaikanth, India
  • Abhishek Gupta, USA
  • Aboli Marathe, India
  • Momodou B Jallow, China
  • Jordi Frank, USA
  • Amardeep Singh, Canada
  • Julie Maina, Kenya
 
 
 
 
 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

Demo Day Insights | Matching Land Conflict Events to Government Policies via Machine Learning


By Laura Clark Murray, Joanne Burke, and Rishika Rupam

 

A team of AI experts and data scientists from 12 countries on 4 continents worked collaboratively with the World Resources Institute (WRI) to support efforts to resolve land conflicts and prevent land degradation.

The Problem: Land conflicts get in the way of land restoration

Among its many initiatives, WRI, a global research organization, is leading the way on land restoration — restoring land that has lost its natural productivity and is considered degraded. According to WRI, land degradation reduces the productivity of land, threatening the economy and people’s livelihoods. This can lead to reduced availability of food, water, and energy, and contribute to climate change.

Restoration can return vitality to the land, making it safe for humans, wildlife, and plant communities. While significant restoration efforts are underway around the world, local conflicts get in the way. According to John Brandt of WRI, “Land conflict, especially conflict over land tenure, is a really large barrier to the work that we do around implementing a sustainable land use agenda. Without having clear tenure or ownership of land, long-term solutions, such as forest and landscape restoration, often are not economically viable.”

 

Photo credit: India’s Ministry of Environment, Forest and Climate Change


 

And though governments have instituted policies to deal with land conflicts, knowing where conflicts are underway and how each might be addressed is not a simple task. Says Brandt, “Getting data on where these land conflicts, land degradation, and land grabs occur is often very difficult because they tend to happen in remote areas with very strong language barriers and strong barriers around scale. Events occur in a very distributed manner.” WRI turned to Omdena to use AI and natural language processing techniques to tackle this problem.

 

The Project Goal: Identify news articles about land conflicts and match them to relevant government policies

 

Impact

“We’re very excited that the results from this partnership were very accurate and very useful to us.

We’re currently scaling up the results to develop sub-national indices of environmental conflict for both Brazil and Indonesia, as well as validating the results in India with data collected in the field by our partner organizations. This data can help supply chain professionals mitigate risk in regards to product-sourcing. The data can also help policymakers who are engaged in active management to think about what works and where those things work.” — John Brandt, World Resources Institute.

 

The Use Case: Land Conflicts in India

In India, the government has committed 26 million hectares of land for restoration by the year 2030. India is home to a population of 1.35 billion people, has 28 states, 22 languages, and more than 1000 dialects. In a land as vast and varied as India, gathering and collating information about land conflicts is a monumental task.

The team looked to news stories, with a collection of 65,000 articles from India for the years 2017–2018, extracted by WRI from GDELT, the Global Database of Events Language and Tone Project.

 

Identifying news articles about land conflicts

Land conflicts around land ownership include those between the government and the public, as well as personal conflicts between landowners. Other types of conflicts include those between humans and animals, such as humans invading habitats of tigers, leopards, or elephants, and environmental conflicts, such as floods, droughts, and cyclones.

 

 

The team used natural language processing (NLP) techniques to classify each news article in the 65,000 article collection as pertaining to land conflict or not. While this problem can be tackled without the use of any automation tools, it would take human beings years to go through each article and study it, whereas, with the right machine or deep learning model, it would take mere seconds.

A subset of 1,600 newspaper articles from the collection was hand-labeled as “positive” or “negative”, to act as examples of proper classification. For example, an article about a tiger attack would be hand-labeled as “positive”, while an article about local elections would be labeled as “negative”.

To prepare the remaining 63,400 articles for an AI pipeline, each article was pre-processed to remove stop words, such as “the” and “in”, and to lemmatize words to return them to their root form. Co-referencing pre-processing was used to increase accuracy. A topic modeling approach was used to further categorize the “positive” articles by the type of conflict, such as Land, Forest, Wildlife, Drought, Farming, Mining, Water. With refinement, the classification model achieved an accuracy of 97%.
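The stop-word removal and lemmatization steps can be illustrated with a short sketch. The stop-word set and the `LEMMAS` lookup table below are tiny, hand-written stand-ins chosen for this example. A real pipeline would use a full NLP library rather than a lookup table.

```python
import re

# Small illustrative stop-word list (a real list is much longer).
STOP_WORDS = {"the", "in", "a", "an", "of", "to", "and", "were", "was", "is"}

# Toy lookup table standing in for a real lemmatizer.
LEMMAS = {
    "farmers": "farmer",
    "protested": "protest",
    "fields": "field",
    "villages": "village",
}

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map words to root forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The farmers protested in the villages."))
# ['farmer', 'protest', 'village']
```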

 

 

With the subset of land conflict articles successfully identified, NLP models were built to identify four key components within each article: actors, quantities, events, and locations. To train the model, the team hand-labeled 147 articles with these components. Using an approach called Named Entity Recognition, the model processed the database of “positive” articles to flag these four components.

 

 

 

Matching land conflict articles to government policies

Numerous government policies exist to deal with land conflicts in India. The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

 

A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

The Omdena team built a visual dashboard to display the land conflict events and the matching government policies. In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

Underlying this dashboard are the NLP models that classify news articles related to land conflict and land degradation, and match them to the appropriate government policy.

 

 

The results of this pilot project have been used by the World Resources Institute to inform their next stage of development.

Join one of our upcoming demo days to see the power of Collaborative AI in action.

Want to watch the full demo day?

Check out the entire recording (including a live demonstration of the tool).

 

Matching Land Conflict Events to Government Policies via Machine Learning | World Resources Institute


By Laura Clark Murray, Nikhel Gupta, Joanne Burke, Rishika Rupam, Zaheeda Tshankie

 

Download the PDF version of this whitepaper here.

Project Overview

This project aimed to provide a proof-of-concept machine-learning-based methodology to identify land conflict events in a given geography and match those events to relevant government policies. The overall objective is to offer a platform where policymakers can be made aware of land conflicts as they unfold and identify existing policies that are relevant to the resolution of those conflicts.

Several Natural Language Processing (NLP) models were built to identify and categorize land conflict events in news articles and to match those land conflict events to relevant policies. A web-based tool that houses the models allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The geographic scope of the project was limited to India, which has the most environmental (land) conflicts of all countries on Earth.

 

Background

Degraded land is “land that has lost some degree of its productivity due to human-caused process”, according to the World Resources Institute. Land degradation affects 3.2 billion people and costs the global economy about 10 percent of the gross product each year. While dozens of countries have committed to restore 350 million hectares of degraded land, land disputes are a major barrier to effective implementation. Without streamlined access to land use rights, landowners are not able to implement sustainable land-use practices. In India, where 21 million hectares of land have been committed to the restoration, land conflicts affect more than 3 million people each year.

AI and machine learning offer tremendous potential not only to identify land-use conflict events but also to match suitable policies for their resolution.

 

Data Collection

All data used in this project is in the public domain.

  • News Article Corpus: Contained 65,000 candidate news articles from Indian and international newspapers from the years 2008, 2017, and 2018. The articles were obtained from the Global Database of Events Language and Tone Project (GDELT), “a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages.” All the text was either originally in English or translated to English by GDELT.
  • Annotated Corpus: Approximately 1,600 news articles from the full News Article Corpus were manually labeled and double-checked as Negative (no conflict news) and Positive (conflict news).
  • Gold Standard Corpus: An additional 200 annotated positive conflict news articles, provided by WRI.
  • Policy Database: Collection of 19 public policy documents related to land conflicts, provided by WRI.

 

Approach

 

Text Preparation

 

In this phase, the articles of the News Article Corpus and policy documents of the Policy Database were prepared for the natural language processing models.

The articles and policy documents were processed using SpaCy, an open-source library for natural language processing, to achieve the following:

  • Tokenization: Segmenting text into words, punctuation marks, and other elements.
  • Part-of-speech (POS) tagging: Assigning word types to tokens, such as “verb” or “noun”
  • Dependency parsing: Assigning syntactic dependency labels to describe the relations between individual tokens, such as “subject” or “object”
  • Lemmatization: Assigning the base forms of words, regardless of tense or plurality
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
  • Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies, or locations.

 

Coreference resolution was applied to the processed text data using Neuralcoref, which is based on an underlying neural net scoring model. With coreference resolution, all common expressions that refer to the same entity were located within the text. All pronominal words in the text, such as her, she, he, his, them, their, and us, were replaced with the nouns to which they referred.

 

For example, consider this sample text:

“Farmers were caught in a flood. They were tending to their field when a dam burst and swept them away.”

Neuralcoref recognizes “Farmers”, “they”, “their” and “them” as referring to the same entity. The processed sentence becomes:

“Farmers were caught in a flood. Farmers were tending to their field when a dam burst and swept farmers away.”

 

Coreference resolution of sample sentences
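The substitution step in the example above can be sketched as follows. This sketch does not perform coreference resolution itself: the `REFERENTS` mapping is written by hand for the sample text, whereas a model such as Neuralcoref would infer the clusters automatically. The possessive “their” is left untouched so that the result matches the processed sentence shown above.

```python
import re

# Hand-written mapping for the sample text only. A coreference model
# would produce this pronoun-to-noun link automatically.
REFERENTS = {"they": "farmers", "them": "farmers"}

def replace_pronouns(text, referents):
    """Replace each listed pronoun with its noun, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        noun = referents[word.lower()]
        return noun.capitalize() if word[0].isupper() else noun

    pattern = r"\b(" + "|".join(map(re.escape, referents)) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

sample = ("Farmers were caught in a flood. They were tending to their field "
          "when a dam burst and swept them away.")
resolved = replace_pronouns(sample, REFERENTS)
print(resolved)
```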

 

 

Document Classification

 

The objective of this phase was to build a model to categorize the articles in the News Article Corpus as either “Negative”, meaning they were not about conflict events, or “Positive”, meaning they were about conflict events.

After preparation of the articles in the News Article Corpus, as described in the previous section, the texts were then prepared for classification.

First, an Annotated Corpus was formed to train the classification model. A 1,600 article subset of the News Article Corpus was manually labeled as “Negative” or “Positive”.

To prepare the articles in both the News Article Corpus and Annotated Corpus for classification, the previously pre-processed text data of the articles was represented as vectors using the Bag of Words approach. With this approach, the text is represented as a collection, or “bag”, of the words it contains along with the frequency with which each word appears. The order of words is ignored.

For example, consider a text article comprised of these two sentences:

Sentence 1: “Zahra is sick with a fever.”

Sentence 2: “Arun is happy he is not sick with a fever.”

This text contains a total of ten distinct words: “Zahra”, “is”, “sick”, “happy”, “with”, “a”, “fever”, “not”, “Arun”, “he”. Each sentence in the text is represented as a vector, where each index in the vector indicates the frequency with which one particular word appears in that sentence, as illustrated below.

 

 

 

With this technique, each sentence is represented by a vector, as follows:

“Zahra is sick with a fever.”

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“Arun is happy he is not sick with a fever.”

[0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
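The two vectors above can be reproduced with a few lines of Python. This is a minimal sketch of the Bag of Words idea only: the vocabulary order is fixed by hand to match the word list given earlier, and the simple regular-expression tokenizer is an assumption of this example.

```python
import re

# Vocabulary in the same order as the word list in the text.
VOCAB = ["zahra", "is", "sick", "happy", "with", "a", "fever", "not", "arun", "he"]

def bag_of_words(sentence, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tokens.count(word) for word in vocab]

v1 = bag_of_words("Zahra is sick with a fever.")
v2 = bag_of_words("Arun is happy he is not sick with a fever.")
print(v1)  # [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
print(v2)  # [0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
```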

With the Annotated Corpus vectorized with this technique, the data was used to train a logistic regression classifier model. The trained model was then used with the vectorized data of the News Article Corpus, to classify each article into Positive and Negative conflict categories.
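The training and classification steps can be sketched end to end on a toy dataset. Everything here is an illustrative assumption: the six "articles" are invented one-line headlines, and the logistic regression is a small hand-written gradient-descent version rather than the library model the project would have used. The learning rate and number of epochs are arbitrary.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text, vocab):
    """Bag of Words vector; words outside the vocabulary are ignored."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias)
            error = p - yi
            weights = [w - lr * error * x for w, x in zip(weights, xi)]
            bias -= lr * error
    return weights, bias

def predict(text, vocab, weights, bias):
    x = vectorize(text, vocab)
    p = sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)
    return "Positive" if p >= 0.5 else "Negative"

# Invented, hand-labeled examples (1 = conflict news, 0 = no conflict news).
annotated = [
    ("farmers protest land acquisition", 1),
    ("tiger attack near forest village", 1),
    ("villagers clash over land ownership", 1),
    ("local elections results announced", 0),
    ("cricket team wins final match", 0),
    ("new film released this week", 0),
]
vocab = sorted({tok for text, _ in annotated for tok in tokenize(text)})
X = [vectorize(text, vocab) for text, _ in annotated]
y = [label for _, label in annotated]
weights, bias = train(X, y)

print(predict("protest over forest land", vocab, weights, bias))    # Positive
print(predict("elections results this week", vocab, weights, bias)) # Negative
```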

The accuracy of the classification model was measured by looking at the percentage of the following:

  • True Positive: Articles correctly classified as relating to land conflicts
  • False Positive: Articles incorrectly classified as relating to land conflicts
  • True Negative: Articles correctly classified as not being related to land conflicts
  • False Negative: Articles incorrectly classified as not being related to land conflicts

 

The “precision” of the model indicates how many of those articles classified to be about the land conflict were actually about land conflict. The “recall” of the model indicates how many of the articles that were actually about the land conflict were categorized correctly. An f1-score was calculated from the precision and recall scores.
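These measures follow directly from the four counts listed above. The function below is a small sketch of the standard formulas. The counts passed to it are made up for illustration and are not the project's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# Made-up counts, for illustration only.
precision, recall, f1, accuracy = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

The f1-score is the harmonic mean of precision and recall, so it is high only when both are high.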

The trained logistic regression model successfully classified the news articles with precision, recall, and f1-score of 98% or greater. This indicates that the model produced a low number of false positives and false negatives.

 

Classification report using a test dataset and logistic regression model

 

 

Categorize by Land Conflict Events

The objective of this phase was to build a model to identify the set of conflict events referred to in the collection of positive conflict articles and then to classify each positive conflict article accordingly.

A word cloud of the articles in the Gold Standard Corpus gives a sense of the content covered in the articles.

A topic model was built to discover the set of conflict topics that occur in the Positive conflict articles. We chose a semi-supervised approach to topic modeling to maximize the accuracy of the classification process. We chose to use CorEx (Correlation Explanation), a semi-supervised topic model that allows domain knowledge, as specified by relevant keywords acting as “anchors”, to guide the topic analysis.

To align with the Land Conflicts Policies provided by WRI, seven relevant core land conflict topics were specified. For each topic, correlated keywords were specified as “anchors” for the topic.

 

 

 

The trained topic model provided 3 words for each of the seven topics:

  • Topic #1: land, resettlement, degradation
  • Topic #2: crops, farm, agriculture
  • Topic #3: mining, coal, sand
  • Topic #4: forest, trees, deforestation
  • Topic #5: animal, attacked, tiger
  • Topic #6: drought, climate change, rain
  • Topic #7: water, drinking, dams

The resulting topic model is 93% accurate. This scatter plot uses word representations to provide a visualization of the model’s classification of the Gold Standard Corpus and hand-labeled positive conflict articles.

 

Visualization of the topic classification of the Gold Standard Corpus and Positive Conflict Articles

 

 

Identify the Actors, Actions, Scale, Locations, and Dates

The objective of this phase was to build a model to identify the actors, actions, scale, locations, and dates in each positive conflict article.

Typically, names, places, and famous landmarks are identified through Named Entity Recognition (NER). Recognition of such standard entities is built into SpaCy’s NER package, with which our model detected the locations and dates in the positive conflict articles. The specialized content of the news articles required further training with “custom entities”, those particular to this context of land conflicts.

All the positive conflict articles in the Annotated Corpus were manually labeled for “custom entities”:

  • Actors: Such as “Government”, “Farmer”, “Police”, “Rains”, “Lion”
  • Actions: Such as “protest”, “attack”, “killed”
  • Numbers: Number of people affected by a conflict

This example shows how this labeling looks for some text in one article:

 

 

These labeled positive conflict articles were used to train our custom entity recognizer model. That model was then used to find and label the custom entities in the news articles in the News Article Corpus.

 

Match Conflicts to Relevant Policies

The objective of this phase was to build a model to match each processed positive conflict article to any relevant policies.

The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

Excerpt of a 2001 policy document related to agriculture

 

 

A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

Cosine similarity calculates similarity by measuring the cosine of an angle between two vectors. Using the vectorized text of the articles and the policy documents that had been generated in the previous phases as described above, the model generated a collection of matches between articles and policies.
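A minimal sketch of this matching step is shown below. It assumes plain word-count vectors and uses two invented one-line "policy" texts with placeholder names, since the actual policy documents and vectorization details are not reproduced here. The helper `best_policy` is a hypothetical name introduced for this example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Word-count vector of a text, stored as a Counter."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_policy(article, policies):
    """Name of the policy whose text is most similar to the article."""
    art = vectorize(article)
    return max(policies,
               key=lambda name: cosine_similarity(art, vectorize(policies[name])))

# Invented placeholder texts, not excerpts from real policy documents.
policies = {
    "forest policy (made-up text)": "protection of forest land and trees from felling",
    "farming policy (made-up text)": "rights of farmers over seeds crops and plant varieties",
}
article = "Farmers protest after losing rights over their seeds and crops"
match = best_policy(article, policies)
print(match)  # farming policy (made-up text)
```

Because the vectors are normalized by their lengths, a long policy document and a short news article can still score as similar when they share the same vocabulary in similar proportions.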

 

Visualization of Conflict Event and Policy Matching

The objective of this phase was to build a web-based tool for the visualization of the conflict event and policy matches.

An application was created using the Plotly Python Open Source Graphing Library. The web-based tool houses the models and allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The map displays land conflict events detected in the News Article Corpus for the selected years and regions of India.

Conflict events are displayed as color-coded dots on a map. The colors correspond to specific conflict categories, such as “Agriculture” and “Environmental”, and actors, such as “Government”, “Rebels”, and “Civilian”.

In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

 

By selecting a particular category from the right column, only those conflicts related to that category are displayed on the map. Here only the Agriculture-related subset of the events shown in the previous example is displayed.

 

 

News articles from the selected years and regions are displayed below the map. When a particular article is selected, the location of the event is shown on the map. The text of the article is displayed along with policies matched to the event by the underlying models, as seen in the example below of a 2018 agriculture-related conflict in the Andhra Pradesh region.

 

 

Here is a closer look at the article and matched policies in the example above.

 

 

 

Next Steps

This overview describes the results of a pilot project to use natural language processing techniques to identify land conflict events described in news articles and match them to relevant government policies. The project demonstrated that NLP techniques can be successfully deployed to meet this objective.

Potential improvements include refinement of the models and further development of the visualization tool. Opportunities to scale the project include building the library of news articles with those published from additional years and sources, adding to the database of policies, and expanding the geographic focus beyond India.

Opportunities to improve and scale the pilot project

 

Improvements
  • Refine models
  • Further development of visualization tool

 

Scale
  • Expand library of articles with content from additional years and sources
  • Expand the database of policies
  • Expand the geographic focus beyond India

 

 

About the Authors

  • Laura Clark Murray is the Chief Partnership & Strategy Officer at Omdena. Contact: laura@omdena.com
  • Nikhel Gupta is a physicist, a Postdoctoral Fellow at the University of Melbourne, and a machine learning engineer with Omdena.
  • Joanne Burke is a data scientist with MUFG and a machine learning engineer with Omdena.
  • Rishika Rupam is a Data and AI Researcher with Tilkal and a machine learning engineer with Omdena.
  • Zaheeda Tshankie is a Junior Data Scientist with Telkom and a machine learning engineer with Omdena.

 

Omdena Project Team

Kulsoom Abdullah, Joanne Burke, Antonia Calvi, Dennis Dondergoor, Tomasz Grzegorzek, Nikhel Gupta, Sai Tanya Kumbharageri, Michael Lerner, Irene Nanduttu, Kali Prasad, Jose Manuel Ramirez R., Rishika Rupam, Saurav Suresh, Shivam Swarnkar, Jyothsna sai Tagirisa, Elizabeth Tischenko, Carlos Arturo Pimentel Trujillo, Zaheeda Tshankie, Gabriela Urquieta

 

Partners

This project was done in collaboration with Kathleen Buckingham and John Brandt, our partners with the World Resources Institute (WRI).

 

 

About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through global bottom-up collaboration. Omdena is a partner of the United Nations AI for Good Global Summit 2020.

Using Topic Modeling and Coreference Resolution to Identify Land Conflicts and Its Causes


Improving the accuracy score from 83% to 93% to identify land conflict topics in news articles.

 

Identifying environmental conflict events in India using news media articles

 

Part of this project was to scrape news media articles to identify environmental conflict events such as resource conflicts, land appropriation, human-wildlife conflict, and supply chain issues.

With an initial focus on India, we also connected conflict events to their jurisdictional policies to identify how to resolve those conflicts faster or to identify a gap in legislation.

Part of the pipeline for building this language model was a semi-supervised attempt at improving topic modeling performance to increase environmental sustainability, whose process and outcome are available here.

In short, in order to make this Topic Modeling model robust, Coreference Resolution was suggested as one of the possible additions.

 

The Solution

What exactly is Coreference Resolution?

Coreference resolution is the task of finding all expressions that refer to the same entity in a text (1)

 

Use Cases

  1. In the context of this project, Coreference Resolution is best used to improve topic modeling performance by replacing references with the entity they refer to, so that the text better reflects its actual meaning. This increases the Tf-Idf of generalized entities and removes ambiguous words that are meaningless for classification.
  2. Another use case is to use the coreference-resolved text as additional features, along with Named Entity Recognition tags, in any classification approach. A one-hot-encoded version of unique entities can be used as input to factorization machines or other approaches for sparse modeling.
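
To illustrate the first use case, here is a minimal sketch in plain Python. The sample sentences and the hand-written "resolved" version are invented for illustration; a real pipeline would obtain the resolved text from a coreference package. The sketch only counts term frequency, which is the TF part of Tf-Idf.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical example text (not taken from the project data).
original = "The farmers blocked the road. They said they had lost their land."
# The same text after coreference resolution, written by hand for this sketch.
resolved = ("The farmers blocked the road. The farmers said the farmers "
            "had lost the farmers' land.")

before = term_counts(original)
after = term_counts(resolved)

# The entity term becomes more frequent and the ambiguous pronoun disappears.
print("farmers:", before["farmers"], "->", after["farmers"])
print("they:", before["they"], "->", after["they"])
```

After resolution the count for the entity goes up while the pronoun, which carries no topic information on its own, drops out of the vocabulary.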

 

Which packages are available to implement it?

 

An interpretation of a girl with magnifying glass looking for python packages

Exploring almost every available python package out there.

 

We toyed around with some packages which seemed good in theory but were rather challenging to apply to our specific task. We needed a package that would be user-friendly, as a script would have to be developed for 28 people to take and be able to apply without much struggle.

We tried NeuralCoref, Stanford NLP, Apache OpenNLP, and AllenNLP. After trying out each package, I personally preferred AllenNLP, but as a team, we decided to use NeuralCoref with a short but effective script written by one of the collaborators, Srijha Kalyan.

The code was applied to the article data which was annotated by fellow collaborators from the Annotation Task Group. This resulted in a CSV file with the original article titles, the original article text, and a new column of Coreference article text; not as chains but in the same written format as the original article text.
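
The team's script is not reproduced here, but the overall shape of this step can be sketched as follows. This is a minimal sketch under stated assumptions: the column names (`title`, `text`, `coref_text`) and the helper functions are hypothetical, and a placeholder resolver stands in for NeuralCoref so that the example runs without extra packages.

```python
import csv
import io


def add_coref_column(rows, resolve, text_field="text", out_field="coref_text"):
    """Return copies of the rows with an extra column holding the resolved text."""
    return [{**row, out_field: resolve(row[text_field])} for row in rows]


def write_csv(rows, handle):
    """Write a list of dicts to an open file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Placeholder resolver (an assumption for this sketch).
# With NeuralCoref installed on spaCy 2.x, one would instead pass something like
#   lambda text: nlp(text)._.coref_resolved
# after calling neuralcoref.add_to_pipe(nlp).
def dummy_resolve(text):
    return text.replace("They", "The villagers")


# Hypothetical article row, not taken from the project data.
articles = [{"title": "Land dispute",
             "text": "The villagers protested. They blocked the highway."}]
resolved_rows = add_coref_column(articles, dummy_resolve)

buffer = io.StringIO()
write_csv(resolved_rows, buffer)
print(buffer.getvalue())
```

Passing the resolver in as a function keeps the CSV handling independent of whichever coreference package is used.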

 

An image of a table containing various fields of data regarding land conflicts

 

The output was then sent to the Topic Modeling Task Team, which at that point was sitting on an accuracy of 83%. With the Coreference Resolution data, the accuracy jumped to 93%.

That’s a 10-percentage-point improvement (roughly 12% in relative terms)! All the hard work and hours were clearly worth it!

 

 

 


Named Entity Recognition with SpaCy to Identify Actors and Actions in News Articles


Identifying actors and actions in news articles about land conflicts in India. The work has been part of an Omdena AI project with the World Resources Institute on identifying land use conflicts and matching them with mediating government policies.

 

Suppose we have the following excerpt from a news article:

 

 

We want to identify within the article the following key elements (entities):

  • Actor — who/what are the main actor(s) in the conflict referred to in this article?
  • Action — what is the main action or event of a conflict in this article?

As human beings, this task is fairly simple — we would identify ‘tiger’, ‘farmer’ and ‘forest officials’ as the ‘actors’ and ‘attacked’ as the ‘action’. Things get a bit murky when it comes to defining ‘action’ in certain contexts (would you identify ‘tranquilize’ as the main action or not?). Overall humans would more or less agree on what the ‘actor’ and ‘action’ items are.

A model that can do this will be deemed a successful custom named entity recognizer built with SpaCy.

 

 

Pretty good, don’t you think? If you are curious how this works, read ahead!

 

The Problem: Resolving land conflicts in India 

Typically, Named Entity Recognition (NER) happens in the context of identifying names, places, famous landmarks, year, etc. These entities come built-in with standard Named Entity Recognition packages like SpaCy, NLTK, AllenNLP.

The challenge for us was to create a custom entity recognizer as our entities were ‘non-standard’ and needed to be adapted to the AI challenge.

The World Resources Institute (WRI) had approached Omdena to further its project on identifying land-related environmental conflicts in India, which affect more than 7 million people.

 

 

The idea was to identify where the conflicts were happening, which groups of people they were affecting, and the scale of the conflicts, and to classify the kinds of conflicts and match them with the related governmental policies to resolve them faster.

 

Among these, identifying groups of people, scale, action, location, and date came under the scope of Named Entity Recognition using SpaCy.

In this article, we will deal with identifying actors, actions, and scales. Location and date are standard entities that can be obtained by plug-and-playing an off-the-shelf entity recognizer.

 

The data

The raw data initially was about 65000 news articles from Indian newspapers obtained from GDELT. In its own words, GDELT is ‘Creating a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day and that stretches back to January 1, 1979, through present day.’ All the text was either originally in English or translated to English by GDELT.

 

The Solution: Coreference resolution

An important milestone before we started our labeling process was recognizing the need for coreference resolution. Consider this fictional text:

‘Farmers were caught in a flood in Maharashtra. Kabir Narayan and Kamal Bashir were tending to their field when a dam burst and swept them away’.

Here, ‘Farmers’, ‘Kabir Narayan’, and ‘Kamal Bashir’ refer to the same entity. However, an entity recognizer will typically treat them as three separate entities. We wanted our entity recognizer to identify them all as ‘farmers’. This is where coreference resolution comes in. Coreference resolution is an essential pre-step in the entity recognition process: it identifies ‘Kabir Narayan’ and ‘Kamal Bashir’ as referring to the same entity, ‘farmers’, that occurs before. We won’t go into any depth about how coreference resolution works. If you’re interested, here’s a useful blog that explains coreference resolution and also shows how to use spaCy’s coreference package, which is also what we used in our solution. Here’s also a blog by Zaheeda Tshankie, the task manager for the coreference resolution task, with her take on what coreference resolution looked like in this particular case.
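
To make the replacement step concrete, here is a minimal sketch in plain Python applied to the fictional text above. The cluster is written by hand and the function name `resolve_mentions` is hypothetical; in the actual pipeline a coreference model finds the mentions and produces the resolved text.

```python
def resolve_mentions(text, clusters):
    """Replace every mention span with its cluster's representative phrase.

    clusters: list of (representative, [(start, end), ...]) using character offsets.
    """
    replacements = [
        (start, end, representative)
        for representative, spans in clusters
        for start, end in spans
    ]
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, representative in sorted(replacements, reverse=True):
        text = text[:start] + representative + text[end:]
    return text


text = ("Farmers were caught in a flood in Maharashtra. "
        "Kabir Narayan and Kamal Bashir were tending to their field "
        "when a dam burst and swept them away.")


def span_of(phrase):
    """Character offsets of the first occurrence of phrase in the text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# Hand-written cluster for this sketch; a coreference model would produce it.
clusters = [("farmers", [span_of("Kabir Narayan and Kamal Bashir"), span_of("them")])]

resolved = resolve_mentions(text, clusters)
print(resolved)
```

An entity recognizer run on the resolved text then sees the same surface form, ‘farmers’, for every mention.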

Some subtleties regarding entity labeling.

The next important step in this task was to manually label our entities. To train a custom model, spaCy’s advice is to use ‘a few hundred’ labeled samples of text. In our case, we had manually classified about 1300 articles as either ‘positive’, i.e. as indeed referring to an environmental conflict, or ‘negative’. In the beginning, we aimed to label 500 of these with our custom entities. However, we realized that this was not the easiest or the most suitable task. Here is a subtlety specific to entity recognition tasks: not all texts are suitable for all entity identification. For example, consider this text: ‘India is home to several hundred species of birds’. In this piece of text, it is difficult to identify the ‘action’. This is a descriptive text with no conflict that can be labeled as an ‘action’. For this reason, we decided to restrict our attention to the positive articles only. There were 147 of them.

There is a further subtlety regarding potentially nebulous entities such as ‘action’. From the beginning, the instructions were clear: we were to identify and label only the ‘main action’ of any news article. But, as we realized, this can be a fairly subjective task. For instance, consider the following text.

 

A paragraph explaining about a topic

 

During the labeling, we encountered articles such as the one above. One example of labeling is as shown. This is not incorrect; however, I would probably have labeled this differently, marking only ‘killed’ as the ‘action’, and ‘elephants’ and ‘tigress’ as ‘actors’. When we are working with several people during labeling, we have to account for the fact that people may misunderstand rules, through no fault of their own. Rather, the onus is on the rules, and the more precise the rules are, the better the labeling process goes. This was a lesson well learned. However, sometimes even when the rules are precise, it is still possible to hit some ‘grey areas’ where it’s difficult to be completely objective and the subjectivity of the labeler comes into play. This is an inherent feature of ‘ambiguous’ labels like action and I am not sure if I have a solution to this. If you have any thoughts on this, please do leave them in the comments.

 

Pre-built entity recognizers

There are several libraries that have been pre-trained for Named Entity Recognition, such as spaCy, AllenNLP, NLTK, and Stanford CoreNLP. We opted for spaCy for two main reasons: speed, and the fact that we can add NeuralCoref, a coreference resolution component, to the pipeline for training.

If you would like a more detailed comparison of Named Entity Recognition libraries such as spaCy, here’s a blog on it.

 

Using Doccano

In order to make the labeling task as easy and efficient as possible, we decided to use Doccano’s annotating tool. Their description is as follows — ‘Doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence to sequence. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotation. You can build dataset in hours.’.

Here is what it looks like in practice.

 

 

Converting JSONL to SpaCy format

Doccano provides entities in a JSONL (JSON Lines) format, with one JSON record per line, and we needed to convert it to a tuple format that spaCy accepts. In the following, you can see the code. Credits to Tomasz Grzegorzek.

import json

# Converting JSONL files to spaCy tuples format
def convert_doccano_to_spacy(filepath):
    with open(filepath, 'rb') as fp:
        data = fp.readlines()
    training_data = []
    for record in data:
        if not record.strip():
            continue  # skip blank lines, e.g. at the end of the file
        entities = []
        read_record = json.loads(record)
        text = read_record['text']
        entities_record = read_record['labels']
        # each label is a [start, end, label] triple of character offsets
        for start, end, label in entities_record:
            entities.append((start, end, label))
        training_data.append((text, {"entities": entities}))
    return training_data
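
Before training, it can help to check that the converted character offsets actually point at the intended words. The following is a minimal sketch; the helper name `describe_spans`, the sample sentence, and the label names `ACTOR` and `ACTION` are assumptions made for illustration and may differ from the labels used in the project.

```python
def describe_spans(training_data):
    """List (label, covered text) pairs and fail loudly on out-of-range offsets."""
    described = []
    for text, annotations in training_data:
        for start, end, label in annotations["entities"]:
            if not 0 <= start < end <= len(text):
                raise ValueError(
                    f"bad span {(start, end)} for text of length {len(text)}")
            described.append((label, text[start:end]))
    return described


# Hypothetical record in the tuple format produced by the conversion step.
sample = [(
    "A tiger attacked a farmer.",
    {"entities": [(2, 7, "ACTOR"), (8, 16, "ACTION"), (19, 25, "ACTOR")]},
)]

for label, covered in describe_spans(sample):
    print(label, repr(covered))
```

Printing the covered text next to each label makes off-by-one offsets or swapped labels easy to spot by eye.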

 

Training the model

Here we used the following block of code, inspired by this blog.

 

import random
import spacy  # this code uses the spaCy v2 API (create_pipe, begin_training)

TRAIN_DATA = train  # `train` holds the examples in spaCy tuple format

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # nlp.add_pipe(nlp.create_pipe('sentencizer'))  # sentencizer as a prerequisite to coref
    # neuralcoref.add_to_pipe(nlp)  # adding coreference resolution to the pipeline

    # register every entity label that occurs in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout: make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

custom_ner = train_spacy(TRAIN_DATA, 20)
# Save our trained model
custom_ner.to_disk('Custom_NER_Model')

 

Conclusion

The training gave us some pretty good results. The model was especially good at picking up ‘actor’.

 

 

There were failures by the model, too. Here is an example.

 

 

In the example above, the model misses ‘massive protest’ as the important action and instead identifies a long piece of text (which could be considered a secondary action) as the main action.

As mentioned before, defining ‘action’ is ambiguous even for humans, so it’s no wonder that the model got it wrong a few times. I do believe that with stricter rules for labeling, the model would have performed better.
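
One way to put numbers on observations such as the model being better at ‘actor’ than at ‘action’ is to score predictions per label. The sketch below is an assumption about how this could be done, not the evaluation used in the project: it counts only exact span matches, and the spans and label names are hypothetical.

```python
from collections import defaultdict


def per_label_scores(gold, predicted):
    """Exact-match precision and recall per label.

    gold, predicted: lists of (start, end, label) tuples for one text.
    """
    gold_set, pred_set = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred_set:
        counts[span[2]]["tp" if span in gold_set else "fp"] += 1
    for span in gold_set - pred_set:
        counts[span[2]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        gold_total = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_total if predicted_total else 0.0
        recall = c["tp"] / gold_total if gold_total else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical spans for a single article: the actor is found exactly,
# while the predicted action is a different, much longer stretch of text.
gold = [(0, 9, "ACTOR"), (20, 34, "ACTION")]
predicted = [(0, 9, "ACTOR"), (40, 90, "ACTION")]

scores = per_label_scores(gold, predicted)
for label, (precision, recall) in sorted(scores.items()):
    print(label, precision, recall)
```

Exact matching is strict for a fuzzy label like ‘action’; a partial-overlap criterion would be a more forgiving alternative.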

 

 
