Overcoming Data Challenges through the Power of Diverse & Collaborative Teams

In this demo day, we talked about the inevitable data challenges and roadblocks that come up in real-world AI projects. The insights shared came from our experience with more than 20 AI projects, working with partners including the UN Refugee Agency (UNHCR), the World Resources Institute, the World Energy Council, and numerous NGOs and corporations.

Omdena is a collaborative platform to build innovative, ethical, and efficient AI solutions to real-world problems. Since our founding in May 2019, over 1250 AI experts from more than 80 countries have come together on Omdena projects to address significant issues related to hunger, sexual harassment, land conflicts, gang violence, wildfire prevention, and energy poverty.

We’ve seen that the way we approach AI development, through bottom-up collaboration with diverse team members, fosters innovation and creativity, which in turn breaks down data roadblocks. Innovation is inherent in the Omdena process.

We shared three Omdena projects that serve as case studies for these innovative approaches to tackling data challenges.

 

Data Roadblock 1: Incomplete Data Sets

In the real world, datasets are rarely complete. We find that having large teams of dozens of people means that data gathering, cleaning, and wrangling happen at phenomenal speed. And by taking a bottom-up approach, we have multiple sub-teams looking at data problems from different angles, which allows innovative approaches to be explored.

In the following case study, the Omdena team worked out ways to identify safe routes in a city in the aftermath of an earthquake, where the relevant data sets were inconsistent and unreliable.

 

Case Study: Disaster Response: Improving the Aftermath Management of an Earthquake

In collaboration with Istanbul’s Impact Hub innovation center, Omdena data scientists combined satellite imagery of Istanbul with street map data in order to build a tool that facilitates family reunification by indicating the shortest and safest route between two points after an earthquake.

“Omdena’s approach to AI development is by far the best that I have seen in 2019” — Semih Boyaci, Co-Founder Impact Hub Istanbul

You can learn more about this project here:

 

 

Data Roadblock 2: No Data

We don’t see the lack of data as a showstopper. On projects without data, the team starts by asking: What do we need to know to address the problem? Where might that data live? If it doesn’t exist, how can we create it from something that does? Here the diversity of the team members is very powerful.

We’ve seen time and again the impact of bringing together people with vastly different professional and life experiences. Our teams are typically 30% or more female. On any project, we’ll have on average 14 countries represented. Our collaborators range in age from 17 to 65. Not only does this diversity lead to ethical and trusted solutions, but it also fosters creativity and alternative ideas about what data is relevant and where to find it.

In the following project, we looked at how to assess post-traumatic stress disorder among those who have suffered trauma in low-resource environments. In this case, the team started with no data in hand.

 

Case Study: Building a Chatbot for Post-Traumatic Stress Disorder (PTSD) Assessment

32 Omdena collaborators developed a machine learning-driven chatbot for PTSD assessment in war and refugee zones.

 

The unique aspect of the project was that we did not start with a dataset.

Through the collaborative efforts of the project community, the team identified and annotated suitable patient data. The team applied linear classifiers with natural language processing (NLP) for PTSD risk assessment and used transfer learning for data augmentation.

You can learn more about this project here:

 

Data Roadblock 3: Disparate Data Sources

Relevant data doesn’t typically come packaged in just one form. We often need to meld disparate data sources to get at a solution. Through collaboration, sub-teams focused on separate data and AI techniques come together to integrate those efforts to derive insights about the problem.

In the following project, the goal was to uncover domestic violence in India that was hidden by COVID-19 lockdowns. Among the many challenges the team addressed was the integration of data culled from disparate sources.

 

Case Study: Analyzing Domestic Violence through Natural Language Processing

This project was done with the award-winning Red Dot Foundation. Within Omdena’s collaborative platform, the team looked to craft a dataset to reveal domestic violence and online harassment patterns in India during COVID-19 lockdowns. The AI experts scraped data from news articles as well as social media and applied various natural language processing (NLP) techniques, such as topic modeling, document annotation, and stacked machine learning models.

 

 

You can learn more about this and related projects here:

 

 

 

More about Omdena

Omdena is the collaborative platform to build innovative, ethical, and efficient AI and Data Science solutions to real-world problems. 

| Demo Day Insights | Matching Land Conflict Events to Government Policies via Machine Learning

By Laura Clark Murray, Joanne Burke, and Rishika Rupam

 

A team of AI experts and data scientists from 12 countries on 4 continents worked collaboratively with the World Resources Institute (WRI) to support efforts to resolve land conflicts and prevent land degradation.

The Problem: Land conflicts get in the way of land restoration

Among its many initiatives, WRI, a global research organization, is leading the way on land restoration — restoring land that has lost its natural productivity and is considered degraded. According to WRI, land degradation reduces the productivity of land, threatening the economy and people’s livelihoods. This can lead to reduced availability of food, water, and energy, and contribute to climate change.

Restoration can return vitality to the land, making it safe for humans, wildlife, and plant communities. While significant restoration efforts are underway around the world, local conflicts get in the way. According to John Brandt of WRI, “Land conflict, especially conflict over land tenure, is a really large barrier to the work that we do around implementing a sustainable land use agenda. Without having clear tenure or ownership of land, long-term solutions, such as forest and landscape restoration, often are not economically viable.”

 

Photo credit: India’s Ministry of Environment, Forest and Climate Change

 

And though governments have instituted policies to deal with land conflicts, knowing where conflicts are underway and how each might be addressed is not a simple task. Says Brandt, “Getting data on where these land conflicts, land degradation, and land grabs occur is often very difficult because they tend to happen in remote areas with very strong language barriers and strong barriers around scale. Events occur in a very distributed manner.” WRI turned to Omdena to use AI and natural language processing techniques to tackle this problem.

 

The Project Goal: Identify news articles about land conflicts and match them to relevant government policies

 

Impact

“We’re very excited that the results from this partnership were very accurate and very useful to us.

We’re currently scaling up the results to develop sub-national indices of environmental conflict for both Brazil and Indonesia, as well as validating the results in India with data collected in the field by our partner organizations. This data can help supply chain professionals mitigate risk in regards to product-sourcing. The data can also help policymakers who are engaged in active management to think about what works and where those things work.” — John Brandt, World Resources Institute.

 

The Use Case: Land Conflicts in India

In India, the government has committed 26 million hectares of land for restoration by the year 2030. India is home to 1.35 billion people, 28 states, 22 languages, and more than 1,000 dialects. In a land as vast and varied as India, gathering and collating information about land conflicts is a monumental task.

The team looked to news stories, with a collection of 65,000 articles from India for the years 2017–2018, extracted by WRI from GDELT, the Global Database of Events, Language, and Tone Project.

 

Identifying news articles about land conflicts

Land conflicts around land ownership include those between the government and the public, as well as personal conflicts between landowners. Other types of conflicts include those between humans and animals, such as humans invading habitats of tigers, leopards, or elephants, and environmental conflicts, such as floods, droughts, and cyclones.

 

 

The team used natural language processing (NLP) techniques to classify each news article in the 65,000-article collection as pertaining to land conflict or not. While this problem could be tackled without any automation tools, it would take human beings years to read and assess every article, whereas the right machine learning or deep learning model can do it in seconds.

A subset of 1,600 newspaper articles from the collection was hand-labeled as “positive” or “negative” to serve as examples of proper classification. For example, an article about a tiger attack would be hand-labeled as “positive”, while an article about local elections would be labeled as “negative”.

To prepare the remaining 63,400 articles for the AI pipeline, each article was pre-processed to remove stop words, such as “the” and “in”, and to lemmatize words, returning them to their root form. Coreference resolution pre-processing was used to increase accuracy. A topic modeling approach was used to further categorize the “positive” articles by the type of conflict, such as land, forest, wildlife, drought, farming, mining, or water. With refinement, the classification model achieved an accuracy of 97%.
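To make the pipeline concrete, here is a minimal sketch of how such a classifier could be assembled with spaCy and scikit-learn. It is illustrative only: the variable names (labeled_articles, labels) and the choice of logistic regression are our assumptions, not the team’s exact setup.

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# lightweight spaCy pipeline for lemmatization and stop-word removal
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def preprocess(text):
    # keep the lemma of every alphabetic, non-stop-word token
    doc = nlp(text)
    return ' '.join(tok.lemma_.lower() for tok in doc
                    if tok.is_alpha and not tok.is_stop)

# labeled_articles / labels: the 1,600 hand-labeled examples (assumed names)
texts = [preprocess(article) for article in labeled_articles]
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)

vectorizer = TfidfVectorizer(max_features=20000)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(vectorizer.fit_transform(X_train), y_train)
print(classifier.score(vectorizer.transform(X_test), y_test))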

 

 

With the subset of land conflict articles successfully identified, NLP models were built to identify four key components within each article: actors, quantities, events, and locations. To train the model, the team hand-labeled 147 articles with these components. Using an approach called Named Entity Recognition, the model processed the database of “positive” articles to flag these four components.

 

 

 

Matching land conflict articles to government policies

Numerous government policies exist to deal with land conflicts in India. The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

 

A text similarity model was built to compare two text documents and determine how close they are in context and meaning. The model used the cosine similarity metric, which measures the similarity of two documents irrespective of their size.
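As an illustration, a minimal version of this matching step can be written with scikit-learn; the variable names (article_text, policy_texts) are placeholders for the project’s actual data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# article_text: one land conflict article; policy_texts: the 19 policy documents
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform([article_text] + policy_texts)

# cosine similarity of the article (row 0) against every policy document
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
best_match = scores.argmax()   # index of the closest-matching policy
print(best_match, scores[best_match])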

The Omdena team built a visual dashboard to display the land conflict events and the matching government policies. In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

Underlying this dashboard are the NLP models that classify news articles related to land conflict and land degradation and match them to the appropriate government policies.

 

 

The results of this pilot project have been used by the World Resources Institute to inform their next stage of development.

Join one of our upcoming demo days to see the power of Collaborative AI in action.

Want to watch the full demo day?

Check out the entire recording (including a live demonstration of the tool).

 

NLP Clustering to Understand Social Barriers Towards Energy Transition | World Energy Council

Using NLP clustering to better understand the thoughts, concerns, and sentiments of citizens in the USA, UK, Nigeria, and India about the energy transition and the decarbonization of their economies. The following article shares observational results on how citizens of the world perceive their role within the energy transition, including the associated social risks, opportunities, and costs.

The findings are part of a two-month Omdena AI project with the World Energy Council (WEC). They are observational rather than conclusive, given the complexity of the analysis scope.

 

The Project Goal

The aim was to find information that can help governments effectively involve people in the accelerating energy transition. The problem was quite complicated, and no data was provided to us, so we had to create our own dataset, analyze it, and provide WEC with insights. We started with a long list of open questions, such as:

  • What should our output look like?
  • What search terms would be useful to scrape data for?
  • What countries should be considered as our main focus?
  • Should we consider non-English languages as well and analyze them?
  • How much data per country will be enough?
  • Etc.

In order to meet the project deadline, we decided to go with the English language only and to come up with good working models.

 

The Solution

 

Getting data from Social Media

We scraped the following sources: Twitter, YouTube, Facebook, Reddit, and major newspapers specific to each country. The desired insights were to cover developed, developing, and under-developed countries, with the emphasis specifically on developing and under-developed countries.

The results discussed in this article were obtained from scraped tweet data for the USA, UK, India, and Nigeria, which together cover the three categories of developed, developing, and under-developed countries.

 

Our Approach: Trying different NLP techniques

We first gathered data by scraping tweets using several keywords we found, via Google Trends, to be important for specific countries. We removed stop words, applied stemming, stripped hashtags, punctuation, numbers, and mentions, and replaced URLs with _URL. We used TF-IDF vectorization for feature extraction. Below, we walk through the various steps taken to tackle the problem.
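A rough sketch of that cleaning and vectorization step is shown below; the regular expressions and variable names are illustrative rather than the project’s exact code.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_tweet(tweet):
    tweet = re.sub(r'https?://\S+', '_URL', tweet)   # replace URLs with _URL
    tweet = re.sub(r'[@#]\w+', '', tweet)            # remove mentions and hashtags
    tweet = re.sub(r'[^a-zA-Z_\s]', '', tweet)       # remove punctuation and numbers
    tokens = [stemmer.stem(tok) for tok in tweet.lower().split()
              if tok not in stop_words]              # stop words and stemming
    return ' '.join(tokens)

# tweets: the scraped tweets for one country (assumed name)
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(clean_tweet(t) for t in tweets)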

 

Approach 1: Sentiment Analysis (Unsatisfactory)

Sentiment analysis of short tweets comes with its own challenges, and the most important ones we faced in this project were:

  • Tags mean different things in different countries. #nolight can be Canadians complaining about the winter sunset, or Nigerians having a power cut.
  • Tags take a side. For example, #renewables is pro-green and #climatehoax is not. So positive sentiment on #renewables might not really tell us much.
  • A classifier model built on #climatechange and related tags does not work at all on anti-green tags such as #climatemyth.
  • Some anti-green tweets are full of happy emojis, which makes the sentiments unreliable.
  • The major tweeting countries are overwhelmingly positive. The distribution of climate change-related tweets across the world is not uniform, and some countries are much more prevalent in the dataset than others (Figure 1) [1].
  • Interpreting the outputs is hard: simply assigning a label to each tweet does not reveal insights about the barriers to the energy transition, so the interpretability of the model is very important.

Considering all the challenges discussed, the sentiment analysis of the tweets did not produce satisfactory results (Table 1), and we decided to test other models.

 

 

Figure 1: Number of climate change related tweets per country [1]

 

 

Table 1: Classifier accuracy for sentiment analysis of tweets data (USA)

 

 

Approach 2: Topic Modeling (Unsatisfactory) 

Topic modeling is an unsupervised NLP technique, requiring no data labeling, that provides a way to compare the strength of different topics and tells us which topics are more informative than others. Because tweets are short, it was hard to differentiate between topics and to assign a tweet to a specific topic using models such as LDA. Topic models tend to produce the best results when applied to texts that are not too short and that have a consistent structure.

 

1. Using a semi-supervised approach

We chose a semi-supervised topic modeling approach (CorEx) [2]. Since the data was very high-dimensional, we applied dimensionality reduction to remove noise and make the data interpretable. A permutation test was used to determine the optimum number of principal components required for PCA [3,4]. From the explained variance ratio plot, the cumulative explained variance line was not perfectly linear, but very close to a straight line.

Through permutation tests, we noticed that the mean explained variance ratio of the permuted matrices did not really differ from the explained variance ratio of the non-permuted matrix, which suggested that applying PCA to the correlated topic model’s results was not helpful at all.

 

 

 

 

This means each of the principal components contributes to the variance explanation almost equally, and there’s not much point in reducing the dimensions based on PCA.
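For reference, the permutation test can be sketched as follows: shuffle each feature column independently to destroy the correlations, re-fit PCA, and compare explained variance ratios. X stands for the feature matrix and is an assumed name, not the project’s exact code.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
orig_ratio = PCA().fit(X).explained_variance_ratio_

permuted_ratios = []
for _ in range(100):
    # permute every column independently to break inter-feature correlations
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    permuted_ratios.append(PCA().fit(X_perm).explained_variance_ratio_)

# components whose explained variance exceeds the permuted baseline are useful;
# here the two curves were nearly identical, so PCA added little
mean_perm = np.mean(permuted_ratios, axis=0)
print(int(np.sum(orig_ratio > mean_perm)))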

 

2. Identifying 20 important topics

The CorEx results showed that there are about 20 important topics and surfaced the most important words per topic. But how should we interpret the results?

The data was very high-dimensional and dimensionality reduction was not helpful at all. For example, if price, electricity, ticket, fuel, gas, and skepticism are the most important words for one topic, how do we understand the concerns of the people of that country? Is it fuel prices that concern them? Electricity prices, or ticket prices? Each topic could combine many different, possibly related words, and just looking at the important words in each topic does not reveal the story behind the data.

Bigrams or trigrams with topic models did not help much either, because the main keywords conveying the focus of a tweet do not always appear together.

 

 

 

 

Approach 3: Clustering (K-means & Hierarchical)

Both K-means and hierarchical clustering led to comparable results, showing clearly separated clusters. Because both models performed comparably, we derived all results using hierarchical clustering, which better shows the hierarchy of the clusters. Tweet data were collected for four countries, as discussed before, and the model was applied to each country’s data separately. To keep things brief, we only show the clustering results for India, but the insights across all countries are shown at the end of the article.
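A minimal sketch of this clustering step, assuming TF-IDF features and an illustrative cluster count, might look like this:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# cleaned_tweets: preprocessed tweets for one country (assumed name)
tfidf = TfidfVectorizer(max_features=5000).fit_transform(cleaned_tweets)

Z = linkage(tfidf.toarray(), method='ward')       # build the merge tree
labels = fcluster(Z, t=3, criterion='maxclust')   # cut it into 3 clusters
dendrogram(Z, truncate_mode='level', p=5)         # visualize the hierarchy
plt.show()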

 

 

 

 

Hierarchical Clustering Results

After finding clear clusters in the data, the next step was interpreting them by creating meaningful visualizations and insights. A combination of Scattertext, a co-occurrence graph, a dispersion plot, collocated word clouds, and top trigrams yielded very useful insights from the data.

An important lesson here is to always rely on a combination of plots for your interpretations instead of just one. Each type of plot visualizes one aspect of the data, and combining them creates a comprehensive, clear picture.

 

 

1. Using Scattertext

Scattertext is an excellent exploratory text analysis tool that visualizes the differences between the terms used by different documents in an interactive scatter plot.

Two types of plots were created, which were very helpful in interpreting the results.

1) Visualizing word embedding projections. This was explored using word association with a specific keyword. The keywords include the following: [Access, Availability, Affordability, Bills, Prices]. Interested readers can try more keywords using the code provided in this study.

2) In another plot, the uni-grams from the clustered tweets are selected and plotted using their dense-ranked, category-specific frequencies. We used the difference in dense ranks as the scoring function.

All the interactive plots are stored in HTML files and are available in the GitHub repository. If you click on the interactive version, the list of tweets containing each specific term can be explored. Note that hierarchical clustering is applied to the data first, and the clustered tweets are then given to Scattertext as input. You can gain further information by diving deep into these plots. The data used for creating these results can be found here, and the notebook used to cluster the data and create these scatter plots can be found here.
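A hedged sketch of producing one of these plots with Scattertext, assuming a DataFrame df with 'text' and 'cluster' columns (our assumption about the data layout, not the project’s exact code):

import scattertext as st
import spacy

nlp = spacy.load('en_core_web_sm')
# df: one row per tweet, with its hierarchical-cluster label (assumed layout)
corpus = st.CorpusFromPandas(df, category_col='cluster',
                             text_col='text', nlp=nlp).build()

html = st.produce_scattertext_explorer(
    corpus,
    category='cluster_1',             # terms scored for this cluster...
    category_name='Cluster 1',
    not_category_name='Cluster 2',    # ...against the other
    term_scorer=st.RankDifference(),  # difference in dense ranks as the score
    width_in_pixels=1000)
open('cluster1_vs_2.html', 'w').write(html)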

The following shows the interactive versions of all plots for various countries:

 

1.1. Rank and frequencies across different categories (India)

 

 

Figure 8. An example Scattertext plot showing positions of terms based on the dense ranks of their frequencies, for cluster 1 & 2. The scores are the difference between the terms’ dense ranks. The bluer terms are, the higher their association scores are for cluster 1. The redder the terms, the higher their association score is for cluster 2. See Cluster 1 vs 2 for an interactive version of this plot.

 

 

Figure 9. An example Scattertext plot showing positions of terms based on the dense ranks of their frequencies, for cluster 1 & 3. The scores are the difference between the terms’ dense ranks. The bluer terms are, the higher their association scores are for cluster 1. The redder the terms, the higher their association score is for cluster 3. See Cluster 1 vs 3 for an interactive version of this plot.

 

 

1.2. Word embedding projection plots using Scattertext (India)

 

 

Figure 10. An example Scattertext plot showing word associations to the term prices using spaCy’s pre-trained embedding vectors. This is used to see the terms most associated with the term prices. At the top right corner, we see the words most commonly associated with prices, such as electricity. If you click on the interactive version, the list of tweets with the terms can be explored. See Word Embedding: Prices for an interactive version of this plot.

 

 

Figure 11. An example Scattertext plot showing word associations to the term bills using spaCy’s pre-trained embedding vectors. This is used to see the terms most associated with the term bills. At the top right corner, we see the words most commonly associated with bills, such as electricity, prices, energy, and power. If you click on the interactive version, the list of tweets with the terms can be explored. See Word Embedding: Bills for an interactive version of this plot.

 

 

2. Twitter Insights (Price & Energy Transition Concerns)

 

2.1. India
  • Solar and wind don’t necessarily mean cheaper prices, as Germany’s experience shows: when Germany went all in on renewables, energy prices and carbon emissions went up.
  • Electricity prices can drop for people sourcing power from government-owned renewable sources, because those prices do not vary with oil and natural gas.
  • Renewable energy policy can lead to much lower electricity prices, a stronger, globally competitive economy, fewer fossil fuel imports, and as a result less pollution.
  • Putting a tax on coal and making open access a reality are two potential action areas to make renewable energy affordable.
  • Let oil prices increase and subsidies stop.
  • Many requests to replace fossil fuels with cleaner fuels such as crop stubble from farmers.
  • Cut oil imports and encourage renewable energies.
  • A lot of complaints regarding electricity shortages: lack of electricity for hours or days, electricity cuts, and electricity and water supply.
  • Fossil fuels are dirty and nuclear power is dangerous; therefore, we need to make renewable energy work.

 

2.2. Nigeria
  • People complain about the lack of constant electricity and the absence of business-friendly policy.
  • Enhancing the delivery of electricity in the country.
  • Whenever it rained, the electricity supply was cut off for days; lack of electricity every weekend, daily, and overnight; unstable electricity.
  • No water and no electricity.
  • The electricity sector is the third-largest oil-consuming sector.
  • Lots of worries and trouble regarding paying electricity bills.
  • Access to electricity is not for everyone.
  • Access to affordable, sustainable, renewable energy.
  • Renewable energy and water and waste management are some of Nigeria’s major partnership areas with Ghana.
  • Harnessing tidal or offshore wind energy, which is a clean and renewable source.
  • Lots of positive experiences and low prices with the use of solar power systems.

 

2.3. UK

  • Bringing down the prices of electricity and gas.
  • Having stable prices for electricity.
  • People prefer higher prices for gas than for electricity.
  • Need to think beyond electricity to effect the energy transition.
  • Renewables disrupt the electricity market, and politicians raising electricity prices to tackle the climate emergency is an awful policy.
  • A lot of requests for investment in renewable energies.
  • The transition to renewables is too slow.
  • Lots of discussion on whether it is good to replace the nuclear stations with renewables.
  • Whether the zero-carbon economy has any economic benefit for the UK.

 

2.4. USA

  • Slowing down climate change.
  • Market-based solutions for climate change.
  • Renewable energy infrastructure is lame and unreliable.
  • Renewables increase electricity prices and distort energy markets with favorable purchase agreements.
  • Many complaints regarding gas prices.
  • National security’s priority should be renewable energy, investing in its infrastructure and job programs.
  • Figure out how to store renewable energy and get rid of excess CO2 in the atmosphere.
  • Renewable energy represents a significant economic opportunity.

 

 

3. Weighing a word’s importance via Dispersion Plot

A word’s importance can be weighed by its dispersion in a corpus. Lexical dispersion is a measure of a word’s homogeneity across the parts of a corpus. The following plot shows where and how often each keyword occurs throughout the corpora of the different countries, including India, Nigeria, the UK, and the USA.
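Such a plot can be produced with NLTK’s dispersion_plot; the keyword list and corpus variable below are illustrative, not the project’s exact code.

import nltk

# all_text: the concatenated tweet corpus for one country (assumed name)
tokens = nltk.word_tokenize(all_text.lower())
nltk.Text(tokens).dispersion_plot(
    ['access', 'affordable', 'renewable', 'energy', 'electricity', 'power'])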

According to the dispersion plot below, access to electricity is an important concern for Nigeria, while this is not the case for the other three countries. How do we know that this access relates to electricity? The answer is in the Scattertext plots shown in the previous section: analyzing those plots together with the dispersion plot shows that the concern is electricity access.

Access to affordable renewable energy is a big concern in Nigeria and then India, while the affordability of renewable energy is not a problem for people in the UK and the USA. Affordability is a big concern for people in Nigeria, who have difficulty paying their electricity bills.

Energy, electricity, power, and renewables are also the topic of most of the discussions in all of these countries. But what aspects of each topic are of concern to each country? The answer is given in the previous section where we interpret the results of Scattertext plots.

 

 

Figure 12. Lexical dispersion for various keywords across different countries

 

 

4. Top Trigrams for Different Countries

 

 

Figure 13. Top twenty trigrams for India

 

 

As can be seen from the top 20 trigrams for India, the top concerns are Renewable energy, Renewable energy sector, Renewable energy capacity, Renewable energy sources, New renewable energy, and Clean renewable energy. These top concerns match the insights drawn from clustering in the previous section.
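Top trigrams like these can be extracted with a simple counting vectorizer; a minimal sketch, with cleaned_tweets as an assumed input, is:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(3, 3), stop_words='english')
counts = vectorizer.fit_transform(cleaned_tweets).sum(axis=0).A1
top20 = sorted(zip(vectorizer.get_feature_names_out(), counts),
               key=lambda pair: -pair[1])[:20]
for trigram, count in top20:
    print(trigram, count)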

 

 

Figure 14. Top twenty trigrams for Nigeria

 

 

As can be seen from the top 20 trigrams for Nigeria, the top concerns are Renewable energy, Renewable energy training, Electricity distribution companies, Renewable energy sources, Renewable energy solutions, Solar renewable energy, Renewable energy sector, Affordable prices, Power supply, Climate change renewables, Public-private sectors, Renewable energy industry, Renewable energy policies, and Access to renewable energy. These top concerns match the insights drawn from clustering in the previous section.

 

 

Figure 15. Top twenty trigrams for UK

 

 

As can be seen from the top 20 trigrams for the United Kingdom, the top concerns are Free renewable energy, Renewable energy sources, Using renewable energy, and New renewable energy. These top concerns match the insights drawn from clustering in the previous section.

 

 

Figure 16. Top twenty trigrams for USA

 

 

As can be seen from the top 20 trigrams for the USA, the top concerns are Clean renewable energy, Renewable energy sources, Supporting renewable energy, Renewable fuel standard, Transition into renewable energy, Solar renewable energy, New renewable energy, Using renewable energy, Need for quality products, and Renewable energy jobs. These top concerns match the insights drawn from clustering in the previous section.

 

 

5. Collocated word clouds & Co-occurrence Network

The following plots display the networks of co-occurring words in tweets in different countries. Here, we visualize the network of the top 25 occurring bigrams. The connections between the words confirm the insights derived in the previous sections in all cases.
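One way to build such a network, sketched with networkx under the same assumed inputs as before:

import matplotlib.pyplot as plt
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')
counts = vectorizer.fit_transform(cleaned_tweets).sum(axis=0).A1
top25 = sorted(zip(vectorizer.get_feature_names_out(), counts),
               key=lambda pair: -pair[1])[:25]

graph = nx.Graph()
for bigram, count in top25:
    first, second = bigram.split()
    graph.add_edge(first, second, weight=count)  # edge weight = co-occurrence count

nx.draw_networkx(graph, node_color='lightblue')
plt.show()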

 

 

Figure 17. Collocate Clouds-India

 

 

Figure 18. Co-occurrence Network-India (First 25 Bigrams)

 

 

Figure 19. Collocate Clouds-Nigeria

 

 

Figure 20. Co-occurrence Network-Nigeria (First 25 Bigrams)

 

 

Figure 21. Collocate Clouds-UK

 

 

Figure 22. Co-occurrence Network-UK (First 25 Bigrams)

 

 

Figure 23. Collocate Clouds-USA

 

 

Figure 24. Co-occurrence Network-USA (First 25 Bigrams)

 

 

 

 

 

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

Using Topic Modeling and Coreference Resolution to Identify Land Conflicts and Its Causes

Improving the accuracy score from 83% to 93% to identify land conflict topics in news articles.

 

Identifying environmental conflict events in India using news media articles

 

Part of this project was to scrape news media articles to identify environmental conflict events such as resource conflicts, land appropriation, human-wildlife conflict, and supply chain issues.

With an initial focus on India, we also connected conflict events to their jurisdictional policies to identify how to resolve those conflicts faster or to identify a gap in legislation.

Part of the pipeline in building this language model was a semi-supervised attempt to improve topic modeling performance; the process and the outcome are available here.

In short, to make this topic model robust, coreference resolution was suggested as one possible addition.

 

The Solution

What exactly is Coreference Resolution?

Coreference resolution is the task of finding all expressions that refer to the same entity in a text (1)

 

Use Cases

  1. In the context of this project, coreference resolution is best used to improve topic modeling performance by replacing references with the entity they refer to, in order to better model the actual meaning of the text. This increases the TF-IDF of generalized entities and removes ambiguous words that are meaningless for classification.
  2. Another use case would be to use the coreferenced text data as additional features, along with Named Entity Recognition tags, in any classification approach. A one-hot-encoded version of unique entities can be used as input to factorization machines or other approaches for sparse modeling.

 

Which packages are available to implement it?

 

Exploring almost every available python package out there.

 

We toyed around with some packages that seemed good in theory but were rather challenging to apply to our specific task. We needed a package that would be user-friendly, as a script would have to be developed for 28 people to take and apply without much struggle.

The candidates were NeuralCoref, Stanford NLP, Apache OpenNLP, and AllenNLP. After trying out each package, I personally preferred AllenNLP, but as a team we decided to use NeuralCoref, with a short but effective script written by one of the collaborators, Srijha Kalyan.

The code was applied to the article data annotated by fellow collaborators from the Annotation Task Group. This resulted in a CSV file with the original article titles, the original article text, and a new column of coreference-resolved article text, not as chains but in the same written format as the original article text.
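For illustration, the core of such a script with NeuralCoref (spaCy 2.x) looks roughly like this; the DataFrame and column names are assumptions, not the collaborator’s exact code:

import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)   # registers doc._.coref_resolved and friends

doc = nlp('Farmers were tending to their field when a dam burst '
          'and swept them away.')
print(doc._.has_coref)        # True if any coreference chain was found
print(doc._.coref_resolved)   # text with each mention replaced by its main mention

# applied over the annotated articles (df and column names assumed):
df['coref_text'] = df['text'].apply(lambda t: nlp(t)._.coref_resolved)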

 

 

The output was then sent to the Topic Modeling Task Team, which at that point was sitting at an accuracy of 83%. With the coreference-resolved data, the accuracy jumped to 93%.

That’s a 10 percentage point improvement! All the hard work and hours were clearly worth it!

 

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

Named Entity Recognition with SpaCy to Identify Actors and Actions in News Articles

Identifying actors and actions in news articles about land conflicts in India. The work has been part of an Omdena AI project with the World Resources Institute on identifying land use conflicts and matching them with mediating government policies.

 

Suppose we have the following excerpt from a news article:

 

 

We want to identify within the article the following key elements (entities):

  • Actor — who/what are the main actor(s) in the conflict referred to in this article?
  • Action — what is the main action or event of a conflict in this article?

For human beings, this task is fairly simple: we would identify ‘tiger’, ‘farmer’, and ‘forest officials’ as the ‘actors’ and ‘attacked’ as the ‘action’. Things get a bit murky when it comes to defining ‘action’ in certain contexts (would you identify ‘tranquilize’ as the main action or not?). Overall, humans would more or less agree on what the ‘actor’ and ‘action’ items are.

A model that can do this will be deemed a successful named entity recognizer.

 

 

Pretty good, don’t you think? If you are curious how this works, read ahead!

 

The Problem: Resolving land conflicts in India 

Typically, Named Entity Recognition (NER) happens in the context of identifying names, places, famous landmarks, years, etc. These entities come built into standard NER packages like spaCy, NLTK, and AllenNLP.

The challenge for us was to create a custom entity recognizer as our entities were ‘non-standard’ and needed to be adapted to the AI challenge.

The World Resources Institute (WRI) had approached Omdena to further its project on identifying land-related environmental conflicts in India, which affect more than 7 million people.

 

 

The idea was to identify where the conflicts were happening, which groups of people they affected, and the scale of the conflicts, and to classify the kinds of conflicts and match them with the related government policies to resolve them faster.

 

Among these, identifying groups of people, scale, action, location, and date came under the scope of Named Entity Recognition using SpaCy.

In this article, we will deal with identifying actors, actions, and scales. Location and date are standard entities that can be obtained by plug-and-playing an off-the-shelf entity recognizer.

 

The data

The raw data was initially about 65,000 news articles from Indian newspapers obtained from GDELT. In its own words, GDELT is ‘Creating a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day and that stretches back to January 1, 1979, through present day.’ All the text was either originally in English or translated to English by GDELT.

 

The Solution: Coreference resolution

An important milestone identified before we started our labeling process was the need for coreference resolution. Consider this fictional text:

‘Farmers were caught in a flood in Maharashtra. Kabir Narayan and Kamal Bashir were tending to their field when a dam burst and swept them away’.

Here, ‘Farmers’, ‘Kabir Narayan’, and ‘Kamal Bashir’ refer to the same entity. However, an entity recognizer will typically treat them as three separate entities. We wanted our entity recognizer to identify them all as ‘farmers’. This is where coreference resolution comes in: it is the essential pre-step in the entity recognition process that identifies ‘Kabir Narayan’ and ‘Kamal Bashir’ as referring to the same entity, ‘farmers’, that occurs before. We won’t be able to go into depth about how coreference resolution works. If you’re interested, here’s a useful blog that explains coreference resolution and also shows how to use spaCy’s coreference package, which is also what we used in our solution. Here’s also a blog by Zaheeda Tshankie, the task manager for the coreference resolution task, with her take on what coreference resolution looked like in this particular case.

Some subtleties regarding entity labeling.

The next important step in this task was to manually label our entities. To train the model, spaCy’s advice is to provide ‘a few hundred’ labeled samples of text. As it turned out, we had manually identified about 1,300 articles as either ‘positive’, i.e., indeed referring to an environmental conflict, or ‘negative’. In the beginning, we aimed to label 500 of these with our custom entities. However, we realized that this was not the easiest or most suitable task. Here is a subtlety specific to entity recognition tasks: not all texts are suitable for all entity identification. For example, consider this text: ‘India is home to several hundred species of birds’. In this piece of text, it is difficult to identify the ‘action’; it is a descriptive text with no conflict that can be labeled as an ‘action’. For this reason, we decided to restrict our attention to the positive articles only. There were 147 of them.

There is a further subtlety regarding potentially nebulous entities such as ‘action’. From the beginning, the instructions were clear: we were to identify and label only the ‘main action’ of any news article. But, as we realized, this can be a fairly subjective task. For instance, consider the following text.

 

 

During the labeling, we encountered articles such as the one above. One example of labeling is as shown. This is not incorrect, however, I would have probably labeled this differently, marking only ‘killed’ as the ‘action’, ‘elephants’ and ‘tigress’ as ‘actors’. When we are working with several people during labeling, we have to account for the fact that people may misunderstand rules, through no fault of their own. Rather, the onus is on the rules and the more precise the rules are, the better the labeling process goes. This was a lesson well learned. However, sometimes even when the rules are precise, it is still possible to hit some ‘grey areas’ where it’s difficult to be completely objective and the subjectivity of the labeler comes into play. This is an inherent feature of ‘ambiguous’ labels like action and I am not sure if I have a solution to this. If you have any thoughts on this, please do leave them in the comments.

 

Pre-built entity recognizers

There are several libraries that come pre-trained for Named Entity Recognition, such as spaCy, AllenNLP, NLTK, and Stanford Core NLP. We decided to opt for spaCy for two main reasons: speed, and the fact that we can add NeuralCoref, a coreference resolution component, to the pipeline for training.

If you would like a more detailed comparison of Named Entity Recognition libraries such as spaCy, here’s a blog on it.

 

Using Doccano

In order to make the labeling task as easy and efficient as possible, we decided to use Doccano’s annotation tool. Their description is as follows: ‘Doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence to sequence. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotation. You can build dataset in hours.’

Here is what it looks like in practice.

 

 

Converting JSONL to spaCy format

Doccano provides entities in a JSONL format, and we needed to convert them to the tuple format that spaCy accepts. The code below does the conversion. Credits to Tomasz Grzegozek.

import json

# Convert Doccano's JSONL export into spaCy's training tuple format:
# (text, {"entities": [(start, end, label), ...]})
def convert_doccano_to_spacy(filepath):
    with open(filepath, 'rb') as fp:
        data = fp.readlines()
    training_data = []
    for record in data:
        entities = []
        read_record = json.loads(record)
        text = read_record['text']
        entities_record = read_record['labels']
        for start, end, label in entities_record:
            entities.append((start, end, label))
        training_data.append((text, {"entities": entities}))
    return training_data

 

Training the model

Here we used the following block of code, inspired by this blog.

 

import random
import spacy

TRAIN_DATA = train   # output of convert_doccano_to_spacy

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # nlp.add_pipe(nlp.create_pipe('sentencizer'))
    # Adding sentencizer as a prerequisite to coref
    # neuralcoref.add_to_pipe(nlp)  # Adding coreference to the pipeline
    # register every custom entity label found in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print('Starting iteration ' + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],         # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,       # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

custom_ner = train_spacy(TRAIN_DATA, 20)
# Save our trained model
custom_ner.to_disk('Custom_NER_Model')

 

Conclusion

The training gave us some pretty good results. The model was especially good at picking up ‘actor’.

 

 

There were failures by the model, too. Here is an example.

 

 

In the example above, the model misses ‘massive protest’ as the important action and instead, identifies a long piece of text (which could be considered a secondary action) as the main action.

As mentioned before, defining ‘action’ is ambiguous even for humans, so it’s no wonder that the model got it wrong a few times. I do believe that with stricter rules for labeling, the model would have performed better.

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 

Stay in touch via our newsletter.

Be notified (a few times a month) about top-notch articles, new real-world projects, and events with our community of changemakers.

Sign up here