Modeling Economic Well-being through AI, Satellite Imagery, and Census Data


This article is written by Harshita Chopra along with collaborators Arpan Mishra, Precioso Gabrillo, and Raghunath Pal.



Economic well-being is a broad concept that goes beyond statistical metrics. When you plan on moving to another place, do you primarily check complex economic measures like the GDP of that region? In such decisions, what matters to most people is simply the standard of living: the necessities, comforts, and luxuries a person is likely to enjoy, that is, the quantity and quality of their consumption. The fundamental reason for differences in standards of living between regions is the difference in their levels of economic productivity.

Hence, it is important for nations to maintain a source of primary data that informs planning and policy-making by governments, international agencies, scholars, business people, industrialists, and many more.

This data is usually collected through on-site surveys performed across vast areas: families and individuals answer a long list of questions, which results in a huge database.


The Problem

These surveys are conducted over a period of a few years and involve huge manpower and expenditure.

The Indian Census 2011 cost INR 2,200 crores (USD 295 million)

There are also associated risks of data abuse and corruption. Moreover, the temporal variation of the factors affecting economic well-being makes it even harder to compare the progress of regions over time.

Instead, we can train AI models to learn features of the changing agricultural and urban landscape, providing a better understanding of economic well-being.

World Resources Institute (WRI) is a global research organization that spans more than 60 countries and works towards turning big ideas into action at the nexus of environment, economic opportunity, and human well-being. 




WRI posed an intriguing problem statement: create a machine learning algorithm that can serve as a proxy for socio-economic well-being in India, using a remote sensing approach based on satellite images.

To make this possible, Omdena brought together 40 AI engineers from 20+ countries to collaborate on this project. The aim was to create a prototype that can predict variables or features representing the standard of living of a place, particularly in data-poor regions. This remote approach uses AI and computer vision to extract latent features from satellite images that can help build a baseline model.


How we solved it

In this article, we’re going to highlight one of the final delivered models, based on Indian census data. The aim of this model is to use satellite images to classify each region as a high, medium, or low economic well-being region.


1. Preparing the Ground Truth

In anticipation of the upcoming Census 2021, WRI expressed a major interest in working on its census dataset as there was a need to prepare the model for the upcoming release.

The census is an official survey of the population that gathers socio-economic information about households in a specific region and time frame.

Our panel data team mobilized quickly to scrape the Census 2011 website for district-level household data, which contains a number of features representing the condition of houses and the assets owned. We wrote a script that extracted these features for each of the 640 districts of India into one single CSV file.

We followed the methodology described in this research paper as a guide for our workflow. The census data were subdivided into groups and the features were reduced. In line with existing research, we formulated these six variables:

  • Fuel for Cooking
  • Main Source of Water
  • Main Source of Light
  • Condition of Household
  • Material of Roof
  • Assets Owned

All of these six variables had three categories —

  • RUD (rudimentary): Features that represent primitive methods such as using firewood, river water, poor house condition, grass roof, etc.
  • INT (intermediate): Features that represent medium-grade methods such as using kerosene for lighting, tubewell water, owning a liveable house, etc.
  • ADV (advanced): Features that represent better household conditions, such as owning a car, using electricity, tap water, etc.

After this division, we applied K-means clustering to identify three clusters corresponding to the above categories. Each cluster was visualized with a box plot to associate it with a level of economic well-being.
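A sketch of this clustering step with scikit-learn, using synthetic shares standing in for the real census proportions (the Dirichlet parameters and cluster labels below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical district-level shares for one variable ("Main Source of
# Water"): columns are water_rud, water_int, water_adv proportions.
rng = np.random.default_rng(0)
low    = rng.dirichlet([8, 3, 1], size=50)   # rudimentary-dominated districts
medium = rng.dirichlet([3, 8, 1], size=50)   # intermediate-dominated
high   = rng.dirichlet([1, 3, 8], size=50)   # advanced-dominated
X = np.vstack([low, medium, high])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Label each cluster by its dominant category, as the box plots did.
names = ["rud", "int", "adv"]
dominant = [names[int(np.argmax(c))] for c in kmeans.cluster_centers_]
print(dominant)
```

Each cluster centroid is dominated by one of the three category shares, which is exactly what the box plots reveal visually.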


Example: Three Clusters for Variable — Main Source of Water / Source: Omdena


In the image above, three plots represent the three clusters, and each indicates whether the cluster belongs to the Low, Medium, or High economic well-being class. Cluster 1 depicts ‘High’ (since water_adv is the highest), cluster 2 depicts ‘Medium’ (since water_int is the highest), and cluster 3 depicts ‘Low’ (since water_rud is the highest).

This was done for all 6 variables. After this pre-processing of Census data, our dataset looked like this:



Source: Omdena



2. Satellite Image Acquisition

After the ground truth was set up, we needed satellite images corresponding to those 640 districts. Since we had to rely on open-source satellite images, we selected Google Earth Engine to download them.

Google Earth Engine provides datasets from various satellites. Since we used census data from 2011, we needed images of the districts from that year. Of the two popular satellites, Sentinel-2 (data available from 2015) and Landsat 7 (data available from 1999), we selected the Landsat 7 Tier 1 TOA Reflectance collection in order to acquire images from 2011.

Landsat 7 images have a 30 m/pixel resolution, which means that every pixel of the image covers a 30 x 30 meter patch of the Earth!


Jalgaon, Maharashtra / Source: Omdena


Next, we decided on the bands we would need in our satellite images. The images stored on our devices contain three bands (Red, Green, and Blue). Satellite images, however, are multi-banded and may contain up to 12 bands. Not all of them would be useful to us, so we settled on the Red, Green, Blue, NIR (Near Infrared), and SWIR-1 (Shortwave Infrared) bands.

The image of any particular region varies with the time it was taken, the cloud cover, the angle of the satellite, etc. Google Earth Engine allows us to filter the best images available for a region throughout our time period and aggregate them into one single image composite. We downloaded 640 median-aggregated image composites, one per district.
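The effect of median compositing can be illustrated with plain NumPy, using synthetic scenes standing in for the Earth Engine collection (the scene count, cloud fraction, and reflectance values below are made up):

```python
import numpy as np

# Several co-registered scenes of the same district; some pixels are
# ruined by clouds. The per-pixel median recovers a clean composite.
rng = np.random.default_rng(42)
scenes = rng.normal(loc=0.2, scale=0.02, size=(12, 64, 64))  # 12 scenes, 1 band

# Simulate clouds: random pixels saturated to very high reflectance.
cloud_mask = rng.random(scenes.shape) < 0.1
scenes[cloud_mask] = 1.0

composite = np.median(scenes, axis=0)  # one clean 64x64 composite
print(composite.shape, round(float(composite.mean()), 3))
```

The median is robust to the saturated cloud pixels, so the composite stays close to the true surface reflectance even though every scene is partly contaminated.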


3. Creating Features from Images

Every raster we downloaded contains five bands; we merged these bands in different ways to analyze the geographical features of each region.

For example, if we combine the Near Infrared and Red bands using the formula (NIR − RED) / (NIR + RED), we get a single-band image. We call this the NDVI of our image, which stands for Normalized Difference Vegetation Index.

The special thing about this image is that all the pixels with any shade of green are highlighted. This basically tells us where in the district there is a higher concentration of vegetation. Vegetation is also an indicator of economic well-being, so we can use the NDVI image as an input to our AI model.


Different indices highlight different features on the map. See below for more details on each index. Source: Omdena


Similarly, we can combine the SWIR band with the other bands to get other indices. Apart from the NDVI we also calculated the NDBI (Normalized Difference Built-Up Index) and the NDWI (Normalized Difference Water Index). As the name suggests the NDBI highlights the concentration of the built-up areas and NDWI highlights the water content, both of which could be indicators of social well-being.
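All three indices are "normalized differences" of band pairs, so they can be computed with a single helper; a NumPy sketch (the tiny 2x2 band rasters below are made-up reflectance values, not real Landsat data):

```python
import numpy as np

def normalized_difference(a, b):
    """(a - b) / (a + b), guarding against division by zero."""
    return (a - b) / np.clip(a + b, 1e-6, None)

# Toy 2x2 band rasters (reflectance in [0, 1]); real inputs are the
# Landsat 7 bands downloaded per district.
red   = np.array([[0.10, 0.30], [0.20, 0.05]])
green = np.array([[0.12, 0.25], [0.22, 0.30]])
nir   = np.array([[0.60, 0.35], [0.25, 0.04]])
swir1 = np.array([[0.20, 0.45], [0.30, 0.03]])

ndvi = normalized_difference(nir, red)    # vegetation
ndbi = normalized_difference(swir1, nir)  # built-up areas
ndwi = normalized_difference(green, nir)  # water content
print(ndvi[0, 0])  # vegetated pixel (high NIR, low Red) -> strongly positive
```

A pixel with high NIR and low Red reflectance (dense vegetation) pushes NDVI toward +1, while water pushes NDWI positive, which is what makes these indices useful model inputs.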


4. Model Architecture

The inputs and expected outputs were ready! The task was to build an image classification model. Most of the available pre-trained models are for RGB images. However, satellite images contain multiple bands. So we decided to generate 3-band images using the extracted indices.

To be able to use transfer learning, as well as utilize all the features in our images, we came up with the following architecture:

A Multi-modal Multi-task Deep Learning Model — It inputs two images and outputs values for multiple variables.


The Multi-modal Multi-task Deep Learning Model. Source: Omdena


We take two inputs:

  • The first is an RGB image, which is just like any normal image we look at, containing the Red, Green, Blue bands.
  • The second is a combination of the NDVI, NDBI, and NDWI of our image.
    NDVI — Normalized Difference Vegetation Index
    NDBI — Normalized Difference Built-up Index
    NDWI — Normalized Difference Water Index
    This new image highlights a mixture of features corresponding to vegetation cover, built-up area, and water bodies of that region.

We then rescale the pixels to the 0–255 range so that the pre-trained models can be used with them.
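That rescaling step might look like the following minimal sketch (simple min-max scaling; the project may well have used a different normalization, and percentile clipping is common in practice to suppress outliers):

```python
import numpy as np

def rescale_to_uint8(band):
    """Min-max rescale a float band to the 0-255 range expected by
    pretrained image backbones."""
    lo, hi = band.min(), band.max()
    scaled = (band - lo) / max(hi - lo, 1e-6) * 255.0
    return scaled.astype(np.uint8)

band = np.array([[-0.4, 0.0], [0.3, 0.8]])  # e.g. an NDVI band in [-1, 1]
out = rescale_to_uint8(band)
print(out.min(), out.max())  # 0 255
```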

Our data was split so that every state is well represented in the model: 80% of each state's districts went to the training set and 20% to the test set. We also ran 10-fold cross-validation, so every district in the dataset was used for evaluation at some point.

Images are passed through a popular deep learning architecture, ResNet-18, combined with fully connected layers to produce our desired outputs. The model outputs three classes (high, medium, low) for each of the indicators mentioned above; hence, we solved a multi-modal, multi-task learning problem. The model achieved an overall accuracy close to 70%.


Results and Insights

To analyze the overall development of a region based on the six indicators, we formulated an Overall Development Index (ODI) to judge the economic well-being of a region as a whole. The index score calculated for each district ranged from 6 to 18 and was computed as follows:

Overall Development Index (ODI) = A1 + A2 + A3 + A4 + A5 + A6, where for each indicator x:

Ax = 1 if indicator x is “Low”
   = 2 if indicator x is “Medium”
   = 3 if indicator x is “High”
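The ODI computation is a simple score lookup and sum; as a sketch (the example district labels are made up):

```python
# Six 3-class predictions mapped to scores 1-3 and summed,
# giving an index in the range 6 to 18.
SCORE = {"Low": 1, "Medium": 2, "High": 3}

def overall_development_index(predictions):
    """predictions: one class label per indicator (six in our case)."""
    return sum(SCORE[p] for p in predictions)

district = ["High", "Medium", "Medium", "Low", "High", "Medium"]
print(overall_development_index(district))  # 3+2+2+1+3+2 = 13
```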

Ground Truth vs Model Predictions of Overall Development Index for Census 2011 / Source: Omdena


Driven by curiosity and project interests, we also dived deeper into the data to uncover hidden statistics and actionable insights. Through exploratory data analysis and BI tools such as Tableau and Google Data Studio, we created dashboards to visualize the data in different customizable views.

We discovered that the districts of India were almost evenly distributed in terms of High, Medium, and Low Overall Development.


Distribution of Districts by the Overall Development Index for each indicator of economic well-being. Source: Omdena




Satellite images can act as a great proxy for existing data collection techniques such as surveys and census to predict the economic well-being of a region. It also makes it possible to determine the economic well-being of areas that are inaccessible to humans, for example, the rocky terrains of the northeastern region of India, the Himalayas or villages in the deserts.

The model is highly scalable and adaptable and can be trained on existing satellite imagery and surveys of other countries as well. It can help save a lot of manpower and time which acts as a major challenge in our existing development assessment initiatives. The prototype developed by our team in this eight-week challenge can be a springboard to a wider and in-depth expansion of this machine learning tool for predicting economic well-being.

With rapid advancements in technology, possible future work can include using high-resolution images or other popular datasets such as the Demographic and Health Survey (DHS) or Living Standards Measurement Surveys (LSMS) as the ground truth. Future applications also include tracking urbanization along with vegetation cover over a period of time. This can reflect on how the socio-economic conditions of regions evolve along with changing environmental factors.

AI For Financial Inclusion: Credit Scoring for Banking the Unbankable


Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.



The background

With a traditional credit scoring system, it is essential to have a bank account and regular transactions. But some groups of people, especially in developing nations, still do not have a bank account for a variety of reasons: they do not see the need for one, they cannot produce the necessary documents, the cost of opening an account is too high, they lack knowledge or awareness about opening accounts, they have trust issues, or they are unemployed.

Some of these individuals may need loans for essentials, perhaps to start a business, or, like farmers, to buy fertilizers or seeds. Many of them may be reliable borrowers, but because they do not get access to funding, they are pushed to take out high-cost loans from non-traditional, often predatory lenders.

Low-income individuals can be adept at managing their personal finances. We need an ethical credit scoring AI system to help these borrowers and keep them from falling into deeper debt.

Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.


The problem statement


The goal was to determine the creditworthiness of an un-banked customer with alternate and traditional credit scoring data and methods. The data was focused on Indonesia but the following approach is applicable in other countries.

It was a challenging project. I believe everyone should be eligible for a loan for essential business ventures, but they should be able to pay it back without exorbitant interest rates. Finding that balance was crucial for our project.


The data

Three datasets were given to us:

1) Transactions

Information on transactions made by different account numbers, the region, mode of transaction, etc.


2) Per capita income per area

All the data is privacy law compliant.


3) Job title of the account numbers

All data given to us was anonymous as privacy was imperative and not an afterthought.

Going through the data we understood we had to use unsupervised learning since the data was not labeled.

Some of us compared openly available datasets to the dataset at hand, while others worked on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with the silhouette score, a heuristic that indicates whether the chosen parameters produce significant clusters. The best value is 1, for well-separated clusters; the worst is -1, for strongly overlapping ones. We got average values close to 0, which was not satisfactory.
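The silhouette heuristic can be illustrated on synthetic data with scikit-learn (the blobs below are made up; the overlapping pair mimics what we saw on the customer data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Well-separated blobs score near 1; heavily overlapping blobs score low.
separated = np.vstack([rng.normal(0, 0.1, (50, 2)),
                       rng.normal(5, 0.1, (50, 2))])
overlapping = np.vstack([rng.normal(0.0, 2.0, (50, 2)),
                         rng.normal(0.5, 2.0, (50, 2))])

scores = {}
for name, X in [("separated", separated), ("overlapping", overlapping)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    scores[name] = silhouette_score(X, labels)
    print(name, round(scores[name], 2))
```

A score hovering near 0, as in the overlapping case, says the candidate clusters are not meaningfully separated, which is why our early results were unconvincing.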


Feature engineering

With the given data we performed feature engineering. We calculated a per capita income score and segregated management roles from other roles, so that we could bucket accounts in areas likely to contain reliable customers. For example, management roles imply a better income with which to pay back a loan.




But even with all the feature engineering, we were unable to get a signal from the data given for clustering. How did we proceed?

Since we still faced these challenges, we could not deliver a single definitive solution to the customer and had to improvise to provide a plan for future analysis, so we used dummy data. We also scraped data online from different sites such as Indeed and Numbeo.

From Numbeo we got the cost of living per area, i.e., how much people spend on living; from Indeed we got salary data to assign an average salary to each job.




Using the data scraped online and the features engineered from the given dataset, we tried to figure out whether clustering algorithms could yield a prediction.


The solutions

  • Engineered Features & Clusters (from Datasets given)
  • Machine Learning Pipelines/Toolkit (for Datasets not provided)
  • Unsupervised Learning Pipeline
  • Supervised Learning Pipeline (TPOT/auto-sklearn)


1. Engineered features & clusters



As mentioned above, with the context gathered from Creedix, we engineered or aggregated many features from the transaction time-series dataset. Although these features describe each customer better, we could only guess each feature's importance to a customer's credit score based on our research. So, we consolidated features for each customer based on our research on credit scoring. The importance of each feature for credit scoring in Indonesia will be up to the Creedix team to decide.

For example,

CreditScore = 7*Salary + 0.5*Zakut + 4000*Feature1 + …+ 5000*Feature6

The solutions given to Creedix covered both supervised and unsupervised learning. Even after all the feature engineering and the data found online, we were still getting a low silhouette score, signifying overlapping clusters.

So we decided to provide solutions for both supervised learning (using AutoML) and unsupervised learning, both using dummy variables; the purpose was to serve future analysis and modeling by the Creedix team.

The dataset we used for Supervised Learning —

With supervised learning, we did modeling with both TPOT and auto-sklearn. This way, when the Creedix team has access to target variables and to features that may not be available to Omdena collaborators, they can use this setup to build their own models.


2. The model pipeline for Supervised Learning

Our idea was to create a script that can take any dataset and automatically search for the best algorithm by iterating through classifiers/regressors and hyperparameters based on user-defined metrics.

Our initial approach was to code this from scratch, iterating over individual algorithms from packages such as scikit-learn, XGBoost, and LightGBM, but then we came across AutoML packages that already do what we wanted to build. We decided to use those readily available packages instead of spending time reinventing the wheel.

We used two different AutoML packages, TPOT and auto-sklearn. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.



Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning. It leverages recent advances:

  • Bayesian optimization
  • Meta-learning
  • Ensemble construction

Both TPOT and auto-sklearn are similar, but TPOT stands out due to its reproducibility: it generates both the model and the Python script that reproduces it.


3. Unsupervised Learning

In the beginning, we used agglomerative clustering (a form of hierarchical clustering), since the preprocessed dataset contained a mix of continuous and categorical variables. Because we had generated many features from the dataset (some very similar, based on small variations in their definitions), we first had to eliminate most of the correlated ones; without this, the algorithm would struggle to find the optimal number of groupings. After this step, we were left with the following groups of features:

  • count of transactions per month (cpma),
  • average increase/decrease in the value of specific transactions (delta),
  • average monthly amount of specific transactions (monthly amount),

and three single specific features:

  • Is Management — assumed managerial role,
  • Potential overspend — value estimating assumed monthly salary versus expenses from the dataset,
  • Spend compare — how a customer’s spending (including cash withdrawals) differs from average spending within similar job titles.
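The correlation filter mentioned above (dropping one feature out of every highly correlated pair before clustering) can be sketched with pandas; the feature names and the 0.95 threshold here are illustrative, not the project's exact values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "transfer_cpma": base,
    "transfer_cpma_v2": base + rng.normal(scale=0.01, size=200),  # near-duplicate
    "purchase_delta": rng.normal(size=200),
    "monthly_amount": rng.normal(size=200),
})

# Upper triangle of the absolute correlation matrix; drop any column
# that is > 0.95 correlated with an earlier one.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['transfer_cpma_v2']
```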


In a range of potential clusters from 2 to 14, the average silhouette score was best with 8 clusters — 0.1027. The customer data was sliced into 2 large groups and 6 much smaller ones, which was what we were looking for (smaller groups could be considered anomalous):



This was still not a satisfactory result. On practical grounds, describing clusters 3 to 8 proved challenging, which is consistent with the relatively low clustering score.



It should be remembered that the prime reason for clustering was to find reasonably small, describable, anomalous groupings of customers.

We therefore decided to apply an algorithm that handles outliers efficiently: DBSCAN. Since the silhouette score is well suited to convex clusters and DBSCAN is known to return complex non-convex clusters, we forwent calculating clustering scores and focused on analyzing the clusters returned by the algorithm.
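A minimal illustration of why DBSCAN suits this task: points that fit no dense region get the label -1 and form an "anomalous" group. The data here is entirely synthetic, with the eps and min_samples values chosen for the toy geometry:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)

# Two dense customer groups plus a handful of far-away outliers.
dense_a = rng.normal(0, 0.2, (100, 2))
dense_b = rng.normal(5, 0.2, (100, 2))
outliers = np.array([[20.0, 20.0], [-15.0, 8.0], [9.0, -12.0]])
X = np.vstack([dense_a, dense_b, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # [-1, 0, 1]
```

Unlike K-means, DBSCAN never forces the outliers into a cluster, which is exactly the behavior we wanted for isolating anomalous customers.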

Manipulating DBSCAN's parameters, we found the clustering effects stable: the clusters contained similar counts, and customers did not move between non-anomalous and anomalous clusters.

While analyzing the clusters, we also found it easier to describe the qualities of each one, for example:

  • one small group, contrary to most groups, had no purchases, no payment transactions, and no cash withdrawals, but a few relatively high transfers via the mobile channel,
  • another small group also had no purchase or payment transactions, but did make cash withdrawals,
  • yet another small group had the highest zakat payments (for religious causes) and a high number of mobile transactions per month,
  • the group considered anomalous (cluster coded -1), with over 300 customers, differentiated itself with falling values across most transaction types (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.



It is important to note that for various sets of features within the data provided, both the hierarchical and the DBSCAN methods returned even better clustering efficiency scores. However, at this level of anonymity (i.e., without ground truth information), one cannot decide on the best split of customers. A relatively different set of features might best split the customers and provide better entropy scores for these groups when calculated on the creditworthiness category.

Enhancing Satellite Imagery Through Super Resolution


By James Tan


The power of deep learning paired with collaborative human intelligence to increase crop cultivation through super resolution.


The Problem

In order to accurately locate crop fields from satellite imagery, it is conceivable that images of a certain quality are required. Although deep learning is notorious for pulling off miracles, we humans will have a real field day labeling the data if we cannot clearly make out the details within an image.

The following is an example of what we would like to achieve.


Semantic Segmentation Example



If we can clearly identify areas within a satellite image that correspond to a particular crop, we can easily extend our work to evaluate the area of cultivation of each crop, which will go a long way in ensuring food security.


The Solution

To get our hands on the required data, we explored a myriad of satellite imagery sources. We ultimately settled on images from Sentinel-2, largely because that mission boasts the best image quality among open-source options.



Is the best good enough?


Original Image



Despite my previous assertion that I am not an expert in satellite imagery, I believe that having seen the above image we can all agree that the quality of it is not quite up to scratch.

A real nightmare to label!

It is completely unfeasible to discern the demarcations between individual crop fields.

Of course, this isn’t completely unreasonable for open-source data. Satellite image vendors have to be especially careful when it comes to the distribution of such data due to privacy concerns.

How outrageous it would be if simply anyone can look up what our backyards look like on the Internet, right?

However, this inconvenience comes at a great detriment to our project. In order to clearly identify and label crops in an image that is relevant to us, we would require images of much higher quality than what we have.


Deep learning practitioners love to apply what they know to solve the problems they face. You probably know where I am going with this. If the quality of an image isn't good enough, we try to enhance it, of course! This process is called super-resolution.


Deep Image Prior

This is one of the first things that we tried, and here are the results.


Results of applying Deep Image Prior to the original image.


Quite noticeably, there has been some improvement: the model has done an amazing job of smoothing out the rough edges in the photo. The pixelation problem has been pretty much taken care of, and everything blends in well.

However, in doing so the model has neglected finer details and that leads to an image that feels out of focus.



Naturally, we wouldn’t stop until we got something completely satisfactory, which led us to try this instead.


Results of applying Decrappify to the original image.



Now, it is quite obvious that this model has done something completely different from Deep Image Prior. Instead of attempting to ensure that the pixels blend in with each other, this model places great emphasis on refining each individual pixel. In doing so, it neglects to consider how each pixel relates to its surrounding pixels.

Although it succeeds in injecting some life into the original image by making the colors more refined and attractive, the pixelation in the image remains an issue.


The Results


Results of running the original image through Deep Image Prior and then Decrappify.


When we first saw this, we couldn't believe what we were seeing. We have come such a long way from our original image! And to think that the approach taken to achieve these results was such a silly one.

Since each of the previous two models was no good individually, but each was clearly good at getting something different done, what if we combined the two?

So we ran the original image through Deep Image Prior, and subsequently fed the results of that through the Decrappify model, and voila!

Relative to the original image, the colors of the current image look incredibly realistic. The lucid demarcations of the crop fields will certainly go a long way in helping us label our data.


Our Methodology

The way we pulled this off was embarrassingly simple. We used Deep Image Prior as found in its official GitHub repository. As for Decrappify, given our objectives, we figured that training it on satellite images would definitely help. With the two models set up, it's just a matter of feeding images into them one after the other.


A Quick Look at the Models

For those of you that have made it this far and are curious about what the models actually are, here’s a brief overview of them.

Deep Image Prior

This method hardly conforms to conventional deep learning-based super-resolution approaches.

Typically, we would create a dataset of low- and high-resolution image pairs and then train a model to map a low-resolution image to its high-resolution counterpart. However, this particular model does none of the above and, as a result, does not have to be pre-trained prior to inference time. Instead, a randomly initialized deep neural network is trained on one particular image. That image could be one of your favorite sports stars, a picture of your pet, a painting that you like, or even random noise. Its task is then to optimize its parameters to map the input image to the image we are trying to super-resolve. In other words, we are training our network to overfit to our low-resolution image.

Why does this make sense?

It turns out that the structure of deep networks imposes a ‘naturalness prior’ over the generated image. Quite simply, this means that when overfitting to (memorizing) an image, deep networks prefer to learn the natural, smoother concepts before moving on to the unnatural ones. That is to say, the convolutional neural network (CNN) will first ‘identify’ the colors that form shapes in various parts of the image and then proceed to materialize various textures. As the optimization process goes on, the CNN latches on to finer details.

When generating an image, neural networks prefer natural-looking images as opposed to pixelated ones. Thus, we start the optimization process and allow it to continue to the point where it has captured most of the relevant details but has not yet learned any of the pixelation and noise. For super-resolution, we train it to a point such that the resulting image, when downsampled, closely resembles the original low-resolution image. There exist multiple super-resolution images that could have produced each low-resolution image.

And as it turns out, the most plausible image is also the one that does not appear highly pixelated; again, this is because the structure of deep networks imposes a ‘naturalness prior’ on generated images.
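This single-image training loop can be sketched in a few lines of PyTorch. This is a toy stand-in network, not the paper's hourglass architecture, and the image sizes, iteration count, and learning rate are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A randomly initialized CNN is fitted to ONE image: its output, when
# downsampled, must match the low-resolution input.
torch.manual_seed(0)
low_res = torch.rand(1, 3, 16, 16)   # the image we want to super-resolve
z = torch.rand(1, 3, 32, 32)         # fixed random input code

net = nn.Sequential(                 # toy generator, not the real one
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

initial_loss = None
for step in range(200):
    opt.zero_grad()
    high_res = net(z)                # candidate 32x32 image
    loss = F.mse_loss(F.avg_pool2d(high_res, 2), low_res)
    if initial_loss is None:
        initial_loss = loss.item()
    loss.backward()
    opt.step()

final_loss = loss.item()
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Stopping the loop early, before the network memorizes noise and pixelation, is what produces the smooth, natural-looking output.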

We highly recommend this talk by Dmitry Ulyanov (who was the main author of the Deep Image Prior paper) to understand the above concepts in depth.








The super-resolution process of Deep Image Prior



Decrappify

In contrast with the previous model, this one aims to learn as much as possible about satellite images beforehand. As a result, when we give it a low-quality image as input, the model can bridge the gap between the low- and high-quality versions by using its knowledge of the world to fill in the blanks.

The model has a U-Net architecture with a pre-trained ResNet backbone. But the part that is really interesting is the loss function, which has been adapted from this paper. The objective of this model is to produce an output image of higher quality, such that when it is fed through a pre-trained VGG16 model, it produces minimal ‘style’ and ‘content’ loss relative to the ground-truth image. The ‘style’ loss is relevant because we want the model to create a super-resolution image with a texture that is realistic for a satellite image. The ‘content’ loss is responsible for encouraging the model to recreate intricate details in its higher-quality output.




More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up global collaboration.

A Practical Guide for Creating A Quality Satellite Imagery Dataset for Agricultural Applications


By Łukasz Murawski, Alexander Epifanov, and Jayasudan Munsamy.

This article is the result of working in Omdena’s AI project to estimate crop yield and crop classification with the UN World Food Program in Nepal by using satellite imaging analysis. The problem was tough, challenges were huge, and resources scarce. Still, a community of 36 collaborators managed to build a solution with 89% accuracy. This article focuses on the dataset creation.


The characteristics of a ‘good’ dataset

‘Rubbish in, rubbish out’ is a popular phrase referring to poor ML model results caused by poor-quality datasets.

No matter how good a model architecture is, it’s useless if not trained on appropriate, good data. At the same time, the opposite holds true — sometimes even a basic model architecture is enough to get the job done if fed with a proper training dataset.

And while so much has been written about ML model building, we think dataset preparation has been somewhat neglected, which is understandable: it isn’t the sexiest of subjects and takes forever to get right. Unfortunately, our path to realizing the importance of creating a good dataset was painful and agonizing.

So what really is a ‘good’ dataset? Every project is different and has unique requirements for its dataset.

Let’s start with the characteristics of the expected outcome — our ML model. A good model is one that achieves the best results on a variety of input data; we say the model generalizes well and call it ‘robust’.

In short, to achieve this, we need to train the model with high quality, precisely labeled datasets representing the full spectrum of input possibilities. The importance of data in ML can be understood from the fact that ‘typically 80% of an ML project is spent on data — analysis, gathering & engineering’.

The key characteristics of a good dataset are listed below:

  • Data distribution — data should cover all or most of the possible spectrum of the input
  • Data coverage — every class should have enough representation in the dataset
  • Data accuracy — data should be highly relevant to the task at hand and be as close as possible to that used for inference, in terms of quality, format, etc.
  • Feature engineered — data should enable the ML model to learn what we intend it to learn (appropriate features)
  • Data transformation — almost always data acquired cannot be used as-is and an appropriate data transformation pipeline can simplify the model architecture
  • Data volume — depending on whether the ML model is built from scratch or learning transferred from another model, availability of data is critical
  • Data split — typically data is split into 3 chunks: training (75%), validation (15%) & test (10%) and it’s important to ensure there is no ‘duplicate/same’ data across these chunks and the samples are distributed properly
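
One way to enforce the ‘no duplicate/same data across chunks’ rule is to split deterministically on a stable sample ID (for example, a crop-field id) rather than shuffling rows. A hedged sketch — the 75/15/10 ratios mirror the list above, while the hashing scheme is our own illustration:

```python
import hashlib

def assign_split(sample_id, train=0.75, val=0.15):
    """Deterministically assign a sample to train/val/test.

    Hashing a stable ID instead of random shuffling guarantees
    that the same field — and hence any duplicate or temporal
    copies of it — always lands in the same chunk.
    """
    digest = hashlib.md5(str(sample_id).encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** 32  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"
```

Because the assignment depends only on the ID, re-running the pipeline (or adding new samples later) never moves an existing field between chunks.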


One of the key challenges with ML projects is that the exact requirements for data are NOT known at the time of data analysis & gathering; sometimes they become known only after the model is built and its shortcomings are understood. So, an iterative approach to creating and refining datasets in order to improve model metrics is a safe bet.



The Solution: Seven recommendations for creating satellite imaging analysis datasets for crop classification


Earth seen through satellite

Photo by NASA on Unsplash


Satellite images can be in visible colors (RGB) or in other spectra, i.e. data within specific wavelength ranges across the electromagnetic spectrum, like near-infrared. There are also elevation maps, usually made from radar images, which can be used to estimate vegetation growth rate, etc. Normally, interpretation is conducted using specialized remote sensing software, but advances in AI have made autonomous, large-scale analysis of satellite imagery possible for tasks such as crop classification. We have listed some of the key points to consider.

  • Ground truth data — know the ground truth
  • Source of satellite images — decide the source
  • Spatial distribution — know the terrain
  • Temporal distribution — know the crop growth cycle
  • Image quality — know what’s in the images
  • Vegetation indices — know the right indices
  • Labeling and Masking — know what is what & where in the images


#1 Ground truth data — know the truth

Almost every stage of the project depends on the Ground Truth (GT) data provided. In satellite imaging analysis for crop classification, it should contain all the details about the crop fields that help identify them individually, so the information can be fed to ML models via appropriate datasets for correct feature extraction. In many cases, GT data is a file containing the required information gathered during field surveys. Usually, it’s a simple spreadsheet filled with a full variety of field information that can be directly used for labeling. But in practice, we may not have all the required information and should pay close attention to the content of the file, for two main reasons:

  • Acquiring comprehensive field survey results is expensive, if possible at all
  • Preparing proper Ground Truth data file requires an understanding of all the details required for creating robust ML models

In short, the goal of the Ground Truth data should be to provide complete, well-balanced, and properly distributed data. It should also serve as a reference on how to recognize different objects of interest to facilitate complete and reliable labeling.

It should have the following characteristics:

  • Contain all the required details of the data (crop parcels, in the agriculture case) — for example, in the case of crop type identification, GT should specify crop field id, dimensions, GPS location, shape, size, crop cultivated, seasonal crops for the field, crop cycle details, land use patterns, etc.
  • A well-balanced number of classes (Ex: equal or similar number of samples for each class)
  • Well distributed spatially (various terrains in area of interest) and temporally (covering various time periods like seasons/crop cycles, etc.), representing the full range of possible scenarios
  • Number of data points must exceed the expected number of images since for some/many of them there won’t be good satellite images (due to weather conditions, pollution, etc.)
  • Should contain samples from a few different years, so in case a year’s satellite images cannot be used due to bad weather conditions or other reasons, we can use images from other years mentioned in the ground truth data to compensate for the data loss
  • Should highlight periods when objects of interest (say different crops) are easiest to be recognized. For example, months when the crops are in the fully grown stage to help in easy labeling (from there, we can easily propagate masks knowing the vegetation specifics)
  • Should include examples of similar kinds of crops/vegetation (not just visually but also in terms of VIs, if possible) in the regions around the area of interest. For example, if rice fields are the class of interest, examples of grass, cornfields, etc. in the surrounding area which look similar to rice fields should also be included

The key point to remember is that Ground Truth data quality will have a big impact on the dataset created, labeling/masking done on a dataset, and ultimately results of the solution.


#2 Source of satellite images — decide the source

With the advent of satellites launched by many countries and private organizations, satellite imaging analysis has become more accessible to the general public for a variety of applications — in our case, crop classification. Some of the more popular programs are Landsat (by USGS & NASA, 30m resolution since the early 1980s), MODIS (by NASA, near-daily satellite imagery of earth in 36 spectral bands since 2000), Sentinel (by ESA; Sentinel-2 provides imagery at a 5-day revisit frequency in 13 spectral bands since 2015) and ASTER (by NASA, detailed maps of land surface temperature, reflectance, and elevation).


Organizations selling satellite imagery

Several private organizations sell raw & processed satellite imagery, customized as required by customers. A few popular ones are:

  • GeoEye — since September 2008, images with a ground resolution of 0.41 meters (16 inches) in the panchromatic (black-and-white) mode, plus multispectral (color) imagery at 1.65-meter resolution (about 64 inches)
  • DigitalGlobe — panchromatic imagery at 0.46 m & 0.6 m spatial resolution, plus images at 0.31 m spatial resolution
  • OneAtlas platform — by Airbus, optical & radar Earth observation
  • Spot Image — part of Airbus, images at 1.5 m for the panchromatic channel and 6 m for multispectral, with products down to 0.50 meters (about 20 inches)
  • ImageSat International — operator of the “EROS” satellites; images can be used for mapping, border control, infrastructure planning, agricultural monitoring, environmental monitoring, disaster response, training, simulations, etc.

Key decision points for choosing the satellite imaging source, for crop classification:

  • Raw or processed datasets — we can’t use raw satellite images directly, and processing satellite images for crop classification is an involved activity requiring various tools; so processed images, readily available as part of curated datasets, are a good choice to start with
  • Image quality — sharp images with clear differentiation of the objects we are interested in are critical: the higher the resolution, the better the results with the ML model
  • The spatial resolution of images — it’s the area on the ground covered by a single pixel in the satellite image: the smaller that ground distance (commercial products go down to 15 cm), the better. A few organizations improve the apparent spatial resolution of final images by applying scaling techniques, which can sometimes degrade image quality, so watch out (ex: some bands of the Sentinel2 dataset are resampled/scaled to a constant Ground Sampling Distance depending on the native resolutions of the bands, and hence can have a spatial resolution of 10 m, 20 m or 60 m, but the corresponding images will be of lower quality due to resampling)
  • Free or paid — may seem like an easy choice, but various aspects of the provided datasets (quality, processing done, completeness of data, etc.) depend on the effort spent by the provider: the source of images to be used at inference time and the accuracy/other metrics required of the ML models should mainly drive the decision
  • Temporal images coverage — depending on where, when & purpose of the satellites that were launched, imagery may be available only for certain geography and period (ex: Sentinel2 Level1C dataset has images only from June 2015 onwards): temporal data required for the task will help to decide
  • Spectral bands to use — depending on the sensors in satellites, various spectral data will be available in images (ex: Sentinel2 imagery contains 13 spectral bands): usage of various vegetation indices for the type of remote sensing task will help to decide
  • Number of images per day/week/month — depending on the frequency of the satellite orbiting over the area of interest, the number of images may vary (ex: Sentinel2 passes over a location every 5 days, so in a month there will be 4 to 6 images of a particular location): the volume of images required will help to decide
  • Image processing done — image quality decreases with factors like cloud cover, haze, pollution, etc., and many organizations apply processing techniques to remove such distractions from images: the image quality requirements for the task at hand will help to decide
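
A back-of-the-envelope helper for the image-volume points above — the revisit period and clear-sky probability below are illustrative assumptions, since real cloud statistics vary by region and season:

```python
def expected_usable_images(days, revisit_days=5, clear_sky_prob=0.5):
    """Rough estimate of usable satellite scenes over a period.

    revisit_days: orbit frequency over the area of interest
    (e.g. ~5 days for Sentinel-2). clear_sky_prob: assumed
    fraction of passes not ruined by cloud/haze (illustrative).
    Returns (total passes, expected usable scenes).
    """
    passes = days // revisit_days
    return passes, round(passes * clear_sky_prob)
```

This is why the ground truth should contain more data points than the number of images you expect to use: with a 5-day revisit and half the passes clouded out, a month yields only around three usable scenes per location.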


#3 Spatial distribution — know the terrain


Nepal’s agro-ecological zones in different terrains



While RGB bands in satellite images can show the crop fields, the terrain of these fields also plays an important role. For example, crop fields in plains tend to be large, with more regular shapes, and similar crops are usually in the neighborhood; in hilly areas crop fields tend to be small, with varied shapes & altitudes, and a mix of other vegetation may surround the fields; similarly, in forest areas crop fields tend to be surrounded by thick trees without a clear visual representation of the fields and their borders. Hence, understanding the various terrains the crop fields are in becomes important.

The recommendation is to look at satellite images from different time periods/seasons to understand the terrain of the area of interest, include images from various terrains in the dataset, consider the challenges with images from different areas while labeling/masking, and address those challenges as much as possible with appropriate labeling.


#4 Temporal distribution — know the growth cycle


Photo of corn fields

Photo by Kai Pilger on Unsplash


Satellite images from all the months covering various growth stages of crops should be added to the dataset. These images will help the ML model to generalize well and be able to accurately identify crops irrespective of the growth stage of crops. A better understanding of the crops’ growth cycle and seasonal crops cycle can help to find satellite images of crops at different stages of growth.

While looking for temporal data, there is a possibility that a few months in a specific year do not have any images due to bad weather or climatic conditions. In such cases, consider choosing images from other years for these months. An important assumption that needs to be validated here is that the crop cycle & seasonal crops for those years are the same as the year with ground truth. In some cases, the same crop can have a different growth cycle in different regions.

In our case with Nepal, there were 10 to 12 varieties of rice widely adopted by farmers, having two main growing seasons depending on rice variety: 1) Spring rice (February/March to June/July): Chaite 2, Chaite 4, Ch 45, Bindeswar, etc. and; 2) Main season rice (June/July to October/November): Mahsuri, Savitri, etc. (Source).

Another important point to consider regarding temporal data is the land use pattern of cultivated fields. Though there might be a defined crop cycle for each crop, there can be scenarios where the same crop fields are used for different crops in different seasons (ex: seasonal short-term crops may be cultivated in the same fields after the main crop’s harvest and before the next sowing). Missing satellite images from the months representing seasonal crops will result in an incomplete dataset and, in turn, inaccurate ML models.

Talking to the subject matter experts/farmers in the area of interest to understand the temporal data to be captured is critical at this stage of dataset creation.
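
The fallback-to-other-years idea from this section can be sketched as a simple lookup — a toy illustration that assumes, as noted above, that the crop cycle in the fallback years matches the ground-truth year:

```python
def pick_images(needed_months, available, primary_year, fallback_years):
    """For each month in the crop cycle, pick a scene from the
    ground-truth year if one exists, otherwise fall back to
    another year (valid only under the assumption that the crop
    cycle is the same in that year).

    available: set of (year, month) pairs with usable imagery.
    Returns {month: year_chosen}; a month is absent if no year
    has a usable scene for it.
    """
    chosen = {}
    for month in needed_months:
        for year in [primary_year, *fallback_years]:
            if (year, month) in available:
                chosen[month] = year
                break
    return chosen
```

For example, if the July 2019 scenes are all clouded out but July 2018 is clear, the 2018 scene fills the gap while the rest of the cycle stays in 2019.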


#5 Image quality — know what’s in the images

Satellite images are relatively large, and processing them is time-consuming. Depending on the sensor from which the imagery was created, appropriate preprocessing is required before consuming the images. In addition, weather (rain, clouds, etc.) and environmental (pollution, haze, etc.) conditions can affect image quality. For these reasons, publicly available satellite image datasets are typically preprocessed for visual or scientific commercial use by third parties.

Just like any other digital image, the resolution of satellite images is critical for the purpose and varies depending on the instrument used and the altitude of the satellite’s orbit. There are four types of resolution when discussing satellite imagery in remote sensing: spatial (pixel size of an image representing the size of the surface area being measured on the ground), spectral (wavelength interval size and number of intervals), temporal (amount of time/days that passes between imagery collection periods for a given surface location) and radiometric (levels of brightness/contrast).


Satellite image spatial resolution vs quality



Though there are open satellite imagery datasets available to the public free of cost, quality images for specific purposes like crop growth detection, crop classification, crop type identification, etc. are expensive. The higher the quality, the higher the cost. As simple as that. But high-quality images are not always required. For example, for a project with the objective of identifying building structures in satellite images, we may not require images with spatial resolution as fine as 30 cm or so. So, there’s no standard guideline or single rule suggesting the minimum or maximum image quality required for a project. It all depends on the objective of the project. However, there are two main factors which need to be considered before deciding on the quality of images to use:

  • ML model’s accuracy/performance metrics — will the chosen image quality help meet the performance requirements of the project?
  • Data labeling potential — will the chosen image quality be good enough for labeling, given the kind of objects to be identified in the images?

The decision on point 1 is quite obvious — we can test the ML models using images of different quality and choose the lowest quality that meets the project’s requirements/metrics. The decision on point 2 is more subjective in terms of what objects are to be identified in the images and should be agreed upon at the beginning of the project by carefully assessing the labeling capabilities. But the image quality must be good enough for labeling, so people can easily see and draw masks around all objects/classes of interest. Solutions that require small objects to be identified, where distinguishing edges is more important than counting overall coverage, require very high-quality images. That was the case with Omdena’s Trees Recognition Challenge, where the goal was to identify trees close to electricity lines in order to prevent power outages and fires. Here, extremely accurate masking, close to the trees’ edges, was necessary. For that, pictures with 0.5 m spatial resolution had to be used. Thus, not only trees but also little bushes and shadows had to be precisely annotated. And that paid off. With only 150 original images and very basic transformations, using a Deep UNet model, the team achieved around 95% accuracy.


From trees identification project at Omdena


For our crop identification project in Nepal, such a high resolution was not necessary, as crop field shapes were mostly regular (except for the ones in hilly areas), with borders being mostly straight lines. So, in this project, the ability to distinguish similar objects (like rice vs. grass) at the labeling stage was the main factor in deciding on image quality. We ended up using Sentinel2 Level-1C satellite images with 13 spectral bands from the Copernicus program (European Space Agency), with a maximum spatial resolution of 10 m per pixel for certain bands. Unfortunately, it turned out that the maximum zoom level at which we could get clear RGB images was only 100 m. And that zoom level was not good enough for labeling, since the crop field areas appear too small in the agricultural setting, as seen in the images below (actual images used were 500×500 pixels, and yet many fields appeared too small/unclear to recognize).



Crop fields satellite images (left: rice, right: wheat) at 100m zoom level


Since data from the RGB channels alone is not enough for our AI model to identify crops, we could use data from the other spectral bands in Sentinel2 imagery to calculate different vegetation indices, include them as additional image channels, and then train ML models on that dataset. As suggested in this paper, the dataset can be based on images with other bands, including RGB, and appropriate vegetation indices.

However, without good RGB images, the problem of proper labeling/masking persists. And it seems that there are basically very few options as listed below:

  • Assuming the final solution can’t be based on commercially obtained high-quality satellite imagery, such imagery might still be necessary at least for one-time masking/labeling at the model creation phase. Those masks can then be used to train the model using multi-spectral bands and appropriate calculated vegetation indices as image channels for crop classification.
  • Make sure the ground truth data defines the dimensions of every field precisely by using GPS coordinates to help labeling/masking team mark crop fields accurately.
  • Record field boundaries with GIS software by physically being in the field with teams of people carrying GPS devices, then draw masks on the satellite images from those coordinates.


#6 Vegetation indices — know the right indices

As seen in the above section, for specialized tasks like differentiating vegetation types, it is necessary to analyze data contained in spectral bands of satellite images (ranging from 3 to 16 bands) beyond just the RGB bands. This is where Vegetation Indices (VIs) play a critical role. Vegetation Indices are combinations of surface reflectance at two or more wavelengths designed to highlight a particular property of vegetation.

They are derived using the reflectance properties of vegetation. Each of the VIs is designed to accentuate a particular vegetation property. Satellite images from the various organizations have a varying number of spectral bands containing data useful for VI calculations. For example, Sentinel 2 satellite imagery from ESA is a wide-swath, high-resolution, multi-spectral imaging mission supporting land monitoring studies — vegetation, soil, water cover, inland waterways & coastal areas — and has 13 spectral bands containing various top-of-atmosphere (TOA) reflectance levels which can be used for a variety of VI calculations.

More than 150 VIs have been published in scientific literature, but only a small subset have a substantial biophysical basis or have been systematically tested. Many tools/platforms provide support for calculating various VIs. For example, the SNAP platform part of Sentinel Toolboxes supports around 21 VI calculations and the ENVI Image Analysis platform provides 27 VIs to use to detect the presence and relative abundance of pigments, water, and carbon as expressed in the solar-reflected optical spectrum (400 nm to 2500 nm). VIs can be broadly categorized into the following groups — Broadband Greenness, Narrowband Greenness, Light Use Efficiency, Canopy Nitrogen, Dry or Senescent Carbon, Leaf Pigments, and Canopy Water Content.

The table below shows a few vegetation indices applicable to differentiating rice, wheat & other crops with respect to the Sentinel2 Level1C dataset over Nepal’s specific cultivation areas. The analysis was done based on data from 30+ sample crop fields of each category. Unfortunately, VI values for crops vary from region to region based on climatic conditions, temperature, soil conditions, etc., so an analysis done for crops in one region cannot be reused for the same crops in another region; the values need to be calculated for each region/area of interest.


Vegetation Indices analysis for crops (rice and wheat) differentiation


Specifically in the case of ‘crop type identification’, calculating the chosen vegetation indices, adding them as bands/channels in the image dataset, and then feeding them to ML models will help the model learn correlations between crop type and vegetation indices, and ultimately identify different crop types with high accuracy. So, do not forget to explore the various VIs suitable for the task at hand.
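
As a minimal sketch of this idea, here is how one might compute NDVI from Sentinel-2 style band arrays and stack it as an extra channel — the dictionary-of-bands layout and the simple RGB+NDVI stack are our own illustration, not a prescribed format:

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index.

    For Sentinel-2 Level-1C, NIR is band B8 and red is band B4
    (both top-of-atmosphere reflectance). Values near +1 indicate
    dense healthy vegetation; bare soil sits near 0.
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero

def stack_with_indices(bands):
    """Append a computed index as an extra channel so the ML model
    can learn crop-type/VI correlations directly.

    bands: dict of 2-D arrays keyed by Sentinel-2 band name
    (assumed layout). Returns an (H, W, 4) array: R, G, B, NDVI.
    """
    layers = [bands["B4"], bands["B3"], bands["B2"],
              ndvi(bands["B8"], bands["B4"])]
    return np.stack(layers, axis=-1)
```

Other indices from the groups listed above would be added the same way, as further channels.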


#7 Labeling/ Masking — know what & where

Depending on the ML model’s task — classification or segmentation — the objectives of labeling change:

  • For classification — diversified dataset of pictures labeled with one class per image
  • For segmentation — every parcel belonging to each class of interest needs to be annotated/masked in every image, along with other objects around them

Since labeling for a segmentation task is more demanding, here are a few points to get it right:

  • Every crop field belonging to every class of interest should be masked. Omitting/mislabeling and not being accurate enough will result in false negatives/positives at the model training stage and will impact the model performance
  • Labeling needs to be accurate with regards to objects of interests in the image — say crop field/parcel borders: ideally, every parcel should have its own, clearly outlined border
  • If labeling is required for temporal data (satellite images of a location taken at various time intervals, like week, month, or year), as in a crop type identification or crop classification project, we can manually label only the images in which the objects are at their easiest-to-recognize stage. That means, in a crop type identification project, we can label only images with crops at the fully grown stage. Since the locations of the fields don’t change across temporal images, masks from those images can be propagated to the other temporal images, saving a lot of time & effort
  • To ensure our model learns to differentiate crop types cultivated at the very same fields during different seasons (Ex: rice-wheat-fallow, rice-winter maize-fallow), it is important to annotate not only the main seasonal crops but also samples of other crops cultivated in the same field
  • Labeling should enable ML models to differentiate objects of interest in images from ones that look similar to them. So, it is a good idea to have samples of similar objects in the dataset and label them appropriately. For example, grass & little bushes might look similar to a rice field in RGB images, so creating a separate label category for these similar-looking objects is a good idea
  • Similarly, the areas/environments surrounding the main object of interest in an image should also be labeled to help highlight differences. For example, in a crop identification project, images with rice fields near river banks or on a river bed should be annotated with the river and riverbank as well. Since a rice field in roughly its first 1 to 5 weeks will be filled with water, the model should be able to differentiate such rice fields from an actual river or pond nearby.
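
The mask-propagation trick from the list above can be sketched as follows — a simplified illustration that assumes field boundaries stay fixed across acquisition dates:

```python
def propagate_masks(labeled, all_dates):
    """Copy masks drawn on the easiest-to-recognize date to every
    other acquisition of the same tile.

    labeled: {tile_id: mask} drawn when crops are fully grown.
    all_dates: list of (tile_id, date) pairs, one per temporal image.
    Works because field locations do not move between dates.
    Returns {(tile_id, date): mask}.
    """
    return {(tile, date): labeled[tile]
            for tile, date in all_dates if tile in labeled}
```

One manually drawn mask per tile then covers the whole time series, which is where most of the labeling time is saved.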





Tackling Energy Poverty in Nigeria Through Artificial Intelligence


Can AI help to address energy poverty in Nigeria where more than 100m people lack stable access to electricity?


By Laura Clark Murray 


A staggering 1 billion people on Earth live in energy poverty

Without stable access to electricity, families can’t light their homes or cook their food. Hospitals and schools can’t dependably serve their communities. Businesses can’t stay open.

Energy poverty shapes and constrains nearly every aspect of life for those who are trapped in it. As the Global Commission to End Energy Poverty puts it, “we cannot end poverty without ending energy poverty.” In fact, energy poverty is considered to be one of humanity’s greatest challenges of this century.

In Nigeria, Africa’s most populous country, more than half of the 191 million citizens live in energy poverty. And though governments have been talking for years about extending national electricity grids to deliver energy to more people, they’ve made little progress.


With such a vast problem, what can be done?

Rather than focusing on the national electricity grid, Nigerian non-profit Renewable Africa 365, or RA365, is taking a different approach. RA365 is working with local governments to install mini solar power substations, known as renewable energy microgrids. Each microgrid can deliver electricity to serve small communities of 4,000 people. In this way, RA365 aims to address Nigerian energy poverty community-by-community with solar installations.

To be effective, RA365 needs to convince local policymakers of the potential impact of a microgrid in their community. For help they turned to Omdena. Omdena is a global platform where AI experts and data scientists from diverse backgrounds collaborate to build AI-based solutions to real-world problems. You can learn more here about Omdena’s innovative approach to building AI solutions through global collaboration.


Which communities need solar microgrids the most?

Omdena pulled together a global team of AI experts and data scientists. Working collaboratively from remote locations around the globe, the team set about identifying the regions in Nigeria where the energy poverty crisis is most dire and where solar power is likely to be effective. 

To determine which regions don’t have access to electricity, the team looked to satellite imagery for the areas of the country that go completely dark at night. Of those locations, they prioritized communities with large populations that incorporate schools and hospitals. The collaborators also looked at the distance of those communities from the existing national electricity grid: in reality, if a community is physically far from the existing grid, it’s unlikely to be hooked up anytime soon. In this way, by combining the satellite data with population data, the team identified the communities most in crisis.
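
As a rough illustration of that selection logic — the field names, thresholds, and weights below are hypothetical, not the team’s actual model:

```python
def rank_communities(communities, grid_distance_cutoff_km=10.0):
    """Illustrative prioritization of communities for microgrids.

    Each community dict (hypothetical schema) carries: population,
    night_lights (mean radiance, ~0 means dark at night),
    has_school, has_hospital, grid_distance_km. Communities that
    are already lit, or close enough to the grid to be connected
    soon, are skipped.
    """
    candidates = [c for c in communities
                  if c["night_lights"] < 0.1
                  and c["grid_distance_km"] > grid_distance_cutoff_km]
    # Larger communities hosting schools/hospitals rank first;
    # the bonus weights are arbitrary placeholders.
    return sorted(candidates,
                  key=lambda c: (c["population"]
                                 + 2000 * c["has_school"]
                                 + 3000 * c["has_hospital"]),
                  reverse=True)
```

A real pipeline would derive `night_lights` from nighttime satellite composites and `grid_distance_km` from grid infrastructure maps, then feed the ranking into the interactive map described below.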

In any machine learning project, the quality and quantity of relevant data is critical. However, unlike projects done in the lab, the ideal data to solve a real-world problem rarely exists. In this case, available data on the Nigerian population was incomplete and inaccurate. There wasn’t data on access to the national electricity grid. Furthermore, the satellite data couldn’t be relied upon. Given this, the team had to get creative. You can read how our team addressed these data roadblocks in this article from collaborator Simon Mackenzie.


What’s the impact?

The team built an AI system that identifies regional clusters in Nigeria where renewable energy microgrids are both most viable and likely to have high impact on the community. In addition, an interactive map acts as an interface to the system.


Heatmap with the most suitable spots for solar panel installations

RA365 now has the tools it needs to guide local policymakers towards data-driven decisions about solar power installation. What’s more, they’re sharing the project data with Nigeria Renewable Energy Agency, a major funding source for rural electrification projects across Nigeria. 

With this two-month challenge, the Omdena team delivered one of the first real-world machine learning solutions to be deployed in Nigeria. Importantly, our collaborators from around the globe join the growing community of technologists working to solve Nigeria’s toughest issues with AI.

Ademola Eric Adewumi, Founder of Renewable Africa 365, shares his experience working with the Omdena collaborators here. Says Adewumi, “We want to say that Omdena has changed the face of philanthropy by its support in helping people suffering from electrical energy poverty. With this great humanitarian help, RA365 hopes to make its mission a reality, bringing renewable energy to Africa.”



AI Applied: Providing Communities in Nigeria With Solar Energy


Energy grid analysis and AI to identify sites that are most suitable for solar panel installation across Nigeria.


By Omdena Collaborator Simon Mackenzie


Giving access to renewable energy in places most at need has the potential to solve many of the most pressing problems in today’s world.


The UN has identified a number of goals to make the world a better place.


The power of building AI through collaboration

Imagine if we could focus our AI efforts to help solve the biggest problems in the world — Poverty, Hunger, Health, Education.

As you are reading this, you are probably interested in AI. Perhaps, like me, you have completed courses in machine learning; put inordinate effort into a Kaggle project to reach the top 50; made hobby projects to generate Twitter posts in the style of Trump or to tell the difference between an elephant and a giraffe. Perhaps you have a career in data science, targeting advertising or building a chatbot for customer service.

And perhaps, like me, you find these intellectually stimulating but feel that something is missing.

Surely there is a way to use AI to make a more positive impact on the world?

I recently joined an Omdena Challenge with a multi-national team to do just that. In the two-month AI challenge, hosted with Nigeria based NGO Renewable Africa, we focused on providing affordable, clean energy, which is a fundamental need in developing countries.


The problem: 100m people without electricity

Nigeria has a population of 200 million, yet only half have access to electricity. Without electricity, there are no computers and no internet. There are no fridges to keep food fresh. There is no electric water pump. There is nowhere to charge a mobile phone.

Schools and hospitals struggle to provide basic services. Widening electricity access is an essential first step for improving education, healthcare, and local economies.

Centralized planning for electricity in Nigeria has failed. The government has built failure into the system by fixing prices and profits; corruption is widespread, and banks will not lend to new power plants. Meanwhile, half the existing plants lie idle and the rest operate below capacity.

Millions live under the grid but are not connected to it. Some were once connected, but equipment failed and was never replaced; some have electricity but only enough for a light bulb; others have a connection that is unreliable due to daily power cuts.

Can AI help to make a difference in Nigeria?


The solution



Off-grid solar energy puts local people in control


One solution to this problem is localized, off-grid, renewable energy in the form of solar panels servicing small communities of up to 4,000 people. This protects against any single point of failure; puts the power in the hands of local people; and is more easily financed as it requires less capital and has faster returns.

The first step in implementing this is to prioritize where to put the panels. It would take 25K+ panels to provide a minimal electricity supply to everyone in Nigeria. We can apply data science to ensure the available funding is used efficiently to provide electricity to as many people as possible.


AI in Nigeria: Overcoming challenges

A major part of the challenge was to find, validate, and label the data. There is no shortage of data, but what is out there is not always accurate or fit for purpose.

Other challenges were to develop skills in analyzing geographic sources that were at varying resolutions and to find ways to present these in a simple way.

Nobody on the project had strong prior experience in these areas, so it was a learning experience. Modeling was less of an issue for this project because there were many existing models that we could adapt. Some of the data issues are described below.


Identifying the demand

Nigeria can be split into 775 census areas, but it is a huge country, so a single census area can cover 1,000 km². Within that area, we need to know whether the people are concentrated in one town or scattered across hundreds of villages. The typical approach is to use satellite images to classify land use and then allocate the census numbers to buildings by applying estimates of relative density. This can be enhanced with a random forest model that incorporates a multitude of other variables.
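The allocation step can be sketched as simple dasymetric mapping: spread a census area's total across settlement cells in proportion to built-up area weighted by an assumed relative density per land-use class. The class weights, cell names, and numbers below are illustrative assumptions, not the project's actual figures.

```python
# Assumed relative density weights per land-use class (illustrative only).
DENSITY_WEIGHTS = {"urban": 10.0, "suburban": 4.0, "village": 1.0}

def allocate_population(census_total, cells):
    """Allocate a census-area total across settlement cells.

    cells: list of (cell_id, land_use_class, built_area_km2) tuples.
    Returns {cell_id: estimated_population}.
    """
    # Weight each cell by built-up area times its class density weight.
    weights = {cid: area * DENSITY_WEIGHTS[cls] for cid, cls, area in cells}
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {cid: 0.0 for cid in weights}
    return {cid: census_total * w / total_weight for cid, w in weights.items()}

# Hypothetical cells inside one census area with a reported total of 40,000.
cells = [
    ("town_centre", "urban", 2.0),   # 2 km2 of dense urban fabric
    ("outskirts", "suburban", 5.0),  # 5 km2 of lower-density sprawl
    ("hamlet_a", "village", 1.0),
]
estimates = allocate_population(40_000, cells)
```

A random forest version would replace the fixed class weights with densities predicted from many covariates, but the proportional-allocation skeleton stays the same.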

It sounds great in theory but in practice, the numbers just look wrong.

Furthermore, there is a question mark over the accuracy of the census itself. It is used by the central government to allocate funding, so local officials have an incentive to inflate the numbers. For the same reason, we cannot even rely on the total population being 200 million! At best this is an estimate, and at the local level it could be completely wrong.


The solution

A recent healthcare project found that none of these models worked on the ground. So it created a new, bottom-up statistical model based on areas where the population was known, then expanded this via micro-surveys.

This looks much more realistic.


Where to supply the electricity?


Satellite images show the existing electricity. The grid has been mapped using machine learning.


To find target sites, we needed to exclude those that already have electricity. In addition, sites close to the grid were given low priority, as they are more likely to receive a connection directly in the future. The volume of available, free satellite data is incredible. In particular, there are night-time light images that clearly show which towns have light.

But how do we validate that? Here we can leverage the magic of Google Maps. I find it awesome to be able to zoom in on a road in Nigeria and see whether it has street lamps. Based on a selection of test towns, it was possible to calibrate and validate the data from the satellite images.
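One way to sketch that calibration step: take a handful of ground-truth towns (checked by eye on Google Maps), pick the night-light radiance cutoff that best separates lit from unlit towns, then apply it to unvalidated settlements. The radiance values and settlement names below are invented for illustration.

```python
def best_threshold(labelled):
    """Pick the radiance cutoff that best separates lit from unlit towns.

    labelled: list of (radiance, has_electricity) pairs.
    Tries each observed radiance as a candidate cutoff and keeps the one
    with the most correct classifications.
    """
    candidates = sorted({r for r, _ in labelled})

    def score(t):
        # Count towns where "radiance >= cutoff" matches the ground truth.
        return sum((r >= t) == lit for r, lit in labelled)

    return max(candidates, key=score)

# Ground-truth towns validated manually (values are made up).
validation = [
    (0.2, False), (0.5, False), (1.1, False),
    (4.8, True), (9.3, True), (15.0, True),
]
threshold = best_threshold(validation)

# Flag unvalidated settlements as likely electrified.
settlements = {"settlement_a": 0.3, "settlement_b": 7.5}
electrified = {name: rad >= threshold for name, rad in settlements.items()}
```

In practice the radiances would be sampled from night-time light rasters at settlement coordinates; this sketch only shows the thresholding logic.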

For the electricity grid, you may think the government and electricity companies would know where their cables are. But they don’t!

Fortunately, we could leverage an existing model to identify electricity cables that used a combination of machine learning on satellite images and human checking.



Our outputs


Clusters of 4,000 people without electricity and more than 15 km from the grid


Cluster analysis was applied to the population data to identify groups of 4,000+ people within a small radius, filtering out those that already had electricity or were close to the existing grid.

These candidate clusters were then combined with other data, such as solar irradiance and the locations of health and education establishments. This produced a map of potential sites together with a spreadsheet ranking the opportunities.
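The filtering and ranking step can be sketched as follows: keep candidate clusters with at least 4,000 people and more than 15 km from the grid, then sort them by a simple priority score. The scoring weights and cluster records are illustrative assumptions, not the project's actual criteria.

```python
MIN_POPULATION = 4_000
MIN_GRID_DISTANCE_KM = 15.0

def rank_sites(clusters):
    """Filter and rank candidate clusters.

    clusters: list of dicts with keys
      name, population, grid_km (distance to the grid),
      irradiance (kWh/m2/day), facilities (nearby schools + clinics).
    """
    eligible = [
        c for c in clusters
        if c["population"] >= MIN_POPULATION
        and c["grid_km"] > MIN_GRID_DISTANCE_KM
    ]

    # Illustrative score: more people, more sun, and more nearby
    # facilities all increase priority.
    def score(c):
        return c["population"] / 1_000 + c["irradiance"] + 2 * c["facilities"]

    return sorted(eligible, key=score, reverse=True)

# Hypothetical candidate clusters.
candidates = [
    {"name": "A", "population": 6_500, "grid_km": 40, "irradiance": 5.8, "facilities": 3},
    {"name": "B", "population": 4_200, "grid_km": 22, "irradiance": 6.1, "facilities": 1},
    {"name": "C", "population": 3_000, "grid_km": 50, "irradiance": 6.0, "facilities": 2},  # too few people
    {"name": "D", "population": 9_000, "grid_km": 5, "irradiance": 5.5, "facilities": 4},   # too near the grid
]
ranked = rank_sites(candidates)
```

The ranked list plays the role of the spreadsheet described above: it lets a funder work down from the highest-scoring site until the budget runs out.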

The map + technical explanation can be found here.




The sponsoring NGO Renewable Africa is now able to confidently survey a selection of sites that are suitable for solar panels. In addition, it is sharing the data with Nigeria's Rural Electrification Agency (REA), a major funding source for off-grid, rural electrification projects in Nigeria. The data collected will allow much better targeting, especially outside major towns where government data is lacking.

More refined targeting will enable many more people to get electricity per $ of investment.

This means better healthcare, education and economies; and will potentially improve the quality of life for millions of people.

For the project team, this was an amazing opportunity to learn more about Nigeria, AI, renewable energy, satellite imagery, population modeling, and technical skills in mapping and analyzing geographic data.


Most satisfying was using AI to solve a real problem affecting real people. AI is not about models but about solving problems.


