By Łukasz Murawski, Alexander Epifanov, and Jayasudan Munsamy.
This article is the result of working on Omdena’s AI project with the UN World Food Program in Nepal to estimate crop yield and classify crops using satellite image analysis. The problem was tough, the challenges were huge, and resources were scarce. Still, a community of 36 collaborators managed to build a solution with 89% accuracy. This article focuses on the dataset creation.
Feel free to also read the related article, 10 Hard Lessons Learned For Creating a Dataset in Our Crops Identification Challenge to Fight Hunger.
The characteristics of a ‘good’ dataset
‘Rubbish in, rubbish out’ is a popular phrase referring to poor ML model results caused by poor-quality datasets.
No matter how good a model architecture is, it’s useless if not trained with appropriate good data. At the same time, the opposite holds true — sometimes even the basic model architecture is enough to get the job done if fed with a proper dataset for training.
And while so much has been written about ML model building, we think that dataset preparation has been a bit disregarded and this is understandable because it ain’t too sexy of a subject and takes forever to get it right. Unfortunately, our path to realizing the importance of ‘creating a good dataset’ was painful and agonizing.
So what really is a ‘good’ dataset? Every project is different and has unique requirements for its dataset.
Let’s start with the characteristics of the expected outcome: our ML model. It seems that a good model is one that achieves the best results on a variety of input data. We say that such a model generalizes well, and we call it ‘robust’.
In short, to achieve this, we need to train the model with high quality, precisely labeled datasets representing the full spectrum of input possibilities. The importance of data in ML can be understood from the fact that ‘typically 80% of an ML project is spent on data — analysis, gathering & engineering’.
The key characteristics of a good dataset are listed below:
- Data distribution: data should cover all or most of the possible spectrum of the input
- Data coverage: every class should have enough representation in the dataset
- Data accuracy: data should be highly relevant to the task at hand and be as close as possible to that used for inference, in terms of quality, format, etc.
- Feature engineering: data should enable the ML model to learn what we intend it to learn (appropriate features)
- Data transformation: acquired data can almost never be used as-is, and an appropriate data transformation pipeline can simplify the model architecture
- Data volume: how much data is needed depends on whether the ML model is built from scratch or uses transfer learning from another model; either way, availability of data is critical
- Data split: typically data is split into 3 chunks, training (75%), validation (15%) and test (10%); it’s important to ensure there is no duplicate data across these chunks and that the samples are distributed properly
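One way to keep duplicates out of the three chunks is to split by field rather than by image, so that all images of one field land in the same chunk. The snippet below is only a minimal sketch of that idea: the `field_id` key, the function name and the toy file names are assumptions for illustration, and the 75/15/10 ratios are the ones mentioned above.

```python
import random
from collections import defaultdict


def split_by_field(samples, ratios=(0.75, 0.15, 0.10), seed=42):
    """Split samples into train/val/test so that no field appears in two chunks.

    `samples` is a list of dicts with a hypothetical "field_id" key.
    """
    # Group samples by field so that a field is never divided across chunks.
    by_field = defaultdict(list)
    for sample in samples:
        by_field[sample["field_id"]].append(sample)

    # Shuffle the field ids deterministically, then cut the list by ratio.
    field_ids = sorted(by_field)
    random.Random(seed).shuffle(field_ids)
    n_train = round(len(field_ids) * ratios[0])
    n_val = round(len(field_ids) * ratios[1])
    chunks = {
        "train": field_ids[:n_train],
        "val": field_ids[n_train:n_train + n_val],
        "test": field_ids[n_train + n_val:],
    }
    return {
        name: [s for fid in ids for s in by_field[fid]]
        for name, ids in chunks.items()
    }


# Toy data: 20 fields, 3 images (e.g. different dates) per field.
samples = [
    {"field_id": f, "image": f"field{f}_t{t}.tif"}
    for f in range(20) for t in range(3)
]
splits = split_by_field(samples)
print({name: len(chunk) for name, chunk in splits.items()})
```

Note that the ratios here apply to the number of fields, not images, and the sketch does not balance classes across chunks; a real pipeline would probably also stratify by crop type.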
One of the key challenges with ML projects is that the exact requirements for data are NOT known at the time of data analysis & gathering, and sometimes become known only after the model is built and its shortcomings are understood. So, an iterative approach to creating and refining datasets in order to improve model metrics is a safe bet.
The Solution: Seven recommendations for creating satellite imagery datasets for crop classification
Satellite images can be in visible colors (RGB) and in other spectra, e.g. data within specific wavelength ranges across the electromagnetic spectrum, like Near-Infrared. There are also elevation maps, usually made from radar images, which can be used to estimate vegetation growth rate, etc. Normally, interpretation and analysis of satellite images are carried out using specialized remote sensing software, but advancements in AI have made autonomous, large-scale analysis of satellite imagery, such as crop classification, possible. We have listed some of the key points to consider.
- Ground truth data: know the ground truth
- Source of satellite images: decide the source
- Spatial distribution: know the terrain
- Temporal distribution: know the crop growth cycle
- Image quality: know what’s in the images
- Vegetation indices: know the right indices
- Labeling and Masking: know what is what & where in the images
#1 Ground truth data — know the truth
Almost every stage of the project is dependent on the Ground Truth (GT) data provided. In satellite image analysis for crop classification, it should contain all the details about the crop fields that help to identify them individually, so the information can be fed to ML models via appropriate datasets for correct feature extraction. In many cases, GT data is in the form of a file containing the required information gathered during field surveys. Usually, it’s a simple spreadsheet file filled with a full variety of field information that can be directly used for labeling. But in practice, we may not have all the required information and should pay close attention to the content of the file for two main reasons:
- Acquiring comprehensive field survey results is expensive, if possible at all
- Preparing a proper Ground Truth data file requires an understanding of all the details required for creating robust ML models
In short, the goal of the Ground Truth data should be to provide complete, well-balanced, and properly distributed data. It should also serve as a reference on how to recognize different objects of interest to facilitate complete and reliable labeling.
It should have the following characteristics:
- Contain all the required details of the data (crop parcels in the agriculture case). For example, for crop type identification, GT should specify the crop field ID, dimensions, GPS location, shape, size, crop cultivated, seasonal crops for the field, crop cycle details, land use patterns, etc.
- Have a well-balanced number of classes (e.g. an equal or similar number of samples for each class)
- Be well distributed spatially (various terrains in the area of interest) and temporally (covering various time periods like seasons/crop cycles, etc.), representing the full range of possible scenarios
- Contain more data points than the expected number of images, since for some or many of them there won’t be good satellite images (due to weather conditions, pollution, etc.)
- Contain samples from a few different years, so that if one year’s satellite images cannot be used because of bad weather or for some other reason, we can use images from the other years covered by the ground truth data to compensate for the data loss
- Highlight periods when objects of interest (say, different crops) are easiest to recognize, for example the months when the crops are fully grown, to make labeling easier (from there, we can easily propagate masks knowing the vegetation specifics)
- Include examples of similar kinds of crops/vegetation (not just visually but also in terms of VIs, if possible) in the regions around the area of interest. For example, if rice fields are the class of interest, examples of grass, cornfields, etc. in the surrounding area that look similar to rice fields should also be included
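A quick check of the class balance in a ground truth file can be sketched as below. This is only an illustration under stated assumptions: the column names (`field_id`, `crop`), the toy rows and the `max_ratio` threshold are all hypothetical, not taken from the project’s actual GT file.

```python
from collections import Counter


def class_balance(records, label_key="crop", max_ratio=2.0):
    """Count samples per class and flag classes that are under-represented.

    A class is flagged when the largest class has more than `max_ratio`
    times as many samples (the threshold is an arbitrary assumption).
    """
    counts = Counter(r[label_key] for r in records)
    largest = max(counts.values())
    flagged = sorted(c for c, n in counts.items() if largest / n > max_ratio)
    return counts, flagged


# Toy ground truth rows; the column names and counts are hypothetical.
ground_truth = (
    [{"field_id": i, "crop": "rice"} for i in range(60)]
    + [{"field_id": 100 + i, "crop": "wheat"} for i in range(40)]
    + [{"field_id": 200 + i, "crop": "grass"} for i in range(10)]
)
counts, flagged = class_balance(ground_truth)
print(dict(counts))
print("Under-represented classes:", flagged)
```

The same counting can be repeated per region or per season to check the spatial and temporal distribution described above.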
The key point to remember is that Ground Truth data quality will have a big impact on the dataset created, labeling/masking done on a dataset, and ultimately results of the solution.
#2 Source of satellite images — decide the source
With the advent of satellites launched by many countries and private organizations, satellite imagery has become more accessible to the general public for a variety of applications, in our case, crop classification. Some of the more popular programs are Landsat (by USGS & NASA, 30m resolution since the early 1980s), MODIS (by NASA, near-daily satellite imagery of Earth in 36 spectral bands since 2000), Sentinel (by ESA; the Sentinel2 mission revisits a location every 5 days and images it in 13 spectral bands, since 2015) and ASTER (by NASA, detailed maps of land surface temperature, reflectance, and elevation).
Organizations selling satellite imagery
Several private organizations sell raw & processed satellite imagery with customized data as required by customers. A few popular ones are GeoEye (since September 2008; images with a ground resolution of 0.41 meters, or 16 inches, in panchromatic or black-and-white mode, plus multispectral or color imagery at 1.65-meter resolution, or about 65 inches), DigitalGlobe (panchromatic-only imagery at 0.46m & 0.6m spatial resolution, and also images with 0.31m spatial resolution), OneAtlas platform (by Airbus; Optical & Radar Earth Observation), Spot Image (headquartered in Toulouse, France; images at 1.5m for the panchromatic channel and 6m for multi-spectral, as well as 0.50-meter imagery, or about 20 inches) and ImageSat International (operator of the “EROS” satellites; images can be used for mapping, border control, infrastructure planning, agricultural monitoring, environmental monitoring, disaster response, training, simulations, etc.).
Key decision points for choosing the satellite image source for crop classification:
- Raw or processed datasets: raw satellite images can’t be used directly, and processing them for crop classification is an involved activity requiring various tools, so processed images that are readily available as part of datasets are a good choice to start with
- Image quality: sharp images with clear differentiation of the objects we are interested in are critical; the higher the resolution, the better the results with the ML model
- Spatial resolution of images: this is the area on the ground covered by a single pixel in the satellite image, so the smaller that area (pixel sizes go down to 15cm), the finer the detail. A few organizations improve the spatial resolution of final images by applying scaling techniques, which can sometimes degrade image quality, so watch out for this (e.g. bands of the Sentinel2 dataset are resampled with a constant Ground Sampling Distance of 10m, 20m or 60m depending on the native resolution of each band, and the corresponding images can be of low quality due to the resampling)
- Free or paid: this may seem like an easy choice, but aspects like quality, processing done, completeness of data, etc. in the provided datasets depend on the effort spent by the provider; the source of images used at inference time and the accuracy/other metrics required of the ML models mainly drive the decision
- Temporal image coverage: depending on where, when and for what purpose the satellites were launched, imagery may be available only for certain geographies and periods (e.g. the Sentinel2 Level1C dataset has images only from June 2015 onwards); the temporal data required for the task will help to decide
- Spectral bands to use: depending on the sensors in the satellites, various spectral data will be available in images (e.g. Sentinel2 imagery contains 13 spectral bands); the vegetation indices needed for the remote sensing task will help to decide
- Number of images per day/week/month: depending on how often the satellite passes over the area of interest, the number of images may vary (e.g. Sentinel2 passes over a location every 5 days, so in a month there will be 4 to 6 images of a particular location); the volume of images required will help to decide
- Image processing to be done: image quality decreases with factors like cloud cover, haze, pollution, etc., and many organizations apply processing techniques to remove such distractions from images; the image quality requirements for the task at hand will help to decide
#3 Spatial distribution — know the terrain
While RGB bands in satellite images can show the crop fields, the terrain of these fields also plays an important role. For example, crop fields in plains tend to be large and regularly shaped, usually with similar crops in the neighborhood; in hilly areas, crop fields tend to be small, with varying shapes and altitudes, and may be surrounded by a mix of other vegetation; in forest areas, crop fields tend to be surrounded by thick trees, without a clear visual representation of the fields and their borders. Hence, understanding the various terrains the crop fields are in becomes important.
The recommendation is to look at satellite images from different time periods/seasons to understand the terrain of the area of interest, include images from various terrains in the dataset, consider the challenges with images from different areas while labeling/masking, and address those challenges as much as possible with appropriate labeling.
#4 Temporal distribution — know the growth cycle
Satellite images from all the months covering various growth stages of crops should be added to the dataset. These images will help the ML model to generalize well and be able to accurately identify crops irrespective of the growth stage of crops. A better understanding of the crops’ growth cycle and seasonal crops cycle can help to find satellite images of crops at different stages of growth.
While looking for temporal data, there is a possibility that a few months in a specific year do not have any images due to bad weather or climatic conditions. In such cases, consider choosing images from other years for these months. An important assumption that needs to be validated here is that the crop cycle & seasonal crops for those years are the same as the year with ground truth. In some cases, the same crop can have a different growth cycle in different regions.
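The fallback to other years can be sketched as a simple lookup per calendar month. This is a minimal illustration, not the project’s actual tooling: the function name, the catalogue structure and the years and file names in the toy data are all assumptions. It also relies on the assumption stated above, that the crop cycle in the fallback years matches the ground truth year, which still has to be validated separately.

```python
def pick_images_by_month(available, target_year, fallback_years):
    """Choose one (year, images) source per calendar month.

    `available` maps (year, month) to a list of usable image names.
    The target year is preferred; fallback years are tried in order.
    Months with no usable image in any year are reported as missing.
    """
    chosen, missing = {}, []
    for month in range(1, 13):
        for year in [target_year, *fallback_years]:
            if available.get((year, month)):
                chosen[month] = (year, available[(year, month)])
                break
        else:
            # No year had a usable image for this month.
            missing.append(month)
    return chosen, missing


# Toy catalogue: the target year has no usable images for July and August.
available = {
    (2018, m): [f"2018_{m:02d}.tif"] for m in range(1, 13) if m not in (7, 8)
}
available[(2017, 7)] = ["2017_07.tif"]
chosen, missing = pick_images_by_month(available, 2018, [2017, 2016])
print("July taken from:", chosen[7][0])
print("Months still missing:", missing)
```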
In our case with Nepal, there were 10 to 12 varieties of rice widely adopted by farmers, having two main growing seasons depending on rice variety: 1) Spring rice (February/March to June/July): Chaite 2, Chaite 4, Ch 45, Bindeswar, etc. and; 2) Main season rice (June/July to October/November): Mahsuri, Savitri, etc. (Source).
Another important point to consider regarding temporal data is the land use pattern of cultivated fields. Though there might be a defined crop cycle for each crop, there can be scenarios where the same fields are used to cultivate different crops in different seasons (e.g. seasonal short-term crops may be cultivated in the same fields after the main crop is harvested and before the next sowing). Missing out on satellite images from the months representing seasonal crops will result in an incomplete dataset and, in turn, inaccurate ML models.
Talking to the subject matter experts/farmers in the area of interest to understand the temporal data to be captured is critical at this stage of dataset creation.
#5 Image quality — know what’s in the images
Satellite images are relatively high in resolution, and processing them is time-consuming. Depending on the sensor that produced the imagery, appropriate processing is required before the images can be used. In addition, weather (rain, clouds, etc.) & environmental (pollution, haze, etc.) conditions can affect image quality. For these reasons, publicly available satellite image datasets are typically processed by third parties for visual, scientific or commercial use.
Just like any other digital image, the resolution of satellite images is critical for the purpose and varies depending on the instrument used and the altitude of the satellite’s orbit. There are four types of resolution when discussing satellite imagery in remote sensing: spatial (pixel size of an image representing the size of the surface area being measured on the ground), spectral (wavelength interval size and number of intervals), temporal (amount of time/days that passes between imagery collection periods for a given surface location) and radiometric (levels of brightness/contrast).
Though there are open datasets of satellite imagery available to the public free of cost, quality images for specific purposes like crop growth detection, crop classification, crop type identification, etc. are expensive. The higher the quality, the higher the cost. As simple as that. But high-quality images are not always required. For example, for a project with the objective of identifying building structures in satellite images, we may not require images with a spatial resolution as fine as 30cm or so. So, there’s no standard guideline or single rule suggesting the minimum or maximum image quality required for a project. It all depends on the objective of the project. However, there are two main factors which need to be considered before deciding on the quality of images to use:
- ML model’s accuracy/performance metrics: will the chosen image quality help to meet the performance requirements of the project?
- Data labeling potential: will the chosen image quality be good enough for labeling, given the kind of objects to be identified in the images?
The decision on point 1 is quite obvious: we can test the ML models using images of different quality and choose the lowest-quality images that still meet the project’s requirements/metrics. The decision on point 2 is more subjective, depends on what objects are to be identified in the images, and should be agreed upon at the beginning of the project by carefully assessing the labeling capabilities. Either way, the image quality must be good enough for labeling, so people can easily see and draw masks around all objects/classes of interest.
Solutions that require small objects to be identified, where distinguishing edges is more important than estimating the overall coverage, require very high-quality images. That was the case with Omdena’s Trees Recognition Challenge, where the goal was to identify trees close to electricity lines in order to prevent power outages and fires. There, extremely accurate masking, close to the trees’ edges, was necessary. For that, images with a high spatial resolution of 0.5m had to be used, and not only trees but also little bushes and shadows had to be precisely annotated. And that paid off. With only 150 original images and very basic transformations, using a Deep UNet model, the team achieved around 95% accuracy.
For our crop identification project in Nepal, such a high resolution was not necessary as crop field shapes were pretty much regular (except for the ones in hilly areas), with borders being mostly straight lines. So, in this project, the ability to distinguish similar objects (like rice vs. grass) at the labeling stage was the main factor in deciding on image quality. We ended up using Sentinel2 Level-1C satellite images with 13 spectral bands from the Copernicus program (European Space Agency), with a maximum spatial resolution of 10m per pixel for certain bands. Unfortunately, it turned out that the max zoom level to get clear RGB images was only 100m. And that zoom level was not good enough for labeling, since the crop field areas appear too small in the agricultural setting, as seen in the images below (the actual images used were 500×500 in size, and yet many fields appeared too small/unclear to recognize).
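A rough back-of-the-envelope calculation shows why small fields are hard to label at this resolution. The sketch below compares the 10m-per-pixel Sentinel2 resolution with the 0.5m imagery mentioned for the trees project; the field size is a hypothetical value chosen only for illustration.

```python
def pixels_per_field(field_area_m2, gsd_m):
    """Approximate number of pixels covering a field at a given ground sampling distance."""
    return field_area_m2 / (gsd_m ** 2)


field_area_m2 = 2_500  # hypothetical 50 m x 50 m field (0.25 ha)
for gsd_m in (10, 0.5):
    n_pixels = pixels_per_field(field_area_m2, gsd_m)
    print(f"{gsd_m} m/pixel -> about {n_pixels:.0f} pixels")
```

Under this assumption, the same field that covers thousands of pixels in 0.5m imagery shrinks to a few dozen pixels at 10m, which leaves very little room for drawing an accurate border by hand.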
Since data from RGB channels are not enough for our AI model to identify crops, we could use data from the other spectral bands in Sentinel2 imagery to calculate different vegetation indices, include them in images and then train ML models with that dataset. As suggested in this paper, the dataset can be based on images with other bands including RGB and appropriate vegetation indices.
However, without good RGB images, the problem of proper labeling/masking persists. And it seems that there are only a few options, as listed below:
- Even if the final solution can’t be based on commercially obtained high-quality satellite images, such images might still be necessary at least for one-time masking/labeling, just at the model creation phase. Those masks can then be used to train the model using multi-spectral bands, with appropriate vegetation indices calculated and added as bands in the images.
- Make sure the ground truth data defines the dimensions of every field precisely, using GPS coordinates, to help the labeling/masking team mark crop fields accurately.
- Have teams of people with GPS devices physically walk the fields, and use GIS software to draw the masks on satellite images automatically from the recorded coordinates.
#6 Vegetation indices — know the right indices
As seen in the section above, specialized tasks like differentiating vegetation types require analyzing data contained in spectral bands of satellite images (ranging from 3 to 16 bands) other than just the RGB bands. This is where Vegetation Indices (VIs) play a critical role. Vegetation Indices are combinations of surface reflectance at two or more wavelengths designed to highlight a particular property of vegetation.
They are derived using the reflectance properties of vegetation. Each of the VIs is designed to accentuate a particular vegetation property. Satellite images from the various organizations have a varying number of spectral bands containing data useful for VI calculations. For example, Sentinel2 from ESA is a wide-swath, high-resolution, multi-spectral imaging mission supporting land monitoring studies (vegetation, soil, water cover, inland waterways & coastal areas); its imagery has 13 spectral bands containing top of atmosphere (TOA) reflectance values, which can be used for a variety of VI calculations.
More than 150 VIs have been published in scientific literature, but only a small subset have a substantial biophysical basis or have been systematically tested. Many tools/platforms provide support for calculating various VIs. For example, the SNAP platform part of Sentinel Toolboxes supports around 21 VI calculations and the ENVI Image Analysis platform provides 27 VIs to use to detect the presence and relative abundance of pigments, water, and carbon as expressed in the solar-reflected optical spectrum (400 nm to 2500 nm). VIs can be broadly categorized into the following groups — Broadband Greenness, Narrowband Greenness, Light Use Efficiency, Canopy Nitrogen, Dry or Senescent Carbon, Leaf Pigments, and Canopy Water Content.
The table below shows a few vegetation indices applicable to differentiating Rice, Wheat & Other crops in the Sentinel2 Level1C dataset of Nepal’s specific cultivation areas. The analysis was done based on data from 30+ sample crop fields of each category. Unfortunately, the VI values of crops vary from region to region based on climatic conditions, temperature, soil conditions, etc. Hence, a study/analysis done for crops in one region cannot be reused for the same crops in other regions, and the values need to be calculated for each region/area of interest.
Specifically, in the case of ‘crop type identification’, calculating the chosen vegetation indices, adding them as bands/channels to the image dataset, and then feeding them to ML models helps the model learn correlations between crop type and vegetation indices and ultimately identify different crop types with high accuracy. So, do not forget to explore the various VIs suitable for the task at hand.
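As an illustration of adding an index as an extra band, the sketch below computes NDVI, one widely used broadband greenness index defined as (NIR - Red) / (NIR + Red), and appends it to every pixel. NDVI is picked here only as a familiar example and is not necessarily one of the indices chosen in the project. For Sentinel2, the red and near-infrared inputs would be bands B4 and B8. The function names, the band order and the reflectance values are assumptions, and plain Python lists are used to keep the example dependency-free; a real pipeline would more likely use an array library.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    # Guard against division by zero for pixels with no signal in either band.
    return (nir - red) / total if total else 0.0


def add_ndvi_band(image, red_idx, nir_idx):
    """Return a copy of `image` with NDVI appended as an extra band per pixel.

    `image` is a nested list [row][col][band] of reflectance values.
    """
    return [
        [list(pixel) + [ndvi(pixel[nir_idx], pixel[red_idx])] for pixel in row]
        for row in image
    ]


# Toy 1 x 2 image with bands ordered (blue, green, red, nir); values are made up.
image = [[
    [0.05, 0.08, 0.06, 0.54],  # dense vegetation: high NIR, low red
    [0.10, 0.12, 0.15, 0.15],  # bare soil or built-up area: NIR close to red
]]
with_ndvi = add_ndvi_band(image, red_idx=2, nir_idx=3)
for pixel in with_ndvi[0]:
    print(f"NDVI = {pixel[-1]:.2f}")
```

Other indices can be appended the same way, so the model sees them as additional channels next to the original bands.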
#7 Labeling/ Masking — know what & where
Depending on the ML model architecture (classification or segmentation), the objectives of labeling change:
- For classification — diversified dataset of pictures labeled with one class per image
- For segmentation — every parcel belonging to each class of interest needs to be annotated/masked in every image, along with other objects around them
Since labeling for a segmentation task is more demanding, here are a few points to do it right:
- Every crop field belonging to every class of interest should be masked. Omitting/mislabeling and not being accurate enough will result in false negatives/positives at the model training stage and will impact the model performance
- Labeling needs to be accurate with regard to the objects of interest in the image, such as crop field/parcel borders: ideally, every parcel should have its own, clearly outlined border
- If labeling is required for temporal data (satellite images of a location taken at various time intervals such as a week, month or year), as in a crop type identification project, we can manually label only the images where the objects are at their easiest-to-recognize stage. In a crop type identification project, that means labeling only images with fully grown crops. Since the location of the fields doesn’t change across temporal images, masks from those images can be propagated to the other temporal images, saving a lot of time & effort
- To ensure our model learns to differentiate crop types cultivated at the very same fields during different seasons (Ex: rice-wheat-fallow, rice-winter maize-fallow), it is important to annotate not only the main seasonal crops but also samples of other crops cultivated in the same field
- Labeling should enable ML models to differentiate objects of interest in images from the ones that look similar to them. So, it is a good idea to have samples of similar objects in the dataset and to label them appropriately. For example, grass & little bushes might look similar to rice fields on RGB images, so creating a separate label category for these similar-looking objects is a good idea
- Similarly, areas/environments surrounding the main object of interest in an image should also be labeled to help highlight the differences between them. For example, in a crop identification project, images with rice fields near river banks or on the river bed should be annotated with the river and riverbank as well. Since a rice field will be filled with water for the initial 1 to 5 weeks or so, the model should be able to differentiate between rice fields during the first few weeks and an actual river or pond nearby.
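The mask propagation mentioned in the list above can be sketched as follows. This is a simplified illustration that assumes all temporal images of a tile are co-registered (same extent and pixel grid), so a field occupies the same pixels on every date. The class ids, dates and function names are hypothetical; the relabeling step shows one possible way to handle a field that carries a different crop in another season.

```python
def propagate_mask(mask, acquisition_dates):
    """Reuse one manually drawn mask for every acquisition of the same tile.

    Each date gets its own copy so that later edits do not affect the others.
    """
    return {date: [row[:] for row in mask] for date in acquisition_dates}


def relabel(mask, mapping):
    """Return a copy of `mask` with class ids replaced according to `mapping`."""
    return [[mapping.get(value, value) for value in row] for row in mask]


# Mask drawn on the image where crops are fully grown (0 = background, 1 = rice).
peak_mask = [
    [0, 1, 1],
    [0, 1, 1],
]
masks = propagate_mask(peak_mask, ["2018-08-05", "2018-09-04", "2018-12-03"])

# Hypothetical rotation: the same fields carry wheat (class 2) in December.
masks["2018-12-03"] = relabel(masks["2018-12-03"], {1: 2})
print(masks["2018-12-03"])
```

The field borders are drawn once, while the class assigned to each field can still follow the crop calendar taken from the ground truth data.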