Desertification Detection using Machine Learning and Satellite Data

March 24, 2022

In this article, you will learn how to build a forecasting model for detecting desertification across different land covers in Iraq. The work stems from the Omdena Iraq Chapter.

Problem background

Desertification is described as the persistent degradation of dryland ecosystems caused by variations in climate and by human activities.

In other words, it turns what used to be arable farmland into unproductive land. It is one of the greatest environmental challenges today and, unfortunately, it mostly affects the world’s poorest populations.

Desertification leads to many other problems. It damages the agricultural sector, which leads to more hunger, and it displaces the people who used to live on the yields of these once-green fields, which in turn brings its own set of problems.

Fortunately, most of this degradation can be reversed and treated by many methods, which is why many reports have been published addressing this important topic and demanding immediate action. Even so, many countries still suffer from it, largely because the authorities in these regions and countries disregard the problem.

Land Cover of Iraq

According to recent reports, the rate of desertification in Iraq has increased to 39%, and 54% of the country’s agricultural land faces drought and land degradation. According to a report by the Republic of Iraq Ministry of Agriculture, Iraq is losing 100 square kilometres of arable land annually as a consequence of desertification. Iraq’s heavy dependence on water that originates outside its borders, the mismanagement of water, inefficient farming practices, and the already dry climate make it more vulnerable to climate change. Having more reliable information about where to focus efforts could be the beginning of solving this huge challenge and providing immediate help to the most endangered regions.

In this work, we present some interesting findings. We call for further investigation and for more dedicated projects to validate the results our team obtained.

Desertification Detection in Iraq

AI has provided increasingly accurate forecasts in recent years, allowing solutions to be formulated faster and with more agility than before.

Here we harness this technological advancement to help predict the areas and regions most likely to fall victim to desertification in Iraq in the coming years. The goal of this project is to build a forecasting model that can predict the status of different land covers in Iraq. For this purpose, we went through the following steps:

Vegetation index analysis: Analyse the loss of greenness and the degradation of land in Iraq over the years (using NDVI, NDWI, and other indices).
Drought area analysis: Use supervised machine learning algorithms to classify different land cover types.
Time series analysis: Analyse the desertification process through historical data and predict it for the next few years.
Land Use Land Cover classification using machine learning with Google Earth Engine: Use the cloud computing power provided by Google Earth Engine to train a land cover classifier.
Dashboard: Build a dashboard with the Streamlit library to visualize the affected areas and our future predictions.

In addition to raising awareness of this problem, this work also intends to use AI and state-of-the-art machine learning algorithms to address such problems in Iraq.

We present the results of a five-week project initiated by the Omdena Iraq Chapter. The participants implemented four different methods to analyze the effect of drought and desertification in Iraq. We hope this encourages more in-depth work using AI and machine learning approaches. We also present the dashboard the team created to showcase the results.

All resources and data used in this work are freely and publicly available.

1. Vegetation area analysis (Joyce)

The normalized difference vegetation index (NDVI) is a simple graphical indicator used to analyze remote sensing measurements, often from a space platform, and to assess whether or not the observed target contains live green vegetation.
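To make the definition concrete, here is a minimal sketch of the standard NDVI formula, (NIR - Red) / (NIR + Red). The function name, the zero-division handling, and the reflectance values are illustrative assumptions, not part of the project code.

```python
def ndvi(nir: float, red: float) -> float:
    """Return the standard NDVI, (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        # Assumption: treat an empty pixel as neutral instead of dividing by zero.
        return 0.0
    return (nir - red) / total


# Hypothetical reflectance values, for illustration only.
dense_vegetation = ndvi(nir=0.50, red=0.08)
bare_soil = ndvi(nir=0.30, red=0.25)

print("dense vegetation:", round(dense_vegetation, 3))
print("bare soil:", round(bare_soil, 3))
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so its NDVI is pushed towards 1, while bare soil stays close to 0.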

Fig(1): What NDVI represents

Original image source

In this section, we analyse NDVI. One of the major concerns in data collection was the huge file sizes, so we used only a few satellite images with the least cloud cover. We downloaded the red, green, blue, NIR, and SWIR bands of each image. A single image was larger than 1 GB, which made processing the images a challenging task for the team. Instead of using tools like QGIS, we processed the images with the rasterio library in Python. Because of the large file size, we read the image data in blocks and calculated the indices of interest block by block. A summary of this task is shown below:

Region of Interest: Mosul – Iraq
Dataset: Sentinel-2 images obtained through Google Earth Engine
Period of study: 2016, 2018, 2021
Bands: 5 bands downloaded (R, G, B, NIR, SWIR)
Processing method: rasterio, reading the images in blocks
Indices: NDVI, NDBI, NDWI, and MSAVI
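The block-wise index calculation summarised above can be sketched in plain Python. This is not the team's rasterio code: bands are represented as flat lists, `index_in_blocks` is a hypothetical helper, and the index variants are assumptions (NDWI as (Green - NIR) / (Green + NIR), NDBI as (SWIR - NIR) / (SWIR + NIR), and MSAVI in its common closed form, MSAVI2), since the article does not state which definitions were used.

```python
import math
from typing import List, Sequence


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndwi(green: float, nir: float) -> float:
    # Assumed variant: (Green - NIR) / (Green + NIR).
    return normalized_difference(green, nir)


def ndbi(swir: float, nir: float) -> float:
    # (SWIR - NIR) / (SWIR + NIR): higher over built-up surfaces.
    return normalized_difference(swir, nir)


def msavi2(nir: float, red: float) -> float:
    # Assumed closed form (MSAVI2); valid for non-negative reflectances.
    term = 2 * nir + 1
    return (term - math.sqrt(term ** 2 - 8 * (nir - red))) / 2


def index_in_blocks(band_a: Sequence[float], band_b: Sequence[float],
                    block_size: int) -> List[float]:
    """Compute a normalized difference block by block to bound memory use."""
    if len(band_a) != len(band_b):
        raise ValueError("bands must have the same length")
    result: List[float] = []
    for start in range(0, len(band_a), block_size):
        stop = start + block_size
        # Only one block of each band is touched per iteration.
        result.extend(
            normalized_difference(a, b)
            for a, b in zip(band_a[start:stop], band_b[start:stop])
        )
    return result


# Hypothetical pixel values, for illustration only.
nir = [0.50, 0.30, 0.05, 0.40, 0.45]
red = [0.08, 0.25, 0.04, 0.10, 0.09]
ndvi_blocked = index_in_blocks(nir, red, block_size=2)
print([round(v, 3) for v in ndvi_blocked])
```

With a real raster, each block would be a window read from disk, so only a small part of the 1 GB image is in memory at any time.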

The following figures show NDVI values for Mosul in three different years, 2016, 2018, and 2021, calculated from Sentinel-2 data. Each point is classified as water, bare area, low vegetation, medium vegetation, or high vegetation based on its NDVI value.
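The classification step can be sketched as a simple threshold lookup followed by a count per class. The threshold values below are assumptions chosen for illustration; the article does not state the cut-offs the team used.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical upper bounds for each class (assumed, not from the project).
THRESHOLDS = [
    (0.0, "water"),
    (0.2, "bare area"),
    (0.4, "low vegetation"),
    (0.6, "medium vegetation"),
]


def classify_ndvi(value: float) -> str:
    """Map one NDVI value to a land class using the assumed thresholds."""
    for upper_bound, label in THRESHOLDS:
        if value < upper_bound:
            return label
    return "high vegetation"


def class_percentages(values: Sequence[float]) -> Dict[str, float]:
    """Return the share of pixels in each class, in percent."""
    if not values:
        return {}
    counts = Counter(classify_ndvi(v) for v in values)
    return {label: 100.0 * n / len(values) for label, n in counts.items()}


# Hypothetical NDVI pixels, for illustration only.
sample = [-0.10, 0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.70, 0.10, 0.12]
shares = class_percentages(sample)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.1f}%")
```

Percentages such as the bare-area share reported for each year are this kind of per-class pixel count divided by the total number of pixels.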

Fig(2): Shows the NDVI values we got in 2016, Mosul – Iraq

The above figure shows an analysis of the NDVI values in the year 2016. According to the study, 63.9% of the land in the Mosul region is bare area. Even though there is some vegetation, most of it is only low vegetation. The low vegetation area corresponds to 30.8% of the area under study.

Fig(3): NDVI in 2018, Mosul – Iraq

The analysis of the 2018 data shows a significant drop in the percentage of bare land, to only 35.9%. Although this looks like a good sign, such a large change within two years suggests some discrepancy in the data. Still, the other values also show an increase in overall greenness.

Fig(4): NDVI in 2021, Mosul – Iraq

The 2021 analysis shows that the bare area decreased from 63.9% in 2016 to 58.8% in 2021. There is also an overall increase in the percentage of low, medium, and high vegetation compared with 2016.

2. Drought area analysis (Sai Villiers)

This task demonstrates how we used supervised machine learning algorithms to classify different land type covers. A summary of what has been done for this task is shown below.

Collected MODIS NDVI for the month of April from 2000 to 2021.
Established the Vegetation Condition Index (VCI) by calculating it from the long-term maximum and long-term minimum NDVI.
Sampled 50 agricultural points from the ESA 10 m Land Use and Land Cover map and used those sample points to extract the VCI.
Applied a standardised threshold: VCI < 40 for drought conditions and VCI > 40 for non-drought conditions.
Calculated the drought frequency: each pixel is flagged as drought or not in each year, and the flags are summed over all years to establish whether a point is a drought point.
Published the data using QGIS or Streamlit.
Used supervised learning methods to separate drought from non-drought areas.
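The VCI and drought-frequency steps above can be sketched as follows, using the usual definition VCI = 100 × (NDVI − NDVI_min) / (NDVI_max − NDVI_min) and the VCI < 40 threshold from the list. The final decision rule (`min_share`) is an assumption, because the article does not say how the summed frequency is turned into a drought label, and the NDVI values are made up.

```python
from typing import Dict

DROUGHT_THRESHOLD = 40.0  # VCI below this value counts as drought (from the text)


def vci_by_year(ndvi_by_year: Dict[int, float]) -> Dict[int, float]:
    """Scale each year's NDVI between the long-term minimum and maximum."""
    low = min(ndvi_by_year.values())
    high = max(ndvi_by_year.values())
    if high == low:
        raise ValueError("NDVI range is zero, VCI is undefined")
    return {
        year: 100.0 * (value - low) / (high - low)
        for year, value in ndvi_by_year.items()
    }


def drought_frequency(vci: Dict[int, float]) -> int:
    """Count the years in which the point is in drought."""
    return sum(1 for value in vci.values() if value < DROUGHT_THRESHOLD)


def is_drought_point(frequency: int, n_years: int, min_share: float = 0.5) -> bool:
    # Hypothetical rule: drought in at least `min_share` of the years.
    return frequency / n_years >= min_share


# Hypothetical April NDVI values for one sample point.
april_ndvi = {2000: 0.20, 2001: 0.35, 2002: 0.25, 2003: 0.50, 2004: 0.22}
vci = vci_by_year(april_ndvi)
frequency = drought_frequency(vci)
print({year: round(value, 1) for year, value in vci.items()})
print("drought years:", frequency,
      "drought point:", is_drought_point(frequency, len(vci)))
```

Because VCI is scaled per point, it compares each year with that point's own history rather than with other locations.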

Fig(5): Our methodology in summary

2.1. Region of Interest

It was important for us to select points from various regions in Iraq and with different drought levels. Covering more regions would have given more accurate results, and with more time more regions could certainly be included. That said, the other approaches in this project cover regions that are not included in this specific method, which helps give an overall view of the desertification status across the country.

The figure below shows the group of points that were included in the drought analysis that was conducted in our project.

Fig(6): Region of Interest

Fig(7): Distribution of agricultural sample/ Land use and landcover of northern Iraq using ESA 10m

Results & Visualizations

Fig(8): Vegetation condition index of April 2000 using MODIS NDVI/ Drought status estimated from drought frequency

Below are the result values of the analysis for each location. Note that each pixel has a resolution of 10 m on the ground.

Fig(9): Drought analysis results for different locations

2.2 Limitations

One limitation is that downloading and analysing the results as images produces a very large volume of data, at least 8 GB.

3. Multiple Time Series Analysis (Liangliang)

3.1. Introduction

The focus of this part is to analyse the desertification process through historical data and predict it for the next few years. The challenge is that the amount of data in a single image collection is overwhelming, which makes processing many years of data very difficult. Since the current land conditions can be seen intuitively in Google Earth Engine, we selected a limited number of representative areas and indicators and retrieved data only for them, which greatly reduces the amount of data that needs to be processed.

3.2. Method

The current prediction model is only a demonstration of the most simplified data processing pipeline. It shows how data from Google Earth Engine can be retrieved and processed into usable data that can be fed into most machine learning models, such as LightGBM, XGBoost, or a neural network with multiple outputs. The data for the specified area and time period are downloaded from Google Earth Engine. In the prediction model, we selected 20 years of NDVI data in two locations to form a basic multiple time series model. Depending on the study area, more locations and longer periods can be selected for processing. Model selection and parameter optimization are not the focus of this study.

Fig(10): Region of Interest for the time analysis

3.3. Dataset

The MOD13Q1 V6 dataset (MOD13Q1.006 Terra Vegetation Indices 16-Day Global 250m) provides a Vegetation Index (VI) value on a per-pixel basis every 16 days, masked for water, clouds, heavy aerosols, and cloud shadows. NDVI on land generally ranges between 0 and 1. Negative values, caused by missing values on a given day, are removed in the model. To exclude the effect of seasonal changes, we averaged the 23 NDVI values in a year to get an annual mean. This calculation can also be done in Google Earth Engine, which greatly reduces the amount of data that needs to be downloaded, especially when we need to study more locations. One advantage of using MODIS is that it has labels for different land types, which can be easily chosen in Google Earth Engine to compare the areas to be studied.
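A minimal sketch of the annual-mean step described above: negative NDVI values are dropped as missing, and the remaining 16-day values are averaged per year. The record layout and the sample values are assumptions for illustration, and the sketch assumes NDVI values that are already scaled to the 0 to 1 range.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def annual_mean_ndvi(records: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average NDVI per year, ignoring negative (missing) values.

    `records` is a sequence of (year, ndvi) pairs, one per 16-day composite.
    """
    by_year: Dict[int, List[float]] = defaultdict(list)
    for year, value in records:
        if value >= 0:  # negative values are treated as missing
            by_year[year].append(value)
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Hypothetical composites (a real year would have 23 of them).
records = [
    (2020, 0.20), (2020, -0.30), (2020, 0.40),
    (2021, 0.10), (2021, 0.30),
]
annual = annual_mean_ndvi(records)
print({year: round(value, 3) for year, value in annual.items()})
```

Averaging over the whole year removes the seasonal cycle, so the series that remains reflects year-to-year change.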

Code for downloading historical NDVI data:

https://github.com/OmdenaAI/iraq-chapter-desertification-detection/blob/main/src/tasks/task-4-Time-Series-Analysis/Downloading_NDVI_values_from_google_earth_engine.ipynb

Fig(11): Table shows the data downloaded from GEE

Fig(12): Table shows the calculating of the NDVI annual mean

3.4. Regional sampling

Taking NDVI samples and averaging them allows us to obtain NDVI curves covering regions of different sizes. By studying historical data and forecasts, we can obtain an overall impression of NDVI changes in the selected area.

Sampling points over a wider region can help us to get an intuition of changes in NDVI from a broader perspective. For example, by selecting points over the entire country and taking the average, a sampled NDVI curve of the whole country can be obtained.
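As a sketch of this sampling idea, the following averages the annual NDVI of several sample points into one regional curve. The point names and values are hypothetical, and years missing at a point are simply skipped for that point.

```python
from statistics import mean
from typing import Dict


def regional_curve(point_series: Dict[str, Dict[int, float]]) -> Dict[int, float]:
    """Average the annual NDVI of all sampled points, year by year."""
    years = sorted({year for series in point_series.values() for year in series})
    return {
        year: mean(
            series[year] for series in point_series.values() if year in series
        )
        for year in years
    }


# Hypothetical annual NDVI means for two sampled points.
points = {
    "point_a": {2019: 0.20, 2020: 0.30},
    "point_b": {2019: 0.40, 2020: 0.50, 2021: 0.60},
}
curve = regional_curve(points)
print({year: round(value, 3) for year, value in curve.items()})
```

Sampling more points over a wider region gives a smoother curve that describes the region rather than any single pixel.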

Then, according to the classification and characteristics of the objects under investigation, we can gradually move to more specific regions and land types to analyze lands with the same characteristics.

Fig(13): The NDVI values samples from different locations

3.5. Key area

According to the characteristics of sand dune movement and expansion, the edge of the desert is the key area to investigate. Precipitation is another factor to consider. It follows a fairly regular pattern in a given area and, combined with other features, should allow the results to be predicted more accurately.

Fig(14): A closer look into the selected locations. You can see some located right at the edge of the desert.

3.6. Feature engineering

In the feature engineering step, more features such as the mean, maximum, minimum, and standard deviation can be added to the model. For a single time series, the number of lags can be chosen from an autocorrelation plot. For multiple time series this no longer applies, because considering only individual autocorrelations could miss lags that are important only jointly.
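The following sketch shows one way to turn an annual NDVI series into a supervised-learning table with lag features and window statistics. The function names, the fixed number of lags, and the sample series are assumptions for illustration; they are not the team's pipeline.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def build_lag_features(series: Sequence[float], n_lags: int) -> List[Dict[str, float]]:
    """Build one row per time step from the previous `n_lags` values."""
    rows = []
    for t in range(n_lags, len(series)):
        window = series[t - n_lags:t]
        row = {f"lag_{k}": series[t - k] for k in range(1, n_lags + 1)}
        row["mean"] = mean(window)
        row["max"] = max(window)
        row["min"] = min(window)
        row["std"] = pstdev(window)
        row["target"] = series[t]  # value the model should predict
        rows.append(row)
    return rows


def build_table(series_by_location: Dict[str, Sequence[float]], n_lags: int) -> List[dict]:
    """Stack the rows of several locations, keeping the location as a feature."""
    table = []
    for location, series in series_by_location.items():
        for row in build_lag_features(series, n_lags):
            table.append({"location": location, **row})
    return table


# Hypothetical annual NDVI means for two locations.
data = {
    "location_1": [0.10, 0.20, 0.30, 0.40, 0.50],
    "location_2": [0.30, 0.28, 0.31, 0.27, 0.29],
}
table = build_table(data, n_lags=3)
print(len(table), "rows; first row:", table[0])
```

A table like this can be passed to a gradient-boosting model or another regressor, with the location column letting one model learn from several series at once.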

3.7. Validation limitations and solutions

Cross-validation mainly works for test data when historical data is available. Overfitting becomes a problem when we over-optimize the model on historical data, and the accuracy of forecasts on future out-of-sample data is then greatly reduced. One solution is to pick a desertified area that has already gone through the entire change process, with similar characteristics and a similar initial NDVI curve, and use it as validation for areas that are currently being desertified.
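A common complementary safeguard for time series, not something the team reports using, is walk-forward (expanding-window) validation, where the model is always tested on years that come after its training years. A minimal sketch with a hypothetical helper:

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(n_samples: int, min_train: int,
                        horizon: int = 1) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with the test block after the train block."""
    for end in range(min_train, n_samples - horizon + 1):
        train = list(range(end))               # everything observed so far
        test = list(range(end, end + horizon)) # the next `horizon` steps
        yield train, test


# Six hypothetical annual observations, at least three used for training.
splits = list(walk_forward_splits(n_samples=6, min_train=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because no future value ever leaks into training, the error measured this way is closer to the out-of-sample error the paragraph above is concerned with.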

3.8. Historical and forecasted NDVI values in selected locations

Below are the results of our time series analysis for both locations.

Location 1: Al Fallujah District

Coordinates: [44.63688130699558,32.232152677425034]

Fig(15): Historical and forecasted NDVI resulted from our analysis in location 1

Location 2: Al-Hamdaniya District

Coordinates: [45.73147467898915,31.334553471770587]

Fig(16): Historical and forecasted NDVI resulted from our analysis in location 2

4. LandUse Land Cover Classification using machine learning with Google Earth Engine (Deepali)

Land Use Land Cover (LULC) classification was carried out using Google Earth Engine (GEE), a cloud-based geospatial analysis platform. LULC maps are very useful for analysing landscape patterns and detecting changes that take place over a period of time. Random forest models were built for the classification.

The following process was adopted for creating the LULC classification map in GEE:

4.1. Selecting area of interest – Central part of Iraq is chosen for creating the LULC map.

Fig(17): Region of interest for land cover classification approach

4.2. Importing and filtering image collection

USGS Landsat 8 Surface Reflectance Tier 1 dataset is used for LULC classification. This dataset is atmospherically corrected for surface reflectance from the Landsat 8 OLI/TIRS sensors.

The following steps were involved:

Filtering satellite images by date
Selecting images that have cloud cover less than 5%

// Build a cloud-masked mean composite with NDVI and NDBI bands added
var data_18 = ee.ImageCollection("LANDSAT/LC08/T1_SR")
  .filterBounds(aoi)
  .filterDate("2021-01-01", "2021-12-31")
  .filterMetadata("CLOUD_COVER", "less_than", 5) // cloud cover below 5%
  .map(maskClouds)
  .map(indices)
  .mean();
print(data_18);

Two functions were created:

1. Removing clouds and cloud shadows from the images by masking: cloud shadow is extracted from bit 3 of the pixel_qa bitmask and cloud from bit 5.

The pixel quality band (pixel_qa) is selected from the Landsat image, and pixels are kept only where both bits equal 0 (eq(0)), which indicates clear conditions.

2. Calculating the Normalised Difference Vegetation Index (NDVI) and the Normalised Difference Built-Up Index (NDBI). NDVI uses bands B5 and B4, and NDBI uses bands B6 and B5.

Map both functions, cloud removal and indices, over the image collection
Calculating NDVI for differentiating vegetation from croplands
Calculating NDBI for built-up area
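The bit test behind the cloud mask can be illustrated outside Earth Engine with plain integers. This sketch uses the bit positions named above (bit 3 for cloud shadow, bit 5 for cloud); the function name and the sample values are made up for illustration.

```python
CLOUD_SHADOW_BIT = 1 << 3  # bit 3 of pixel_qa
CLOUD_BIT = 1 << 5         # bit 5 of pixel_qa


def is_clear(pixel_qa: int) -> bool:
    """A pixel is kept only when both the shadow bit and the cloud bit are 0."""
    return (pixel_qa & CLOUD_SHADOW_BIT) == 0 and (pixel_qa & CLOUD_BIT) == 0


# Hypothetical pixel_qa values built from individual bits.
samples = {
    "no flags": 0,
    "other bit only": 1 << 1,
    "cloud shadow": 1 << 3,
    "cloud": 1 << 5,
    "cloud and shadow": (1 << 3) | (1 << 5),
}
for name, value in samples.items():
    print(f"{name}: keep={is_clear(value)}")
```

In Earth Engine the same test is expressed with `bitwiseAnd(...).eq(0)` on the pixel_qa band, and the result is passed to `updateMask`.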

Fig(18): Normalised Difference Vegetation Index(left)/ Normalised Difference Built-up Index(right)

Fig(19): True colour and false colour composites

4.3. Collecting the training points- 20 sample training points for each of the following 5 categories were collected

Dense vegetation or woodlands
Moderate vegetation or crop lands
Bare soil
Built up area
Water

4.4. Merging training point- After collecting the sample points, all sample points were merged

var classes = water.merge(Dense_Vegetation_Woodlands)
  .merge(Moderate_vegetation_croplands)
  .merge(Bare_soil)
  .merge(Built_up_area);

4.5. Assemble samples for the model

– First, set the geometries selected for training with collection: classes, then assign the label from the class property, and set scale: 30 to match the 30 m spatial resolution of Landsat 8.

– Split the data 80:20, with 80% for training and 20% for testing.

// Assumed predictor list, taken from the bands named in section 4.6
var bands = ['B5', 'B6', 'B4', 'ndbi', 'ndvi'];
var image = data_18.select(bands);

// Assemble samples for the model
var samples = image.sampleRegions({
  collection: classes,
  properties: ['lc'], // label property; must match classProperty used in training
  scale: 30
}).randomColumn('random');

var split = 0.8; // Roughly 80% for training, 20% for testing.
var training = samples.filter(ee.Filter.lt('random', split));
var testing = samples.filter(ee.Filter.gte('random', split));

4.6. Create the classifier and classify the image – Building a model using Random Forest classifier

– Number of trees = 5,

– Training the model using the bands and the land cover property: ‘B5’, ‘B6’, ‘B4’, ‘ndbi’, ‘ndvi’, ‘lc’

var classifier = ee.Classifier.smileRandomForest(5).train({
  features: training.select(['B5', 'B6', 'B4', 'ndbi', 'ndvi', 'lc']),
  classProperty: 'lc', // Pulling the landcover property from classes
  inputProperties: bands
});

4.7. Create the classified image – The classified image has been created

Fig(20): LULC classified image

4.8. Checking the model accuracy: The Random Forest model could classify most of the regions correctly with an accuracy of 82%.

print(classifier.explain());

// Classify the held-out samples and compare predicted and actual labels
var validation = testing.classify(classifier);
var testAccuracy = validation.errorMatrix('lc', 'classification');
print('Validation error matrix RF: ', testAccuracy);

print('Validation overall accuracy RF: ', testAccuracy.accuracy());

var classed = image.select(bands) // select the predictors
  .classify(classifier);

Validation overall accuracy RF:
0.8260869565217391
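Overall accuracy is the number of correctly classified test samples (the diagonal of the error matrix) divided by the total number of test samples. The sketch below uses a made-up 5 × 5 matrix with 19 correct samples out of 23, a ratio that happens to reproduce the reported value. It is not the project's actual error matrix, which is not shown in the article.

```python
from typing import List


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Return sum(diagonal) / sum(all cells) for a square error matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical error matrix: rows are actual classes, columns are predictions.
error_matrix = [
    [4, 0, 0, 0, 0],  # water
    [0, 3, 1, 0, 0],  # dense vegetation
    [0, 1, 4, 1, 0],  # moderate vegetation
    [0, 0, 1, 4, 0],  # bare soil
    [0, 0, 0, 0, 4],  # built-up area
]
accuracy = overall_accuracy(error_matrix)
print(f"overall accuracy: {accuracy:.4f}")
```

With only about 20 test samples, each misclassified point changes the accuracy by several percentage points, so the 82% figure should be read as a rough estimate.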

To visualise the results, we built a dashboard in Python using the Streamlit library.

5. The Dashboard (Mohammed Zuhair Al Taie)

Below we guide you through the different pages that were developed, and we provide the link to the dashboard at the end.

5.1. Home page

This page describes the importance of this project and the motivation behind it.

Fig(21): Dashboard home

5.2. Basemap

This page shows the map of Iraq. No analysis or results are shown here.

Fig(22): Dashboard basemap

5.3. Different Analysis Results

Fig(23): Vegetation Area Analysis

Fig(24): Drought Area Analysis

Fig(25):Time Series Analysis

To get a closer look, feel free to visit the dashboard and take a look at our results here!

Conclusion

For the vegetation area analysis, the percentage of bare area in the Mosul region decreased from 63.9% in 2016 to 58.8% in 2021, and the overall vegetation in the area appears to be improving. The results suggest that the arid area is shrinking and the green area is increasing, which is a good sign.

For the drought area analysis, many of the selected points are classified as drought, and only a few are not.

For the time series analysis, the NDVI curve for the first location fluctuates up and down over the last fifteen years, which suggests that this location will continue to follow this fluctuating trend. The NDVI for the second location, however, shows an increasing trend over the last fifteen years. The difference can be partially related to the nature of the land in each location.

This article is written by Deepali Bidwai, Joyce Annie George, Liangliang Ji, Sai Villiers, Mohammed Zuhair Al Taie, Rasha Salim.