Satellite Imagery for Monitoring and Predicting Water Quality in Kutch Region
April 7, 2022
Introduction
Because of India’s rapid population growth and the rising demand for water for irrigation, domestic use, and industry, available water resources in many parts of the country are being depleted and water quality has deteriorated. The discharge of untreated sewage and industrial effluents pollutes Indian rivers and other water bodies.
Water quality monitoring is essential for obtaining quantitative information about the characteristics of water, identifying changes or trends in water quality over time, and responding to emerging water quality problems, such as sediment, harmful algal blooms, salinity, dissolved organic matter, and dissolved oxygen levels.
One disadvantage of measuring and monitoring water quality in situ is that it can be very expensive and time-consuming. Satellite imagery is an alternative method for monitoring water quality. The team explored Sentinel-2, Sentinel-3, and Landsat 8 satellite imagery for water quality monitoring.
Water Quality Index and Parameters Identification
Satellite-based remote sensing is a cost-effective and efficient method for analyzing and quantifying water quality. It is possible to establish long-term baseline conditions for any region of the world using satellite data. It can also provide information on both the local and regional scales using near real-time satellite data.
There are several water quality parameters available, but the team has chosen the following 8 important parameters to monitor water quality.
- pH: pH is one of the most important parameters of water quality. It measures how acidic or basic a solution is, on a scale of 0 to 14. Solutions with a pH below 7 are “acidic”, solutions with a pH above 7 are “basic” or “alkaline”, and a pH of 7 is “neutral”.
- Salinity: Salinity is simply the measure of dissolved salts in water. Salinity is usually expressed in parts per thousand (ppt) or ‰. Fresh water has a salinity of 0.5 ppt or less.
- Turbidity: Turbidity is the cloudiness or haziness of the water caused by large numbers of individual particles that are generally invisible to the naked eye. Suspended sediments, such as particles of clay, soil, and silt, frequently enter the water from disturbed sites and affect water quality.
- Temperature: Water temperature plays a substantial role in the aquatic system and can determine where aquatic life is found and the quality of the habitat. Warmer water has lower oxygen solubility, limiting oxygen supply.
- Chlorophyll: Chlorophyll-a is a commonly used indicator of eutrophication, and hence of water quality. The amount of chlorophyll found in a water sample is used as a measure of the concentration of phytoplankton. Higher concentrations indicate poorer water quality, usually where high algal production is sustained.
- Suspended matter: Fine particles make up suspended matter. Some are naturally present in river water, such as plankton, fine plant debris, and minerals, while others are the result of human activity (organic and inorganic matter). Suspended matter can make water more turbid, which harms river and stream ecology.
- Dissolved oxygen: Dissolved oxygen (DO) is gaseous, molecular oxygen (O2) originating from the atmosphere. DO concentrations in water are affected by temperature and salinity: the solubility of oxygen in water is inversely related to both, so as temperature and salinity increase, DO decreases.
- Dissolved organic matter (DOM): DOM is defined as the organic matter fraction in solution that passes through a 0.45 μm filter. DOM also includes the mass of other elements present in the organic material, such as nitrogen, oxygen and hydrogen. In this case, DOM refers to the total mass of the dissolved organic matter.
Satellite Data Collection & Image Analysis (EDA)
Data Collection Using Satellite
Several satellites orbit the earth with sensors that could be used to estimate water quality parameters. The spectral, spatial, and temporal resolution of sensors can be used to compare and select them. A short list of sensors was compiled based on the literature.
The following indices have been used for water quality monitoring:
• NDWI: The Normalized Difference Water Index (NDWI) is an index for delineating and monitoring content changes in surface water. It is computed with the near-infrared (NIR) and green bands. NDWI = (Green – NIR) / (Green + NIR).
• MNDWI: The modified NDWI (MNDWI) can enhance open water features while efficiently suppressing and even removing built‐up land noise as well as vegetation and soil noise. It uses green and SWIR bands for the enhancement of open water features.
• NDSI: Normalized Difference Salinity Index.
• NDTI: The Normalized Difference Turbidity Index (NDTI) is used to estimate the turbidity of water bodies from the spectral reflectance values of the water pixels. It relies on the fact that clear water reflects more strongly in the green spectrum than in the red spectrum; as turbidity increases, the reflectance in the red spectrum also increases.
• NDCI: The Normalized Difference Chlorophyll Index (NDCI) is used to estimate chlorophyll-a concentration in turbid water. It is calculated from the red spectral band (B04) and the red edge spectral band (B05).
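The indices listed above share one form, (A − B) / (A + B), and differ only in which bands they pair. The following is a minimal Python sketch of that idea for a single pixel, not the project's code: the function names and the sample reflectances are made up for illustration. The NDTI and NDCI band orderings follow the commonly published definitions (red against green, and red edge against red), which the text above gives only in words. NDSI is left out because its band combination is not stated here.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    if total == 0:
        return None
    return (a - b) / total


def water_indices(green, red, red_edge, nir, swir):
    """Compute the indices described above from one pixel's reflectances."""
    return {
        "ndwi": normalized_difference(green, nir),     # (Green - NIR) / (Green + NIR)
        "mndwi": normalized_difference(green, swir),   # green and SWIR bands
        "ndti": normalized_difference(red, green),     # rises as red reflectance rises
        "ndci": normalized_difference(red_edge, red),  # red edge (B05) against red (B04)
    }


# Illustrative reflectances for a single water pixel (made-up values).
pixel = water_indices(green=0.08, red=0.05, red_edge=0.04, nir=0.02, swir=0.01)
for name, value in pixel.items():
    print(f"{name}: {value:.3f}")
```

Returning None for a zero denominator is one possible design choice; it keeps masked or no-data pixels from raising a division error.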
Data Extraction Steps
• Step 1: Identify your region of interest or the geometry
geometry = ee.Geometry.Point([72.6026,23.0063])
• Step 2: Identify the satellite and use filters like date, filter bounds or cloud pixel percentage
sentinel = (
    ee.ImageCollection("COPERNICUS/S2_SR")
    .filterBounds(vectors)
    .filterDate(start_date, end_date)
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
    .median()
)
• Step 3: Calculate Normalized Difference Water Index (NDWI) using the bands
# Sentinel-2: B3 = green, B8 = NIR (pairing B3 with B11, the SWIR band, would give MNDWI instead)
ndwi = sentinel.normalizedDifference(['B3', 'B8']).rename('ndwi')
• Step 4: Extract a water parameter, for example chlorophyll, based on the regression formulae above
# attach longitude/latitude bands to the NDCI image
latlon = ee.Image.pixelLonLat().addBands(ndci)
# apply reducer to list
latlon = latlon.reduceRegion(
    reducer=ee.Reducer.toList(),
    geometry=vectors,
    scale=100)
# get the NDCI values into a NumPy array
data_ndci = np.array(ee.Array(latlon.get("ndci")).getInfo())
• Step 5: Extract a water parameter, for example dissolved oxygen, based on the regression formulae above
# attach longitude/latitude bands to the dissolved oxygen image
latlon = ee.Image.pixelLonLat().addBands(dissolvedoxygen)
# apply reducer to list
latlon = latlon.reduceRegion(
    reducer=ee.Reducer.toList(),
    geometry=vectors,
    scale=100,
    tileScale=16)
# get the dissolved oxygen values into a NumPy array
data_do = np.array(ee.Array(latlon.get("dissolvedoxygen")).getInfo())
• Step 6: Extract a water parameter, for example temperature, based on the regression formulae above
# attach longitude/latitude bands to the temperature image
latlon = ee.Image.pixelLonLat().addBands(temp)
# apply reducer to list
latlon = latlon.reduceRegion(
    reducer=ee.Reducer.toList(),
    geometry=vectors,
    scale=100)
# get the temperature values into a NumPy array
data_lst = np.array(ee.Array(latlon.get("temp")).getInfo())
• Step 7: Extract a water parameter, for example turbidity, based on the regression formulae above
# attach longitude/latitude bands to the NDTI image
latlon = ee.Image.pixelLonLat().addBands(ndti)
# apply reducer to list
latlon = latlon.reduceRegion(
    reducer=ee.Reducer.toList(),
    geometry=vectors,
    scale=100)
# get the NDTI values into a NumPy array
data_ndti = np.array(ee.Array(latlon.get("ndti")).getInfo())
• Step 8: Extract a water parameter, for example salinity, based on the regression formulae above
# attach longitude/latitude bands to the NDSI image
latlon = ee.Image.pixelLonLat().addBands(ndsi)
# apply reducer to list
latlon = latlon.reduceRegion(
    reducer=ee.Reducer.toList(),
    geometry=vectors,
    scale=100)
# get the NDSI values into a NumPy array
data_ndsi = np.array(ee.Array(latlon.get("ndsi")).getInfo())
• Step 9: Create the dataset using the extracted values from the above steps
df = pd.concat(
    [
        pd.DataFrame(data_do, columns=['Dissolved Oxygen']),
        pd.DataFrame(data_ndsi, columns=['Salinity']),
        pd.DataFrame(data_lst, columns=['Temperature']),
        pd.DataFrame(data_ph, columns=['pH']),
        pd.DataFrame(data_ndti, columns=['Turbidity']),
        pd.DataFrame(data_dom, columns=['Dissolved Organic Matter']),
        pd.DataFrame(data_sm, columns=['Suspended Matter']),
        pd.DataFrame(data_ndci, columns=['Chlorophyll']),
    ],
    axis=1,
    sort=False,
)
Data Visualization
Machine Learning
• Step 1: Data Pre-processing and Exploration
1. The data was gathered separately for the Kutch region’s Hamirsar Lake, Shinai Lake, and Tappar Lake, and then concatenated into a single data file.
2. A careful examination showed that more than 60% of the values were null, so imputing or otherwise replacing them was not feasible: it could have produced imbalanced observations or skewed estimates. The null values were therefore removed.
3. The number of outliers was examined after the null values were removed. Only Dissolved Oxygen had data points beyond the interquartile range; however, because there were so many of them, they could not be treated as unambiguous outliers.
4. The data was examined for multicollinearity between the parameters, but no significant correlations were discovered, with the exception of one relationship between Dissolved Organic Matter and Suspended Matter.
5. We did not have clear class distinctions for the salinity parameter because of a lack of data from diverse saline regions. We therefore dropped salinity from training and added a salinity check condition for the final prediction, as explained in the machine learning section.
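The null-removal and outlier checks in this step can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names and sample rows are made up, and the 1.5 × IQR fence is a common convention assumed here, since the text does not say which rule the team applied.

```python
import statistics


def drop_incomplete(rows):
    """Keep only rows in which every parameter has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample rows; None marks a missing satellite-derived value.
rows = [
    {"Dissolved Oxygen": 7.1, "Temperature": 24.0},
    {"Dissolved Oxygen": None, "Temperature": 25.5},
    {"Dissolved Oxygen": 6.9, "Temperature": 24.8},
    {"Dissolved Oxygen": 7.3, "Temperature": None},
    {"Dissolved Oxygen": 7.0, "Temperature": 25.1},
    {"Dissolved Oxygen": 7.2, "Temperature": 24.4},
    {"Dissolved Oxygen": 15.0, "Temperature": 24.9},
]

complete = drop_incomplete(rows)
do_values = [row["Dissolved Oxygen"] for row in complete]
print(len(complete), "complete rows; outliers:", iqr_outliers(do_values))
```

With a dataframe, the same checks would more typically be done with pandas; plain lists are used here only to keep the sketch self-contained.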
• Step 2: Training and Validation dataset Preparation
1. To perform Supervised Machine Learning, we needed to add labels to the dataset after it was ready.
2. Due to a lack of in-situ data for training in India, we applied research-based thresholds (shown in the table) to categorise the records into ‘Good’, ‘Poor’, and ‘Needs Treatment’, and created our own training and testing data.
3. The data was labelled using the following criteria:
- A record was labelled ‘Good’ if all of its parameter values were within the range for drinking water.
- A record was labelled ‘Needs Treatment’ if any of the chosen parameters fell outside the threshold value range.
- Finally, a record was labelled ‘Poor’ if any of the parameters fell outside the threshold value range.
4. The Min-Max Scaler was then used to normalize the data.
5. There were 989 entries for ‘Needs Treatment’, 50 for ‘poor’ and 461 for ‘good’ in the dataset, which showed that the data was imbalanced.
6. The imbalanced data was therefore balanced by oversampling with SMOTE, resulting in 989 entries for each class.
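To make the normalization and oversampling steps concrete, here is a rough standard-library sketch. It is not the project's implementation: `min_max_scale`, `nearest_neighbour`, and `oversample` are hypothetical helpers, the data points are random, and the oversampling is a simplified SMOTE-style interpolation that uses only the single nearest neighbour, whereas library implementations of SMOTE choose among several neighbours.

```python
import random


def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def nearest_neighbour(point, others):
    """Return the closest other point by squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def oversample(minority, target_size, rng):
    """Grow `minority` to `target_size` by interpolating between neighbours.

    Simplified SMOTE-style step: each synthetic point lies on the segment
    between a random minority point and its single nearest neighbour.
    """
    samples = list(minority)
    while len(samples) < target_size:
        base = rng.choice(minority)
        neighbour = nearest_neighbour(base, [p for p in minority if p is not base])
        gap = rng.random()
        samples.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return samples


rng = random.Random(0)
print(min_max_scale([2.0, 4.0, 10.0]))

# Made-up, already scaled 2-D points standing in for the 50 'poor' records.
poor = [(rng.random(), rng.random()) for _ in range(50)]
balanced_poor = oversample(poor, target_size=989, rng=rng)
print(len(balanced_poor))
```

Because every synthetic point is an interpolation between two existing minority points, the oversampled class stays inside the range already covered by the scaled data.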
• Step 3: Machine Learning Models
1. Various machine learning models were trained on the final dataframe and their metrics were compared, and the model with the best validation accuracy was chosen. Among all the models we evaluated, the Random Forest Classifier performed best and was used for the final deployment.
2. For the final prediction, salinity_class is the class predicted from salinity alone and predicted_class is the class predicted by the model. The following were the 3 conditions for the final prediction:
- Final class is ‘good’, if both salinity_class and predicted_class are good.
- Final class is ‘poor’, if either salinity_class or predicted_class or both are poor.
- Final class is ‘Needs Treatment’, if both are ‘Needs Treatment’ or either of them is ‘Needs Treatment’ with the other being good.
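Taken together, the three conditions above amount to "the worse of the two classes wins". A small Python sketch of that rule follows; the function name `final_class` and the constant names are illustrative, while the class labels are the ones used in the text.

```python
GOOD, NEEDS_TREATMENT, POOR = "good", "Needs Treatment", "poor"


def final_class(salinity_class, predicted_class):
    """Combine the salinity check with the model prediction."""
    classes = {salinity_class, predicted_class}
    if POOR in classes:
        return POOR              # either one (or both) is poor
    if NEEDS_TREATMENT in classes:
        return NEEDS_TREATMENT   # at least one needs treatment, none is poor
    return GOOD                  # both are good


for salinity in (GOOD, NEEDS_TREATMENT, POOR):
    for predicted in (GOOD, NEEDS_TREATMENT, POOR):
        print(f"{salinity!r} + {predicted!r} -> {final_class(salinity, predicted)!r}")
```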
3. Below are the confusion matrix and ROC curve for the final model.
4. The classification report shows the precision, recall, and F1 score for the validation set.
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
5. F1 score provides a way to combine both precision and recall into a single measure that captures both properties.
- F1 score = 2 *( Precision * Recall) / (Precision + Recall)
- Precision: correctly predicted records of a class / all records predicted as that class
- Recall: correctly predicted records of a class / all records that actually belong to that class
6. Confusion matrix: A confusion matrix shows how many predictions the model got right for each class and how its incorrect predictions were spread across the other classes.
- TP (True Positive) = Diagonal entry of the matrix for the class
- FN (False Negative) = Sum of the corresponding row for the class (excluding TP of that class)
- FP (False Positive) = Sum of the corresponding column for the class (excluding TP of that class)
- TN (True Negative) = Sum of all entries outside the row and column of that class
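The definitions above can be checked with a short Python sketch that derives the per-class counts and scores from a multi-class confusion matrix. It assumes rows hold the actual classes and columns the predicted classes, which matches the FN and FP definitions given here. The function name and the counts in `matrix` are made up for illustration and are not the project's results.

```python
def per_class_metrics(matrix, labels):
    """Derive TP, FN, FP, TN, precision, recall and F1 for each class.

    `matrix[i][j]` counts records whose actual class is `labels[i]` and
    whose predicted class is `labels[j]` (rows = actual, columns = predicted).
    """
    total = sum(sum(row) for row in matrix)
    report = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(row[k] for row in matrix) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
                         "precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up validation counts, not the project's actual results.
labels = ["good", "Needs Treatment", "poor"]
matrix = [
    [90, 8, 2],
    [5, 92, 3],
    [1, 4, 95],
]

report = per_class_metrics(matrix, labels)
for label, stats in report.items():
    print(label, stats)
```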
Dashboard
Our dashboard has 7 tabs/pages as listed below :
- Home Page
It acts as the landing page of our dashboard and presents the problem statement: ‘Water Quality Centralized Dashboard for Better Decision Making’.
- About Page
It describes the following:
– Project Goals: to analyze, interpret and visualize the different water quality parameters and compare them with standard limits.
– Location Chosen: Kutch Region (Hamirsar, Shinai and Tappar Lake)
– Developments Made: how the project progressed from identifying parameters and spotting useful data sources, through the absence of in-situ data for the Indian region, to an interactive dashboard that classifies the water of the selected region of interest using a machine learning model.
- Features Page
It covers what the project offers: projecting water quality, monitoring and analyzing existing conditions, and identifying parameters with threshold values.
- Select AOI (Area of Interest) Data Parameters Page
The user sets the water body, parameters, latitude, longitude, start date, and end date for the area of interest.
- Visualizations Page
It displays satellite imagery of the lake from Sentinel.
- Conclusion Page
It covers the tech stack used, a project summary (the water quality crisis, support for decision making, and real-time enforcement), and a conclusion on how the centralized dashboard succeeds in checking water conditions in real time.
- Team Page
It lists all the collaborators and the team lead, each linked to their LinkedIn profile.
Dashboard Gif
Future Scope
- Decide between an airborne and a spaceborne sensor based on the size of the study area.
- Choose the remote sensor with the suitable spatial, temporal, and spectral scales according to the problem to be solved.
- Choose remote sensing images and in-situ samples with closely matching dates to reach acceptable results.
- Apply the proposed model on different in-situ sampling datasets in different regions to ensure model applicability and robustness.
- Solve the problem of temporal scale (revisit times) by fusing data from more than one remote sensing source to fill the gaps of missing dates.
- Use deep learning to improve prediction accuracy, apply band selection to handle the extra bands of hyperspectral sensors, and generalize the model to predict water quality across the whole country.
Conclusion
According to UNICEF, one in nine people worldwide uses drinking water from unimproved and unsafe sources. Water quality is one of the main challenges that societies are facing in the 21st century, threatening human health, limiting food production, reducing ecosystem functions, and hindering economic growth. In line with the Sustainable Development Goals (SDGs), the dashboard was built around several water quality parameters to monitor water quality more effectively and efficiently in real time using satellite imagery and remote sensing techniques. Because traditional in situ methods are costly and time-consuming, advanced geospatial technology offers a way to monitor water quality spatially and temporally, in near real time and automatically. This would help decision makers and stakeholders to make better decisions.
References
Numerous water quality monitoring articles have been studied by the team. A handful of the most notable sources are listed here:
- https://earthdata.nasa.gov/learn/pathfinders/water-quality-data-pathfinder
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017463/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6111878/
Github:
- https://github.com/robmarkcole/satellite-image-deep-learning: Resources for deep learning with satellite & aerial imagery.