How AI Can Protect Our Water: Detecting The Invisible Threats Within

April 5, 2024

The Problem

Water is a source of life, and both its quantity and its quality are of utmost importance to human health. The United States and many other countries are committed to providing clean and safe water for most of their residents, but waterborne diseases continue to be a challenge.

Reports from the CDC indicate that about 7.5 million waterborne illnesses occur annually in the USA, at a cost of approximately $3.3 billion in healthcare.

Lives are therefore continuously at risk, because contaminated water is closely linked to the transmission of diseases such as cholera, diarrhea, hepatitis A, typhoid fever, dysentery, and polio.

The main causes of these diseases are microorganisms, viruses, and fecal matter in drinking water, which result from aging infrastructure, chlorine-resistant pathogens, and an increase in recreational water use. Water quality monitoring systems already exist as part of many water infrastructures, but several aspects still need improvement: the timely availability of results, the reliability of the data, the performance and robustness of existing systems, and the interpretability of the results for end users.

https://www.cdc.gov/healthywater/surveillance/burden/resources.html

The Background

Nepal's 1st hydro dam

Worldwide, 4 billion cases of water-related diseases cause 3.4 million deaths every year, making them a leading cause of death, especially among children under 5 years old. The situation is much worse in the rural areas of many developing nations.

Most developed and developing nations have governmental or private infrastructure in place to assure the quality of drinking water and minimize the health effects of waterborne illnesses. Much of this infrastructure is based on testing samples from different water sources for the prevalence of disease-causing microorganisms (among other, inorganic health hazards). However, doing this at scale requires a high level of investment and resources.

The Goal

This project aimed to create a low-cost method of detecting microorganisms (bacteria) in drinking water and thereby reduce the time required for water quality testing. It was designed to cover the majority of the microorganisms found in drinking water and to be user-friendly at a local scale. The detection method includes a binary classification of microorganisms as harmful or not harmful, because users are unlikely to be able to tell whether a microorganism is harmful from its name alone.

The four main objectives of this project were:

Create a low-cost method that is easy to access and easy to use for detecting microorganisms (bacteria) in drinking water.
Access suitable data of common microorganisms in water, to train a deep learning model.
Train a CNN (or equivalent) to recognize and classify bacteria using Computer Vision techniques.
Deploy the trained and tested model on a mobile phone.

Our Approach

Coming up with the Design Concept

The aim was to develop an object detection deep learning model to identify the presence of microorganisms in a water sample. The method relies heavily on differentiating shapes, which made it applicable to this study because microorganisms differ in shape at least at the genus level.

Object detection and classification are two widely used techniques in the field of computer vision. Object detection takes images, videos, or camera feeds as input and locates each object in a frame, typically with a bounding box and often in real time.

Classification then assigns a class to what was found, based on a pre-labelled set of classes. In practice, most modern object detectors perform both steps and return a box together with a class label for every detected object, while a pure image classifier returns a single label for the whole image.
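To make the difference between the two kinds of output concrete, here is a minimal sketch in Python. The `Classification` and `Detection` types, the helper function, and all values are hypothetical and invented for illustration; they are not taken from the project code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Classification:
    """Output of an image classifier: one label for the whole image."""
    label: str
    confidence: float


@dataclass
class Detection:
    """One detected object: where it is, what it is, and how sure the model is."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    label: str
    confidence: float


def labels_above_threshold(detections: List[Detection], threshold: float = 0.5) -> List[str]:
    """Return the distinct labels of detections at or above the confidence threshold."""
    seen: List[str] = []
    for det in detections:
        if det.confidence >= threshold and det.label not in seen:
            seen.append(det.label)
    return seen


# Hypothetical outputs for a single micrograph (all values invented for illustration).
whole_image = Classification(label="harmful", confidence=0.91)
found = [
    Detection(box=(12.0, 30.0, 88.0, 95.0), label="Paramecium", confidence=0.84),
    Detection(box=(140.0, 60.0, 190.0, 120.0), label="Euglena", confidence=0.42),
    Detection(box=(200.0, 15.0, 260.0, 70.0), label="Paramecium", confidence=0.77),
]
print(whole_image.label, labels_above_threshold(found))
```

The classifier answers one question about the image as a whole, while the detector returns a list that can be filtered by confidence before anything is shown to the user.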

Planning out the project

Project Management

Project management is crucial for the success of any project. Collaborators on this project resided in different countries and continents, so it was vital to keep everyone involved by communicating responsibilities effectively, managing documentation, and monitoring progress.

Omdena’s framework for Project Management makes sure that all collaborators can easily adapt to it. This means using publicly available tools that most collaborators would have a level of familiarity with. This approach saves valuable time and resources that would otherwise be spent on onboarding and training.

Three main tools were used for project management:

Slack – For effective communication.
Google Drive – For the organization and storage of project documents and files.
DagsHub – a Data Science platform for collaboration.

Datasets

Access to a comprehensive dataset of microscopic images of microorganisms proved very difficult. Our aim was to develop a deep learning model, which requires a relatively large amount of data. However, detailed data in this domain is rarely publicly available for download and use.

After some research, we decided to use the Environmental Microorganism Image Dataset (EMDS). Several versions of this dataset have been developed over the years, and the most recent versions have been made public. We used the sixth version (EMDS-6) and the seventh version (EMDS-7).

EMDS-6 Dataset

Additional data

The EMDS-6 and EMDS-7 datasets were a good fit for the study, but their classes did not include the common pathogenic microorganisms tested for in drinking water. To solve this problem, we searched for data on common microorganisms found in drinking water, most of which are highlighted in WHO standards, e.g., E. coli, Salmonella, and Shigella. The research team identified eleven new classes and populated them with images scraped from the web.

Pre-processing

Classification

The EMDS-7 dataset consisted of 41 classes of microorganisms, while EMDS-6 had 21. Our research team identified the pathogenicity of each class of microorganisms and marked it as 0 or 1 for non-pathogenic or pathogenic respectively.
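The 0/1 marking described above can be sketched as a simple lookup table. This is only an illustration: the class names and flags below are placeholders rather than the team's actual labels, and the rule that an image is harmful if any organism in it is pathogenic is an assumption made for the example.

```python
from typing import Dict, Iterable

# Hypothetical lookup table: class name -> 1 (pathogenic) or 0 (non-pathogenic).
# These names and flags are placeholders, not the labels assigned in the project.
PATHOGENICITY: Dict[str, int] = {
    "class_a": 0,
    "class_b": 1,
    "class_c": 0,
}


def binary_label(class_name: str, table: Dict[str, int] = PATHOGENICITY) -> int:
    """Map a microorganism class name to its 0/1 pathogenicity flag."""
    if class_name not in table:
        raise KeyError(f"No pathogenicity flag recorded for class {class_name!r}")
    return table[class_name]


def image_is_harmful(class_names: Iterable[str], table: Dict[str, int] = PATHOGENICITY) -> bool:
    """Assumed rule: an image is harmful if any organism in it is flagged pathogenic."""
    return any(binary_label(name, table) == 1 for name in class_names)


print(binary_label("class_b"), image_is_harmful(["class_a", "class_c"]))
```

Raising an error for an unknown class, rather than defaulting to 0, avoids silently treating an unreviewed organism as safe.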

Annotation

The annotation of EMDS-6 was done with the ‘Roboflow’ software. Roboflow is an efficient tool that streamlines the annotation process, saving time and ensuring accuracy and consistency of the annotated data.

Post-processing

As a best practice, images should undergo further processing before model training. These additional steps include verifying that the dataset is balanced across classes and that magnification, coloration and hue, and image dimensions are consistent across all classes.

Verification of balanced dataset: Ensures fair model training and keeps the model from becoming biased towards one class.
Color and hue: Uniformity in color improves the precision of the model.
Image dimension and resolution: Deep learning models train faster on smaller images since the algorithm learns from fewer pixels.
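Two of the checks above can be sketched with the Python standard library alone. The imbalance threshold of 1.5 and the target long side of 640 pixels are assumptions chosen for illustration, not values reported by the project.

```python
from collections import Counter
from typing import Iterable, Tuple


def class_balance(labels: Iterable[int], max_ratio: float = 1.5) -> Tuple[Counter, float, bool]:
    """Count samples per class and report whether largest/smallest stays within max_ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio, ratio <= max_ratio


def resized_dimensions(width: int, height: int, long_side: int = 640) -> Tuple[int, int]:
    """Scale an image size so its longer side equals long_side, keeping the aspect ratio."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)


# Hypothetical label list: 480 non-pathogenic (0) and 520 pathogenic (1) images.
labels = [0] * 480 + [1] * 520
counts, ratio, balanced = class_balance(labels)
print(counts[0], counts[1], round(ratio, 3), balanced)
print(resized_dimensions(2048, 1536))
```

The resize helper only computes the target size; the actual resampling would be done by whichever image library the pipeline already uses.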

Models and their Metrics

We developed two types of models:

1. Binary Classification Model

This task is focused on determining whether the water is contaminated or not. The dataset used for this task was balanced, containing nearly equal instances of both classes (harmful and not harmful).

2. Object Detection

This task is focused on locating and identifying the pathogens present in the water. Several model algorithms were tried, but one challenge we faced with most object detection models was the small size of microorganisms, which makes them difficult to detect.

This was addressed by using models that perform well at detecting small objects, such as Faster R-CNN, YOLOv8, EfficientNet, Detectron2, and SSD.

Testing the Models

To find the best-performing algorithm, six models were developed: three for binary classification and three for object detection. They were compared in terms of accuracy, memory use, and inference time.

The models tested were:

YOLOv8-JCOLANO
YOLOv8m-cls
Rastogi’s object detection 1
Detectron2
Rastogi’s object detection 2
Rastogi’s image classification 1
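A comparison along these three axes can be sketched with a small harness like the one below. It is an assumption about how such a test might be set up, not the team's evaluation code: `benchmark` is a hypothetical helper, the stand-in model is a simple threshold function, and `tracemalloc` only tracks memory allocated by Python, so it would not capture GPU or native library memory.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Sequence


def benchmark(predict: Callable[[Any], Any], samples: Sequence, labels: Sequence) -> Dict[str, float]:
    """Measure accuracy, mean seconds per sample, and peak Python memory for predict."""
    if len(samples) == 0 or len(samples) != len(labels):
        raise ValueError("samples and labels must be non-empty and of equal length")
    tracemalloc.start()
    start = time.perf_counter()
    predictions = [predict(sample) for sample in samples]
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    correct = sum(pred == truth for pred, truth in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "seconds_per_sample": elapsed / len(samples),
        "peak_mib": peak_bytes / (1024 * 1024),
    }


def stand_in_model(score: float) -> int:
    """Hypothetical stand-in for a real model: flags a sample as harmful above 0.5."""
    return int(score > 0.5)


# Invented inputs and ground-truth labels, used only to exercise the harness.
report = benchmark(stand_in_model, [0.1, 0.9, 0.7, 0.3], [0, 1, 1, 1])
print(report["accuracy"])
```

Running every candidate through the same function on the same held-out samples keeps the comparison fair, since all models are timed and scored under identical conditions.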

The web application was developed using Streamlit and is currently deployed on Streamlit’s community cloud. Based on the testing, we decided to use two models, one for binary classification and one for individual microorganism detection:

YOLOv8m-cls

Purpose: Binary classification for water contamination detection.
Usage: Users can determine if water is contaminated through binary classification.

Detectron2 Model

Purpose: Individual microorganism detection.
Usage: Users can detect and identify various microorganisms present in the provided images.

Constraints

Deployment on GitHub Repository

To deploy on Streamlit’s community cloud, the deployment code must be hosted in a GitHub repository separate from the DagsHub repository. At the time of writing, the code is hosted on a personal account and must be moved to an account associated with the San Jose chapter.

Large File Storage (LFS)

Since Detectron2’s model file exceeds 100 MB, Git Large File Storage (LFS) was used to manage and version large files efficiently. This requires installing and configuring Git LFS. Do note that each GitHub account receives 1 GiB of free storage and 1 GiB a month of free bandwidth.
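Before pushing, it can help to list which files in the repository would need Git LFS. The following is a minimal sketch under the assumption that the relevant threshold is 100 MiB; `files_needing_lfs` is a hypothetical helper written for this example, not part of the project.

```python
from pathlib import Path
from typing import List, Union

# Assumed threshold, based on the 100 MB figure mentioned above.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def files_needing_lfs(root: Union[str, Path], limit_bytes: int = GITHUB_FILE_LIMIT_BYTES) -> List[str]:
    """Return paths (relative to root) of files larger than limit_bytes, skipping .git."""
    root_path = Path(root)
    oversized: List[str] = []
    for path in sorted(root_path.rglob("*")):
        relative = path.relative_to(root_path)
        if ".git" in relative.parts:
            continue  # Git's own object store is not part of the working tree
        if path.is_file() and path.stat().st_size > limit_bytes:
            oversized.append(relative.as_posix())
    return oversized
```

Calling `files_needing_lfs(".")` from the repository root returns the candidates. Each one, or a pattern that matches it, can then be registered with `git lfs track` before the file is committed.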

Detectron2 Installation Handling

Installing the necessary packages automatically via the requirements.txt file poses an issue with Detectron2, particularly when it is installed without a pre-built package. To address this, we used the latest pre-built packages for Detectron2, which are built against torch 1.10. Since YOLOv8 also relies on torch, we keep torch 1.10 throughout the system to prevent library conflicts, which limits compatibility to Python 3.9.
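Because the app only works with this specific combination, a small start-up check can fail early with a readable message instead of a confusing import error. The sketch below is an assumption about how such a guard could look; `environment_problems` is a hypothetical helper, and the pinned versions are simply the ones named above.

```python
import sys
from typing import List, Sequence, Tuple


def environment_problems(
    python_version: Sequence[int],
    torch_version: str,
    required_python: Tuple[int, int] = (3, 9),
    required_torch: str = "1.10",
) -> List[str]:
    """Return human-readable mismatches between the running and the pinned versions."""
    problems: List[str] = []
    found_python = tuple(python_version[:2])
    if found_python != required_python:
        problems.append(
            f"Python {found_python[0]}.{found_python[1]} found, "
            f"{required_python[0]}.{required_python[1]} expected"
        )
    # torch version strings can carry a build suffix, e.g. "1.10.0+cu113".
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    if major_minor != required_torch:
        problems.append(f"torch {torch_version} found, {required_torch}.x expected")
    return problems


# The torch version is passed as a string here so the sketch runs without torch
# installed; in the app it would come from torch.__version__.
print(environment_problems(sys.version_info, "1.10.0+cu113"))
```

An empty list means the environment matches the pins; anything else can be shown to the user before the models are loaded.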

Future Steps

Further Discussion on App Functionality and Hosting

Additional discussions are needed to determine the desired functionality and hosting location for the application. For now, a free hosting option such as Streamlit community cloud was leveraged; however, this limits the functionality that can be provided.

Add REST Capabilities

Consideration should be given to adding REST capabilities to the application for improved integration and external access.
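As a starting point for that discussion, the sketch below shows what a minimal REST endpoint could look like using only the Python standard library. Everything in it is hypothetical: the `/classify` path, the response fields, and `classify_stub`, which stands in for a real model call and always returns a fixed answer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def classify_stub(image_bytes: bytes) -> dict:
    """Placeholder for the real model call; always returns a fixed answer."""
    return {"contaminated": False, "confidence": 0.0, "bytes_received": len(image_bytes)}


def handle_request(method: str, path: str, body: bytes) -> Tuple[int, dict]:
    """Routing logic kept free of I/O: returns (HTTP status, JSON-serialisable payload)."""
    if path != "/classify":
        return 404, {"error": "unknown endpoint"}
    if method != "POST":
        return 405, {"error": "use POST"}
    if not body:
        return 400, {"error": "empty request body"}
    return 200, classify_stub(body)


class ClassifyHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle_request."""

    def _respond(self, method: str) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else b""
        status, payload = handle_request(method, self.path, body)
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        self._respond("GET")

    def do_POST(self) -> None:
        self._respond("POST")


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the server; blocks until interrupted. Not called automatically."""
    HTTPServer((host, port), ClassifyHandler).serve_forever()
```

Keeping the routing in `handle_request` means the behaviour can be tested without opening a socket, and the stub can later be swapped for a call into whichever model the team settles on.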

Please note that these future steps require further deliberation and planning.

What We Learned

Challenges & Limitations

The original aim of the project was to create an app that can detect common pathogens in drinking water. However, no dataset was available for some common pathogens such as Salmonella, E. coli, and Shigella. This app is therefore a prototype that will undergo further development. A few other things also need to be addressed.

Better Datasets

Companies that hold bacterial images will not make them public, as they are time-consuming and very expensive to obtain. As a result, we have the following options:

1. Create environmental microorganism data – which is not easy to do.

2. Scour universities/research organizations to find if this is available.

Performance of model on field data

We need to understand what we can see under a paper microscope vs. a digital microscope and how the models perform under the two kinds of images.

User friendly deployment

We need to evaluate deployment methods and identify the best options. For example, the app could be deployed on a mobile phone and possibly work offline.

Opportunities

A tool for researchers

The models and apps can be useful to researchers, who may also wish to develop them further.

Potential as a community-based tool

One challenge of training the model on microorganisms that are only visible at higher magnification is the cost of microscopes capable of resolving them. NGOs or governments can provide funds that will make this possible.

Time Frame

The entire project was completed in a six-week period between September and October 2023. In this time, we were able to achieve all of the following:

Sourcing difficult to find datasets and processing them.
Testing and evaluating multiple AI models to find the correct one for this application.
Performing a thorough EDA to garner valuable insights.
Deploying the frontend as a publicly available demo.

Further Applications of This Technology

Water, especially drinking water, is becoming increasingly scarce in many parts of the world. Systems such as the one we built will be invaluable in making sure this precious resource remains unspoiled by disease-causing microbes. Beyond this, there are many uses for our methodology and prototype:

1. Medicine

Water-borne diseases are not always easily detected and diagnosed. Having a system that can test samples quickly can allow doctors to find the cause of a certain condition in a timely manner.

2. Agriculture

A lot of water-borne contaminants spread via the food grown with infected water. This tool can allow farmers to ensure that the water they are using is not contaminated.

3. Epidemiology

Many epidemics are caused by a shared water source that becomes contaminated. Monitoring these sources can help prevent such outbreaks.

Conclusion

This project highlights the high utility of AI-based systems in ensuring water quality. In particular, it underlines their ability to perform testing at scale while reducing the investment required from governments and other organizations.

Our tool achieved a high level of accuracy and reliability. However, it was limited by external factors, especially the availability of comprehensive datasets in this domain. Despite this, it serves as a valuable proof of concept for similar applications around the world, which can achieve the robustness required for real-world use if given access to the relevant resources.

We hope that this project can also serve as an inspiration for all whose mission is to achieve a world free of the specter of water-borne illnesses. To this end, we are making the resources from this project available for everyone to access. You may find these resources listed below.