How AI Can Protect Our Water: Detecting The Invisible Threats Within
Omdena harnesses AI to reveal hidden water contaminants, empowering communities with innovative, low-cost, and scalable water protection.
April 5, 2024

Omdena’s AI-powered tool detects harmful microorganisms in drinking water using deep learning models like YOLOv8 and Detectron2. The low-cost system enables faster, accurate testing, helping communities and organizations ensure clean, safe water worldwide.
The Problem
Water is a source of life, and its quantity and quality are of utmost importance to human life. The United States and many other countries are committed to providing clean and safe water for most of their residents, but waterborne diseases continue to be a challenge.
According to CDC reports, about 7.5 million waterborne illnesses occur annually in the USA, resulting in approximately $3.3 billion in healthcare costs.
Lives remain at risk because contaminated water is closely linked to the transmission of diseases such as cholera, diarrhea, hepatitis A, typhoid fever, dysentery, and polio.
The main causes of these diseases are microorganisms, viruses, and fecal matter that reach drinking water through aging infrastructure, chlorine-resistant pathogens, and increased recreational water use. Water quality monitoring systems exist in many water infrastructures, but they still face several challenges: timely availability of results, reliability of the data, performance and robustness of the systems, and interpretability of the results for end users.

The Background

Each year, 4 billion cases of water-related diseases lead to 3.4 million deaths worldwide, making them one of the leading causes of mortality among children under five. The situation is even more severe in rural regions of developing countries, where access to clean water and sanitation is often limited.
Both developed and developing nations have established governmental and private systems to monitor drinking water quality and reduce the risks of waterborne illnesses. These systems typically rely on testing water samples from different sources to detect disease-causing microorganisms and other contaminants.
However, implementing such testing at scale demands significant investment and resources, which can limit the reach and efficiency of existing infrastructure.
The Goal
This project aimed to develop a low-cost, efficient method for detecting microorganisms (bacteria) in drinking water, helping to reduce the time required for water quality testing. It was designed to identify most of the microorganisms commonly found in drinking water while remaining user-friendly and practical at a local level.
The detection method includes a binary classification system that categorizes microorganisms as either harmful or not harmful. This approach ensures accessibility for users who may not have the technical knowledge to determine the safety of microorganisms by name alone.
The four main objectives of this project were:
- Develop a low-cost, easy-to-use method for detecting microorganisms (bacteria) in drinking water.
- Access and prepare suitable data of common microorganisms in water to train a deep learning model.
- Train a CNN (or equivalent) to recognize and classify bacteria using computer vision techniques.
- Deploy the trained model on a mobile device for real-world usability.
Our Approach
Coming up with the Design Concept
The goal was to develop an object detection deep learning model capable of identifying the presence of microorganisms in water samples. Since this method relies heavily on shape differentiation, it was ideal for this study—microorganisms typically vary in shape at least at the genus level, making visual detection feasible.
Object detection and classification are widely used techniques in computer vision. Object detection uses images, videos, or live camera feeds to recognize and locate objects within a frame in real time.
Object classification takes this process a step further. Once the object’s presence is confirmed, the system determines its class based on a predefined labeled dataset, enabling more precise categorization and analysis.
Planning out the project

Project Management
Good project management was key to keeping everything on track. Since our collaborators were spread across several countries and continents, clear communication and coordination were essential. The team needed a way to share responsibilities, manage documentation, and monitor progress effectively.
Omdena’s project management framework made this process smooth and inclusive. It focuses on using simple, publicly available tools that most contributors are already familiar with. This approach saves valuable time and avoids the long onboarding or training that many international collaborations require.
We used three main tools to manage the project:
- Slack: for day-to-day communication and quick updates
- Google Drive: for organizing and storing project files and documents
- DagsHub: for data science collaboration and version control
Datasets
Finding a comprehensive dataset of microscopic images of microorganisms turned out to be one of the biggest challenges in this project. Building a deep learning model requires a large amount of data, yet detailed datasets in this field are rarely available to the public.
After some research, the team decided to use the Environmental Microorganism Image Dataset (EMDS). Over the years, several versions of this dataset have been released, and we used the two most recent ones — EMDS-6 and EMDS-7 — as our primary data sources.
EMDS-6 Dataset and Additional Data
While the EMDS-6 and EMDS-7 datasets were valuable, they didn’t include many of the common pathogenic microorganisms found in drinking water. To bridge that gap, the team searched for additional data, focusing on pathogens highlighted in WHO standards, such as E. coli, Salmonella, and Shigella.
In total, eleven new microorganism classes were identified and populated with images collected from reliable web sources, strengthening the dataset and making it more relevant to real-world water testing scenarios.
Pre-processing
Classification
The EMDS-7 dataset contained 41 classes of microorganisms, while EMDS-6 included 21. The research team reviewed each microorganism and labeled it as pathogenic (1) or non-pathogenic (0) based on its potential to cause disease.
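The labelling step described above can be sketched as a lookup from class name to a binary flag. This is only an illustration: the class names and their 0/1 values below are placeholder assumptions, not the team's actual review results, and the helper names are hypothetical.

```python
# Illustrative labels only (assumed, not the project's reviewed table):
# 1 = pathogenic, 0 = non-pathogenic.
PATHOGENIC_LABELS = {
    "Escherichia coli": 1,
    "Salmonella": 1,
    "Shigella": 1,
    "Paramecium": 0,
    "Euglena": 0,
}


def to_binary_label(class_name: str) -> int:
    """Map a microorganism class name to its pathogenic (1) / non-pathogenic (0) label."""
    try:
        return PATHOGENIC_LABELS[class_name]
    except KeyError:
        # Refuse to guess: an unreviewed class must be labelled by a person first.
        raise ValueError(f"No pathogenicity label reviewed for {class_name!r}") from None


def sample_is_harmful(detected_classes) -> bool:
    """A sample counts as harmful if any detected class is labelled pathogenic."""
    return any(to_binary_label(name) == 1 for name in detected_classes)
```

Raising an error for unknown classes, rather than defaulting to "not harmful", keeps an unreviewed organism from being silently reported as safe.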
Annotation

To label the data efficiently, the team used Roboflow, a tool that streamlines the annotation process and ensures accuracy and consistency across the dataset. This helped save time while maintaining high-quality data for model training.
Post-processing
Before training, the images went through several additional processing steps to improve model performance. These included checking for balanced classes, consistent magnification levels, uniform color and hue, and standardized image dimensions.
- Balanced dataset: Ensures fair model training and prevents bias toward one class.
- Color and hue consistency: Uniform coloration improves the precision and reliability of predictions.
- Image dimension and resolution: Smaller, standardized images help the model train faster by reducing the number of pixels it needs to process.
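Two of the checks listed above, class balance and standardized dimensions, can be sketched in plain Python. This is a minimal sketch under assumptions: the 10% tolerance and the 640-pixel target are hypothetical values chosen for illustration, not settings reported by the team.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset as a fraction of 1."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_balanced(labels, tolerance=0.1):
    """True when every class share is within `tolerance` of an even split."""
    shares = class_balance(labels)
    even_share = 1 / len(shares)
    return all(abs(share - even_share) <= tolerance for share in shares.values())


def fit_within(width, height, target=640):
    """Scale (width, height) so the longer side equals `target`, keeping the aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, `fit_within(1280, 960)` gives the size a 1280x960 micrograph would be resized to before training, and `is_balanced` flags a dataset where one class dominates.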
Models and Their Metrics
The team developed two main types of models for this project:
1. Binary Classification Model
This model focuses on determining whether a water sample is contaminated or not. The dataset used for this task was carefully balanced, containing nearly equal numbers of samples from both classes — harmful and not harmful.
2. Object Detection
This model identifies and locates pathogens within water samples. Detecting microorganisms is difficult because they are extremely small, so the team used algorithms known to perform well at small-object detection, including Faster R-CNN, YOLOv8, EfficientNet, Detectron2, and SSD.
Testing the Models
To identify the best-performing algorithm, the team developed and tested six different models — three focused on binary classification and three on object detection. Each model was evaluated for accuracy, memory usage, and inference time to determine its overall performance.
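The three evaluation criteria named above (accuracy, memory usage, and inference time) can be measured with a small harness like the one below. It is a sketch, not the team's evaluation code: `predict` is a hypothetical callable standing in for any of the models, and `tracemalloc` only sees memory allocated through Python, so native buffers held by a deep learning framework may not be counted.

```python
import time
import tracemalloc


def benchmark(predict, samples, labels):
    """Measure accuracy, mean inference time, and peak Python memory for one model.

    `predict` is assumed to take one sample and return one label.
    """
    if not samples or len(samples) != len(labels):
        raise ValueError("samples and labels must be non-empty and equal in length")

    # Pass 1: timing and accuracy, run without memory tracing overhead.
    start = time.perf_counter()
    predictions = [predict(sample) for sample in samples]
    elapsed = time.perf_counter() - start

    # Pass 2: peak memory allocated through Python while predicting.
    tracemalloc.start()
    for sample in samples:
        predict(sample)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "mean_inference_s": elapsed / len(samples),
        "peak_memory_bytes": peak_bytes,
    }
```

Running the same harness over every candidate model with the same held-out samples gives directly comparable numbers for choosing between them.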
The models tested were:
- YOLOv8-JCOLANO
- YOLOv8m-cls
- Rastogi’s object detection 1
- Detectron2
- Rastogi’s object detection 2
- Rastogi image classification 1
The web application was developed using Streamlit and is currently deployed on Streamlit’s Community Cloud. Based on the test results, we selected two models, one for binary classification and one for individual microorganism detection:
YOLOv8m-cls
- Purpose: Binary classification for water contamination detection.
- Usage: Users can determine if water is contaminated through binary classification.
Detectron2 Model
- Purpose: Individual microorganism detection.
- Usage: Users can detect and identify various microorganisms present in the provided images.
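To show what the detection output might look like once it reaches the user, here is a small sketch that turns a list of detections into per-organism counts. The `Detection` structure, its field names, and the 0.5 confidence threshold are assumptions made for illustration; they are not the output format of Detectron2 or of the deployed app.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected organism (hypothetical structure, for illustration only)."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def summarize(detections, min_confidence=0.5):
    """Count detections per label, ignoring boxes below the confidence threshold."""
    kept = (d.label for d in detections if d.confidence >= min_confidence)
    return dict(Counter(kept))
```

A summary of this kind could be shown next to the annotated image so that users see at a glance which organisms were found and how many.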
Constraints
Deployment on GitHub Repository
To deploy the application on Streamlit’s Community Cloud, the code needed to be hosted on a GitHub repository separate from the DagsHub repository. At the time of documentation, the deployment code was hosted on a personal account and needed to be transferred to an account associated with the San Jose chapter.
Large File Storage (LFS)
Since the Detectron2 model file exceeded 100 MB, the team used Git Large File Storage (LFS) to handle and version large files efficiently. This setup requires installing and configuring Git LFS. Each GitHub account provides 1 GiB of free storage and 1 GiB of monthly bandwidth, so large assets have to be managed within these limits.
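Before pushing, it can help to check which files in the repository are over the size limit and therefore need LFS. The helper below is a hypothetical sketch using only the standard library; the function name and the default limit constant are assumptions, not part of the project's tooling.

```python
from pathlib import Path

# Assumed default: the 100 MB per-file threshold mentioned above, in bytes.
FILE_LIMIT_BYTES = 100 * 1024 * 1024


def files_needing_lfs(repo_root, limit_bytes=FILE_LIMIT_BYTES):
    """Return repo-relative paths of files larger than `limit_bytes`."""
    root = Path(repo_root)
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip Git's own object store; only working-tree files matter here.
        if ".git" in relative.parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            large.append(relative)
    return sorted(large)
```

Any path it reports would then be registered with `git lfs track` before committing, so that the model weights are stored in LFS rather than in the regular Git history.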
Detectron2 Installation Handling
Installing the necessary packages automatically from the requirements.txt file is problematic for Detectron2, particularly when it is installed without a pre-built package. To address this, the team used the latest pre-built Detectron2 packages, which are built on torch 1.10. Because YOLOv8 also relies on torch, the whole system is kept on torch 1.10 to prevent library conflicts, which limits compatibility to Python 3.9.
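Because the deployment depends on this exact pairing, a start-up check can fail early with a clear message when the environment drifts. The sketch below assumes the versions stated above (Python 3.9 and torch 1.10); the function name is hypothetical, and the torch version is passed in as a string so the check itself does not import torch.

```python
import sys


def check_environment(python_version=None, torch_version=""):
    """Return a list of problems with the pinned Python 3.9 / torch 1.10 setup."""
    if python_version is None:
        python_version = sys.version_info
    problems = []

    if tuple(python_version[:2]) != (3, 9):
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python 3.9 expected, found {found}")

    # Version strings may carry a build suffix such as "+cu113"; compare major.minor only.
    major_minor = torch_version.split("+")[0].split(".")[:2]
    if major_minor != ["1", "10"]:
        problems.append(f"torch 1.10 expected, found {torch_version or 'none'}")

    return problems
```

In the app this could be called as `check_environment(torch_version=torch.__version__)` right after importing torch, stopping with the returned messages if the list is not empty.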
Future Steps
Further Discussion on App Functionality and Hosting
The team plans to hold additional discussions to decide on the desired functionality and the best hosting options for the application. For now, the project uses the Streamlit Community Cloud as a free hosting platform. While this setup works for demonstration purposes, it limits the overall functionality and scalability of the application.
Add REST Capabilities
There are also plans to add REST API capabilities to the application, which would make integration with other systems easier and enable external access to its features. These next steps will require more detailed planning and collaboration to ensure the project continues to grow and serve broader use cases.
What We Learned
Challenges & Limitations
The main goal of the project was to create an app capable of detecting common pathogens in drinking water. However, one of the biggest challenges was the lack of publicly available datasets for some of the most common bacteria, such as Salmonella, E. coli, and Shigella. Because of this, the current version serves as a prototype that will continue to evolve as more data becomes available.
Better Datasets
Companies that own bacterial image data rarely make it public, as collecting and labeling these images is both time-consuming and expensive. To move forward, the team identified two possible options:
- Create new environmental microorganism datasets, though this would be complex and resource-intensive.
- Partner with universities or research institutions that may already have relevant datasets available.
Performance of the model on field data
Further testing is needed to evaluate how well the model performs under real-world conditions — specifically, comparing results from paper microscopes versus digital microscopes to understand how image quality affects detection accuracy.
User-friendly deployment
The deployment approach also needs refinement. The next step is to identify the best method to make the app more accessible — ideally enabling it to run on mobile devices and work offline, making it more practical for field use in remote areas.
Opportunities
A tool for researchers
The models and applications developed through this project can be valuable resources for researchers who wish to build upon or expand the work. With further development, these tools could support advanced studies in microbiology, environmental science, and public health.
Potential as a community-based tool
There’s also strong potential for this system to become a community-level tool. One of the main challenges in detecting microorganisms that require high magnification is the cost of microscopes capable of identifying them. With proper support from NGOs or government initiatives, funding could help make such equipment more accessible. This would allow communities to test their own water sources quickly and cost-effectively, empowering local efforts to improve water safety.
Time Frame
The entire project was completed in a six-week period between September and October 2023. In that time, we achieved all of the following:
- Sourcing and processing hard-to-find datasets.
- Testing and evaluating multiple AI models to find the best fit for this application.
- Performing a thorough exploratory data analysis (EDA) to gain valuable insights.
- Deploying the frontend as a publicly available demo.
Further Applications of This Technology

Clean drinking water is becoming increasingly scarce in many parts of the world. The system developed through this project can play a vital role in keeping this precious resource free from disease-causing microorganisms.
Beyond water quality testing, this technology and its underlying methodology have potential uses in several other fields:
1. Medicine
Waterborne diseases are not always easy to detect or diagnose. A quick, reliable testing system can help doctors identify the cause of illnesses faster and begin treatment sooner.
2. Agriculture
Contamination can easily spread to crops irrigated from contaminated water sources. This tool can help farmers ensure that the water they use for irrigation is clean and safe.
3. Epidemiology
Many outbreaks originate from contaminated shared water sources. Regular monitoring with systems like this can help detect contamination early and prevent widespread epidemics.
Conclusion
This project demonstrates the powerful role that AI-based systems can play in maintaining water quality. It shows how technology can help conduct large-scale testing efficiently while reducing the financial and operational burden on governments and organizations.
The tool achieved a high level of accuracy and reliability, though it was limited by factors such as the availability of comprehensive datasets. Even with these challenges, it stands as a strong proof of concept, highlighting what’s possible when AI is applied to public health and environmental monitoring. With the right data and resources, similar models could be developed into robust, real-world solutions.
We also hope this project serves as an inspiration for others working toward a world free from waterborne diseases. To support collaboration and progress in this area, we’ve made all project resources publicly available for anyone interested in exploring or building upon our work.




