Analyze Pipe-Borne Water Availability in Lagos, Nigeria using Machine Learning
November 17, 2022
Introduction/Abstract
The lack of a quality water supply has long been a severe problem in Lagos, Nigeria. Although the state has had numerous major and minor operational waterworks for over a century, recent studies indicate that more than 80% of Lagosians lack access to public water supplies. Rural areas suffer this deficiency most acutely, since low incomes and weak economic growth leave them unable to afford the private organizations that provide these services.
The Omdena Lagos, Nigeria team aimed to address this dilemma by identifying low-, middle-, and high-income communities and their respective access to water supply, using a range of data science and analysis tools alongside machine learning algorithms. The project also raises awareness of the communities that need the most attention when the government or NGOs decide to act on this critical issue.
Our output consists of an interactive dashboard presenting essential information on water availability, machine learning models that classify water demand and supply, and a map of Lagos, Nigeria showing water sites based on the data obtained. We hope this article educates readers on the issue and paves the way for further discussion of possible solutions.
Data collection
Written by: Olamide Goriola
Data collection is the first and most crucial step in any project, because the quality of the output depends entirely on the quality of the input data.
We believe that this project can pave the way for critical findings and for gathering the vital data needed to address this issue. One major challenge was the availability of the required data: the project scope had to be re-evaluated at every step depending on the data we could obtain.
The team held several discussions to determine what data was required and to clearly define the goal of the research: locate the various water sources in Lagos, Nigeria; correlate them with the population, income classification (low/middle/high), settlement classification (rural/semi-urban/urban), and geographical coordinates of each local government; calculate the water supply-to-demand ratio; and highlight the geographical areas that suffer most from a lack of water supply.
To accomplish this, we began looking for information on the various water sources scattered across Lagos. We obtained a public dataset (Fig 1.0) containing waterworks information for all states in Nigeria, with variables such as source, longitude, latitude, water technology, facility type, installation year, management, and status, from which we extracted the data for Lagos State.
After retrieving the waterworks data, we needed other kinds of information to achieve our goal. We obtained data on the population, income, and settlement classification of the local governments in Lagos, Nigeria, shown in Figs 1.1 and 1.2 below.
This valuable data that we managed to gather motivated the team to move on to the next stage.
Data Pre-processing
Written by: Rashi Shankar
Data preprocessing is a data mining technique that converts raw data into a useful and efficient format. After gathering the data, our team moved on to the next step, data pre-processing. During the data collection process, we sketched out and finalized four datasets. Our team then worked on the datasets and captured the preliminary information necessary for our project.
Our first step was to filter the dataset down to Lagos and eliminate all the blank entries. The figure below shows the data before processing.
Columns with repeated values were removed. Because the problem statement was about water, we included a population column based on the gathered data and merged the average-income details from the Excel file with the waterworks data to gain more valuable and accurate insights. The final dataset had 29 columns with 1,506 entries in each.
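A minimal sketch of this step with pandas is shown below. The file names, and column headers such as 'adm1' for the state, are assumptions; the real dataset may use different labels.

```python
import pandas as pd

# Load the national waterworks dataset and keep only the Lagos records.
waterworks = pd.read_csv("nigeria_waterworks.csv")           # assumed file name
lagos = waterworks[waterworks["adm1"] == "Lagos"].copy()

# Drop fully blank rows and strip stray whitespace from text columns.
lagos = lagos.dropna(how="all")
text_cols = lagos.select_dtypes(include="object").columns
lagos[text_cols] = lagos[text_cols].apply(lambda s: s.str.strip())

# Merge the population and average-income details per local government (adm2).
demographics = pd.read_excel("lagos_lga_demographics.xlsx")  # assumed file name
lagos = lagos.merge(demographics, on="adm2", how="left")
```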
Feature Engineering
Written by: Dawood Wasif & Lakshmi Madhurya Bolla
The preprocessed data contained a total of 29 columns, each with 1,506 entries. The feature engineering task was divided into eight steps, each with a specific purpose. The tasks were subdivided as follows:
1. Imputation
Imputation refers to replacing missing data (null values) with substituted values. Seven of the 29 columns had missing values, amounting to 1,514 nulls in total. The distribution of missing values can be observed in Fig 3.1.
As Fig 3.1 also shows, all the columns with missing values contained categorical data. The imputation method selected was therefore the mode, which replaces each missing value with the most frequently occurring value in its column. After imputation, the dataset was devoid of null values.
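A sketch of the mode imputation, continuing with the Lagos dataframe from the previous step:

```python
# Replace nulls in each affected column with that column's most frequent value.
for col in lagos.columns[lagos.isnull().any()]:
    lagos[col] = lagos[col].fillna(lagos[col].mode()[0])

# Sanity check: the dataset should now be free of null values.
assert lagos.isnull().sum().sum() == 0
```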
2. Feature Transformations
Feature transformation applies a mathematical formula to a specific column/feature to make its values more understandable and useful for the model. Three columns/features were transformed, as described below:
a. Converted report date to the number of days passed:
The raw report date is not meaningful to the model, so it was converted to the number of days passed by subtracting the report date from the current date.
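Assuming the column is named 'report_date', the conversion could look like this:

```python
import pandas as pd

# Parse the report date and express it as days elapsed up to today.
lagos["report_date"] = pd.to_datetime(lagos["report_date"])
lagos["days_passed"] = (pd.Timestamp.today() - lagos["report_date"]).dt.days
```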
b. Fix spelling mistakes in water tech
The water tech column contained spelling mistakes from manual data entry; for example, 'pipe borne water' was spelled 'pipe boren water' or 'pipe pone water'. Similar mistakes were fixed for the other values, too.
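A sketch of the normalization, using a replacement mapping that can be extended as more misspelled variants are found:

```python
# Map known misspellings back to the canonical value.
spelling_fixes = {
    "pipe boren water": "pipe borne water",
    "pipe pone water": "pipe borne water",
}
lagos["water_tech"] = (
    lagos["water_tech"].str.lower().str.strip().replace(spelling_fixes)
)
```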
c. Converted latitude and longitude to distance and angle (polar coordinates)
For some models, capturing the target variable's dependency on geographical coordinates directly from Cartesian coordinates (latitude and longitude) can require an overly complex model. A common approach is to transform the coordinates into polar coordinates (a distance and an angle) and add them as new features. That way, a tree needs fewer splits to model the spatial dependency between samples and trains more efficiently.
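A sketch of the transformation; taking the dataset centroid as the reference origin is our assumption:

```python
import numpy as np

# Offsets of each water point from the centroid of all points.
lat0, lon0 = lagos["latitude"].mean(), lagos["longitude"].mean()
dy = lagos["latitude"] - lat0
dx = lagos["longitude"] - lon0

lagos["dist"] = np.sqrt(dx**2 + dy**2)  # radial distance from the origin
lagos["angle"] = np.arctan2(dy, dx)     # bearing in radians
```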
3. One hot encoding
Many machine learning algorithms cannot operate on label data directly and require all input and output variables to be numeric. This is primarily a constraint of their efficient implementation rather than a rigid limitation of the algorithms themselves, which means that categorical data must be converted to numerical form. For this purpose, all categorical columns, including the target variable (water demand), were encoded as numeric values: each column's categorical values were stored as a list in a dictionary for future use, and each value's list index was assigned as its numeric label.
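A sketch of the dictionary-and-index encoding described above; the '_one-hot' suffix follows the column names that appear later in the feature list:

```python
# Keep each column's unique values in a dictionary so labels can be decoded
# later; the position of a value in the list serves as its numeric label.
encodings = {}
for col in lagos.select_dtypes(include="object").columns:
    values = lagos[col].unique().tolist()
    encodings[col] = values
    lagos[col + "_one-hot"] = lagos[col].map(values.index)
```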
4. Outlier detection & correction
Outliers are data points that lie far from the bulk of the data. They can skew the distribution of the data, which in turn causes inaccuracy when fitting models and predicting outcomes. Outliers can be identified by visualizing the numerical columns of the data frame with box plots. A box plot summarizes five statistics in a single plot: the minimum, first quartile, median, third quartile, and maximum. The outliers are the points that fall outside the box-and-whisker plot, either below the lower whisker (Q1 − 1.5 × IQR) or above the upper whisker (Q3 + 1.5 × IQR).
a. Detected numerical outliers:
Outliers in the numerical data were identified by plotting box plots with the Seaborn visualization library.
b. Adjusted outliers using the interquartile range, i.e., clipping outliers back into the interquartile fences:
The identified outliers were replaced with the boundary values of the interquartile fences, and the box plots were re-drawn with Seaborn to confirm that the outliers had been removed, as shown in Fig 3.3 below:
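A sketch of both sub-steps with Seaborn and the standard 1.5 × IQR fences:

```python
import seaborn as sns
import matplotlib.pyplot as plt

num_cols = lagos.select_dtypes(include="number").columns

# a. Visualize outliers with box plots.
sns.boxplot(data=lagos[num_cols])
plt.xticks(rotation=90)
plt.show()

# b. Clip values outside the whiskers back into the interquartile fences.
for col in num_cols:
    q1, q3 = lagos[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lagos[col] = lagos[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```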
5. Log Transform
Log transformation is applied mainly to correct skewed distributions. The logarithm naturally compresses the dynamic range of a variable, so differences are preserved while the scale is no longer so dramatically skewed. We therefore applied a log transform to all numerical columns to remove skewness from the data.
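A sketch with NumPy's log1p, which also copes with zero values; the column list is an assumption based on the feature names used later (columns containing negative values, such as the angle, would first need shifting into a non-negative range):

```python
import numpy as np

for col in ["Population", "days_passed", "dist"]:  # assumed column names
    lagos["log_" + col] = np.log1p(lagos[col])
```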
6. Scaling
Scaling transforms the data in each column to a standard scale so that the model predicts accurately rather than giving undue importance to columns with larger values, which can cause it to underfit or overfit.
a. Applied standardization by calculating z-scores using each column's mean and variance:
Standardization was implemented first, but the resulting values included negatives, which made them unusable for selecting the top 10 features with the chi-square test, since chi-square requires non-negative inputs.
b. Switched to the min-max scaler to avoid negative values:
The min-max scaler transformed the values of all columns into the range between 0 and 1, making them suitable for the machine learning model that estimates water demand.
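A sketch with scikit-learn's MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every numeric column into [0, 1]; unlike z-scores, this also
# satisfies the non-negativity requirement of the chi-square test used later.
num_cols = lagos.select_dtypes(include="number").columns
lagos[num_cols] = MinMaxScaler().fit_transform(lagos[num_cols])
```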
7. Feature Selection
Feature selection is related to dimensionality reduction, but its purpose is to identify the features most correlated with the target variable y.
In this process, the data frame is first split into the independent variables X, containing all the independent features, and the target variable y, i.e., the water_demand column. Using a variance threshold, we reduced the dimension of X from shape (1506, 23) to (1506, 21).
a. Selected the top 10 features with the SelectKBest algorithm by calculating chi-square values:
The independent variables X contained 21 columns, not all of which were directly related to the target. Several columns had little or no impact on the target variable, and training on all 21 would have overfitted the model and degraded its predictions. We therefore kept only the top 10 features for the prediction model, using the SelectKBest function with the chi-square (chi2) score between X and y to identify the columns most predictive of the target variable.
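A sketch of the split and selection, assuming the target column is named water_demand:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

# Split into independent variables X and target y.
X = lagos.drop(columns=["water_demand"])
y = lagos["water_demand"]

# Drop near-constant columns first (23 -> 21 in the team's run).
vt = VarianceThreshold(threshold=0.01)
X_vt = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])

# Keep the 10 columns with the highest chi-square scores against the target.
selector = SelectKBest(score_func=chi2, k=10)
X_top10 = selector.fit_transform(X_vt, y)
print(X_vt.columns[selector.get_support()].tolist())
```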
8. Feature Reduction
In machine learning and statistics, dimensionality (feature) reduction is the process of reducing the number of random variables under consideration; it is also referred to as feature extraction. For this purpose, we applied a variance threshold of 0.01, dropping columns in which roughly 99% of the values were identical. Moreover, we used PCA and LDA to reduce and visualize the features, and fitted an unsupervised k-means clustering model to assess how effective the reduction was.
LDA is similar to PCA in that both look for linear combinations of the features that explain the data. Linear discriminant analysis is a supervised dimensionality reduction technique that performs classification of the data at the same time, whereas principal component analysis is an unsupervised technique that ignores the class labels. We visualized correlation maps for the reduced/extracted set of features.
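A sketch of the PCA/LDA comparison and the k-means check; choosing two components and three clusters (matching the High/Medium/Low classes) are our assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

# Unsupervised reduction: PCA ignores the class labels.
X_pca = PCA(n_components=2).fit_transform(X_vt)

# Supervised reduction: LDA uses the labels (at most n_classes - 1 components).
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_vt, y)

# Cluster the PCA projection to eyeball how well structure is preserved.
clusters = KMeans(n_clusters=3, random_state=42).fit_predict(X_pca)
```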
Data Visualization
Written by: Souvik Roy & Shail Raj Mishra
The visualization was done by GIS mapping, plotted with Mapbox. It displayed the top 10 parameters found in the feature engineering stage on an OpenStreetMap layer for area-by-area analysis. The first step was to import the libraries, read the data, and extract basic information such as the number of columns. The second step was data cleaning: some values were repeated in different forms, such as 'Pipe borne water' and 'pipe borne water', which produced two categories instead of one. The screenshot of the code to clean the data is attached below:
The next step was to use the Mapbox API available through Plotly; the user has to generate an API token on the Mapbox site to proceed with the interactive visualizations. The screenshot of the final visualization is attached below:
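A minimal sketch of the Plotly/Mapbox scatter plot; the token placeholder and the hover/legend columns are assumptions:

```python
import plotly.express as px

# The access token is generated from the Mapbox site.
px.set_mapbox_access_token("YOUR_MAPBOX_TOKEN")

fig = px.scatter_mapbox(
    lagos,
    lat="latitude",
    lon="longitude",
    color="water_tech",  # assumed legend column
    hover_name="adm2",   # local government shown on hover
    zoom=9,
)
fig.show()
```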
After gathering the top 10 features of the preprocessed dataset, the next stage of data visualization was a Power BI dashboard built on those columns, namely: population, income, settlement classification (urban/rural/semi-urban), water demand, status, water technology used, and water quality.
Some Insights that were gathered are as follows:
- Lagos has a population of about 21 million, and 80% of the total water supplied is of acceptable quality.
- The rural population of 4.4 million is largely low-income and is supplied mainly by public taps; 11.3% of its water is not of acceptable quality.
- The semi-urban population of 16.6 million is largely middle-income and is supplied mainly by motorized systems; 13.3% of its water is not of acceptable quality.
- The urban population of 16.6 million is largely high-income and is supplied mainly by motorized systems and pipe-borne water; 17.8% of its water is not of acceptable quality.
- Water deemed unacceptable because of bad color and taste accounts for about 13.1% of the total water supply in Lagos, Nigeria.
- Spring water is supplied to about 50 million people, of which 1.39% is bad because of taste and may cause disease.
- Alimosho, a semi-urban local government area, supplies 18.46% of its water through wells, approximately 16.27% of which is of acceptable quality.
- Around 750,000 people live around Mushin, which has the lowest water demand. It accounts for about 2% of the water supply, but notably, around 0.1% of that water is not acceptable.
- Around 600,000 people live around Aiyetoro, Agege, and Ikotun, which have the highest water demand, accounting for about 16.27% of the water supplied; 2.5% of that water is not acceptable because of taste.
Machine Learning/ Integration/Deployment
Written by: Ruqayyah Amzat
After a successful feature engineering phase, it was time to build a machine learning model. Some team members proposed a regression task predicting water demand per area. After much deliberation, the team decided on a classification task in which areas are classified into High, Medium, and Low water demand regions.
- The feature columns for the machine learning are the ‘Classification as urban/semi-urban/rural_one-hot’, ‘log_Population’, ‘water_source_category_one-hot’, ‘log_angle’, ‘status_id_one-hot’, ‘log_days_passed’, ‘adm2_one-hot’, ‘log_dist’, ‘pay_one-hot’, ‘water_tech_category_one-hot’.
- The target column is the priority class based on the water demand per area.
Eighty percent of the data was set aside for training the machine learning model.
Auto-sklearn is an AutoML framework whose machine learning pipeline includes data preprocessing, feature preprocessing, hyperparameter optimization, model selection, and evaluation.
We trained an auto-sklearn classification model because it leverages advances in optimization and ensemble construction and simplifies the pipeline for retraining; this left us with the best-performing algorithm.
Hence, here is the auto-sklearn modeling for the Lagos pipe-borne water project:
In the figure above, the training losses of the algorithms tried by the model are all low. A report on the algorithms can be viewed in the leaderboard, and the individual hyperparameters tried by auto-sklearn can be inspected with show_models and the ranking.
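A sketch of the training flow; the time budgets and random seed are illustrative assumptions:

```python
import pickle
import autosklearn.classification
from sklearn.model_selection import train_test_split

# 80% of the data is used for training, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_top10, y, train_size=0.8, random_state=42
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,  # total search budget in seconds (assumed)
    per_run_time_limit=60,        # per-candidate budget in seconds (assumed)
)
automl.fit(X_train, y_train)

print(automl.leaderboard())  # ranking of the candidate algorithms
print(automl.show_models())  # hyperparameters of the ensemble members

# Save the fitted model for deployment (file name assumed).
with open("water_demand_model.pkl", "wb") as f:
    pickle.dump(automl, f)
```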
The model was then tested on the test set and yielded an accuracy score of 1.0, with high precision and recall.
The model and its weights were saved as a pickle file to be used for future predictions when deployed.
The model is deployed with Gradio. The modeling results classify areas into Low, Medium, and High regions based on their water demand. A snapshot of the application is shown below:
The user can provide inputs and check whether the target area is classified as low, medium, or high demand.
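A minimal sketch of such a Gradio app around the pickled model; the four-input signature below is hypothetical (the deployed app exposes all ten features), as is the label-to-class mapping:

```python
import pickle
import gradio as gr

with open("water_demand_model.pkl", "rb") as f:  # assumed file name
    model = pickle.load(f)

def classify(population, dist, angle, days_passed):
    features = [[population, dist, angle, days_passed]]
    label = int(model.predict(features)[0])
    return ["Low", "Medium", "High"][label]  # assumed label order

demo = gr.Interface(
    fn=classify,
    inputs=[gr.Number(label=n) for n in ("population", "dist", "angle", "days_passed")],
    outputs=gr.Label(),
)
demo.launch()
```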
Demo
Conclusion
Finally, the team critically analyzed the available dataset and provided significant insights into this problem. We hope to conduct additional research soon, since the more research is done, the more insights and solutions to this problem will be discovered.
Find more information on the project page here.