The purpose of this article is to summarize the predictive modeling process, from exploring the data to deploying the model. This includes the thought process behind decisions made along the way, whether selecting the features that improve prediction accuracy or engineering features to extract more detail from the data. Hopefully, this document will serve as a one-stop walk-through for beginners and experienced engineers alike who want a bird's-eye view of the ML process and data life cycle.
Choose the dataset
We worked on predicting the age of abalone using the Abalone dataset.
Why is it important to predict the age of an abalone? Abalone is considered a delicacy in Southeast Asia, and consumers are willing to pay a premium for high-quality abalone. The rarity of this seafood and its laborious harvesting process make it expensive. The economic value of an abalone is positively correlated with its age and other physical characteristics. This information helps farmers and distributors determine the market price. One can determine the age of an abalone by counting the number of layers in its shell. This involves cutting a sample of the shell, staining it, and counting the rings through a microscope, which is a tedious process.
Both classification and regression in machine learning are prediction processes. Classification predicts which group an observation belongs to, and regression predicts a numerical outcome for the observation. Given a study that records both physical characteristics and age, one can learn to predict the age of an abalone from its physical characteristics alone.
Which type of prediction is appropriate? Our outcome of interest is “age”, which is a continuous variable. We need to consider whether we have sufficiently detailed data to predict age. If not, should we predict a few age classes instead? Since the model will be used to put a dollar amount on an abalone based on its age, the granularity of age matters. In this case, it makes sense to use a regression model rather than a classification model.
The data used in our case is from the UCI (University of California Irvine) Machine Learning Repository posted on Kaggle.
In this article, we will explore every step that is involved in the Data Science lifecycle.
We will start by loading the data from Amazon AWS. After that, we will work on understanding the data and engineering it as required. We will then train a model to achieve our desired outcome. We will wrap up by talking about the process of model validation and deployment.
Data: Extract, Transform, Load
We placed the dataset in an Amazon S3 bucket to illustrate accessing data from the cloud. Boto3 is the AWS SDK for Python and provides interfaces to AWS services, including Amazon S3. Using boto3, it is easy to integrate your Python application, library, or script with these services.
To install boto3, run the following (we ran it on Google Colab, but you can run it in your local Python environment):
> pip install boto3
You will need an AWS access key ID, a secret access key, and a region name to access the data in your S3 bucket.
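As a minimal sketch of this step, the snippet below reads a CSV object from S3 into a pandas DataFrame. The helper names (make_s3_client, load_csv_from_s3), the bucket name, and the object key are hypothetical and only for illustration; the article does not state the names actually used.

```python
import io

import pandas as pd


def make_s3_client(access_key_id, secret_access_key, region_name):
    """Create an S3 client from explicit credentials (hypothetical helper)."""
    import boto3  # imported here so the loader below can be tried without boto3

    return boto3.client(
        "s3",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name=region_name,
    )


def load_csv_from_s3(s3_client, bucket, key):
    """Download one CSV object from S3 and parse it into a DataFrame."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(response["Body"].read()))


# Usage with placeholder values (replace with your own credentials and names):
# client = make_s3_client("YOUR_KEY_ID", "YOUR_SECRET_KEY", "us-east-1")
# df = load_csv_from_s3(client, "my-abalone-bucket", "abalone.csv")
```

Passing the client in as an argument keeps the credentials out of the loading function and makes it easy to try the loader with a stand-in client.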
Diving into the data
We start by looking at the shape, or dimensions, of the data frame.
df.shape returns the (rows, columns) of your dataframe.
One of the most common data pre-processing steps is to check for null values, mislabeled values, or 0 values in the dataset. Empty or NaN values can be imputed with the mean, median, mode, or another value relevant to the data at hand. For a time-series dataset, missing values can be replaced with the value before or after them. Zero values need to be investigated further in the context of the data: were they mislabeled, or are they intentional? In the Abalone dataset there are a couple of zeros in the ‘Height’ column, which seem to be mislabeled.
If the number of missing or unwanted values in a row or column is statistically insignificant, or won’t make a difference to the analysis in question, the affected rows can be dropped. We can drop a row or column with missing values using the dropna() function.
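The checks described above might look like the following sketch. The tiny data frame is made up for illustration (it only borrows Abalone-style column names), so the counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with Abalone-style columns (not real measurements).
df = pd.DataFrame({
    "Length": [0.455, 0.350, np.nan, 0.440, 0.430],
    "Height": [0.095, 0.090, 0.135, 0.000, 0.125],
    "Rings": [15, 7, 9, 10, 8],
})

print(df.shape)                   # (rows, columns)
null_counts = df.isnull().sum()   # number of NaN values per column
zero_counts = (df == 0).sum()     # number of suspicious zeros per column

# Option A: impute the missing Length with the column median.
imputed = df.fillna({"Length": df["Length"].median()})

# Option B: drop rows with missing values, then rows with a zero Height.
cleaned = df.dropna()
cleaned = cleaned[cleaned["Height"] > 0]
```

Whether to impute or drop depends on how many rows are affected and on what the zeros mean in context.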
Is it useful to keep all the features? And how do we understand what value each feature provides in predicting our outcome, independent of the other features?
By computing the correlation, one can assess how strongly the variables are related and then drop or combine correlated features. ‘Whole weight’ in the dataset is highly correlated with the ‘Shucked’, ‘Viscera’, and ‘Shell’ weights, and can be dropped.
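To illustrate the idea, here is a sketch on synthetic weight columns in which the whole weight is, by construction, roughly the sum of the other three. The numbers are generated, not taken from the Abalone data, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic weights: viscera and shell scale with shucked weight, plus noise.
shucked = rng.uniform(0.05, 0.6, n)
viscera = 0.5 * shucked + rng.normal(0, 0.02, n)
shell = 0.7 * shucked + rng.normal(0, 0.03, n)
whole = shucked + viscera + shell + rng.normal(0, 0.02, n)

df = pd.DataFrame({
    "Whole weight": whole,
    "Shucked weight": shucked,
    "Viscera weight": viscera,
    "Shell weight": shell,
})

corr = df.corr()  # Pearson correlation by default
high = corr["Whole weight"].drop("Whole weight")
print(high)

# If every other weight is strongly correlated with the whole weight,
# the whole weight adds little new information and can be dropped.
threshold = 0.9  # assumed cut-off, tune for your data
if (high.abs() > threshold).all():
    df_reduced = df.drop(columns=["Whole weight"])
else:
    df_reduced = df
```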
Training models on small subsets of features can also reveal how important each feature is to the prediction. Compute resources and the cost of maintaining data and features should also be kept in mind when choosing the number of features.
The dataset has 9 columns. Eight are features: sex, length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight. The ninth, ‘Rings’, is our target variable for prediction, since the number of rings in an abalone is directly related to its age.
Descriptive statistics (obtained with the df.describe() function) provide a quantitative summary of a dataset’s central tendency, dispersion, and distribution shape, and help in finding outliers. If zero values appear, as for the minimum of Height in our case, this requires further investigation into the data collection method and quality. Outliers can be real or they can be data entry errors. It is important to understand the source of the outliers before removing them. Height shows a maximum of 1.13, which lies far above the 75th percentile of the dataset.
A box plot, which is based on the interquartile range (IQR), is a useful visual for identifying outliers. The Height of the abalone shows two outliers that lie far beyond the 75th percentile (the third quartile). Since there are only 2 such values, we can remove them from the dataset. Other techniques, such as quantile-based flooring and capping, can be used to deal with a larger number of outliers.
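The rule a box plot uses to flag outliers can be written out directly. The sketch below applies the common 1.5 × IQR rule to a short, made-up list of heights (only the 1.13 maximum echoes the text), and then shows quantile-based flooring and capping as the alternative; the 5th and 95th percentiles are assumed cut-offs.

```python
import pandas as pd

# Made-up heights; the last two values play the role of the outliers.
heights = pd.Series(
    [0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.515, 1.13],
    name="Height",
)

q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged, as in a box plot.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (heights < lower) | (heights > upper)

# Option A: remove the few flagged rows.
trimmed = heights[~is_outlier]

# Option B: quantile-based flooring and capping (keeps every row).
capped = heights.clip(
    lower=heights.quantile(0.05),
    upper=heights.quantile(0.95),
)
```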
For categorical features, a useful Exploratory Data Analysis (EDA) step is the frequency distribution of categories within the feature, which can be computed with .value_counts(). The proportions of Males, Females, and Infants are not too different: Males are about 37%, and Females and Infants about 32% each.
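A short sketch of this step, using a made-up ‘Sex’ column whose counts are chosen only to resemble the proportions quoted above:

```python
import pandas as pd

# Made-up category counts, roughly mirroring the proportions in the text.
sex = pd.Series(["M"] * 37 + ["F"] * 31 + ["I"] * 32, name="Sex")

counts = sex.value_counts()                 # absolute frequency per category
shares = sex.value_counts(normalize=True)   # relative frequency per category
print(shares)
```

Passing normalize=True returns fractions instead of counts, which makes it easier to compare category sizes at a glance.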
To gain deeper insights, it helps to look at feature-feature relationships and feature-target relationships. When comparing the length of the abalone for Males, Females, and Infants, the Infant group has a lower mean length, as expected.
When the feature ‘Sex’ was plotted against the target variable ‘Rings’, we observed that Infants have a maximum of 21 Rings. The standard deviations of Rings for the three sex categories are about the same. Infants can therefore be considered part of the same population as adults for model predictions.
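The same comparisons can be made numerically with a group-by. This is a sketch on a handful of made-up rows, so the values are illustrative and not those of the real dataset.

```python
import pandas as pd

# Made-up rows with Abalone-style columns.
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "F", "F", "F", "I", "I", "I"],
    "Length": [0.55, 0.60, 0.58, 0.57, 0.62, 0.59, 0.40, 0.45, 0.42],
    "Rings": [10, 12, 11, 11, 13, 10, 6, 8, 7],
})

# Feature-feature relationship: mean Length per Sex category.
mean_length = df.groupby("Sex")["Length"].mean()

# Feature-target relationship: spread and maximum of Rings per Sex category.
rings_stats = df.groupby("Sex")["Rings"].agg(["mean", "std", "max"])
print(mean_length)
print(rings_stats)
```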
Transformation is a change in form or appearance. Features are transformed for various reasons.
Features with different data types such as text or categorical are transformed so that they can be understood as numerical model input.
The input features may differ greatly in their ranges, or they may be measured in different units, which can cause problems when training some models, e.g., K-means. The features can be transformed to a comparable scale, e.g., by subtracting the mean and then scaling to unit variance, by taking the log or the exponential of a quantity, or by a combination of these. Having a large number of features may be costly in terms of computing resources and maintenance. It is also easier to plot and visualize with fewer features (3D or less). Dimension reduction techniques such as PCA are useful in such instances.
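A minimal sketch of these transformations with scikit-learn follows. The three synthetic columns are assumptions chosen to have very different scales; they do not come from the Abalone data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic features on very different scales.
X = np.column_stack([
    rng.normal(0.5, 0.1, 100),    # small values
    rng.normal(100, 20, 100),     # large values
    rng.lognormal(0, 1, 100),     # right-skewed values
])

# Standardize: subtract the mean and scale to unit variance, per column.
X_scaled = StandardScaler().fit_transform(X)

# Log transform to compress a right-skewed column (log1p computes log(1 + x)).
skewed_log = np.log1p(X[:, 2])

# Reduce the standardized features to 2 components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```

Scaling before PCA matters here: without it, the column with the largest raw variance would dominate the components.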
If features highly correlate with one another, it may be possible to find a combination of variables to reduce the number of features.
Let’s say we want to relate the measured dimensions (height, length, and diameter) to the weight of the abalone. The volume of an abalone is roughly proportional to length times diameter times height. This approximate volume should correlate with the weight: in a plot of weight (y) against volume (x), the slope would be proportional to the density.
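As a sketch of this engineered feature, the snippet below builds a ‘Volume’ column and fits a straight line of weight against it. The data is synthetic, and the slope of 4.0 used to generate the weights is a made-up constant standing in for density times the proportionality factor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100

# Synthetic dimensions (not real Abalone measurements).
df = pd.DataFrame({
    "Length": rng.uniform(0.3, 0.7, n),
    "Diameter": rng.uniform(0.2, 0.55, n),
    "Height": rng.uniform(0.05, 0.2, n),
})

# Engineered feature: approximate volume as the product of the dimensions.
df["Volume"] = df["Length"] * df["Diameter"] * df["Height"]

# Generate weights with a made-up slope plus a little noise.
assumed_slope = 4.0
df["Whole weight"] = assumed_slope * df["Volume"] + rng.normal(0, 0.005, n)

# Fit weight = slope * volume + intercept and check the correlation.
slope, intercept = np.polyfit(df["Volume"], df["Whole weight"], deg=1)
r = df["Volume"].corr(df["Whole weight"])
print(slope, intercept, r)
```

One combined feature that tracks the weight closely can stand in for three separate dimension columns.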
Dealing with Imbalanced Data
From the histogram of the target variable ‘Rings’ (skew = 1.113), we can see that the distribution is right-skewed. More than 91% of the data lie below 14 Rings, and the remaining values form the long tail of the distribution.
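Both numbers quoted above can be computed in one line each. The sketch below uses a short, made-up list of ring counts, so its skew and share will differ from the real dataset's values.

```python
import pandas as pd

# Made-up ring counts with a long right tail.
rings = pd.Series(
    [5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11, 12, 13, 15, 18, 22, 27],
    name="Rings",
)

skew = rings.skew()                    # > 0 indicates a right skew
share_below_14 = (rings < 14).mean()   # fraction of rows with fewer than 14 rings
print(skew, share_below_14)
```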
There are various techniques for handling imbalanced data, some of which are noted below.
- Upsampling the minority class.
- Downsampling the majority class.
- Resampling: uses a mix of upsampling and downsampling.
- Synthetic Minority Over-sampling Technique (SMOTE): the minority class is oversampled by generating synthetic data using a similarity measure of features.
- Cost-sensitive training: uses costs as a penalty for misclassification when the learning algorithm is trained. The cost of misclassification is added to the error or used to weight the error during training. For example, in Logistic Regression the class_weight parameter provides the cost-sensitive extension.
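Two of the techniques listed above, upsampling and cost-sensitive training, can be sketched with scikit-learn as follows. They apply to a classification setting (for example, if ages were binned into classes); the two-class data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 90 majority rows (class 0), 10 minority rows (class 1).
X_major = rng.normal(0, 1, (90, 2))
X_minor = rng.normal(2, 1, (10, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 90 + [1] * 10)

# Upsampling: draw minority rows with replacement until the classes match.
X_minor_up = resample(
    X_minor, replace=True, n_samples=len(X_major), random_state=0
)
X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))

# Cost-sensitive training: weight errors inversely to class frequency.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)
predictions = clf.predict(X)
```

Upsampling changes the data and leaves the model alone, while class_weight leaves the data alone and changes the loss; either should be applied to the training split only.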
Choosing the right optimization objective has a big impact on the class imbalance problem. For example, a decision tree trained to optimize accuracy will handle class imbalance much worse than one that optimizes the F1 score or AUC. One option is a Random Forest with a weighted Gini (or entropy) criterion that takes the class imbalance into account. Another is to balance the classes with a mixture of upsampling and downsampling when bagging decision trees.
Now that we have taken a look at the data through visualization, let’s talk about selecting the features.
Feature selection is the process of selecting the best features for predictive modeling, usually by reducing the number of features. Some datasets have many features, some of which can hurt prediction. By keeping the features that contribute most and dropping those that do not, we can try to get better accuracy.
The Abalone dataset has numerical input and numerical output. The input variables are numerical (float and int), and the output variable is also numerical. For numerical input and numerical output, we can do feature selection using Pearson’s correlation. The .corr() method uses Pearson’s correlation by default. There are also the Kendall and Spearman correlation coefficients, which are out of scope for this article.
Pearson’s correlation coefficient: a Pearson correlation is a number between -1 and +1 that indicates the extent to which two variables are linearly related. It represents the relationship between two variables measured on the same interval or ratio scale, and it measures the strength of the association between two continuous variables.
There are other ways to perform feature selection, and the appropriate method depends on whether the variables are numerical or categorical.
Correlation gives the statistical relationship between each input variable and the target variable (our target variable is Rings). Values closer to +1 signify that the target value increases as the input variable increases, and decreases as it decreases. Values closer to -1 indicate that the target value decreases as the input variable increases, and vice versa.
From the above image, we see that Sex has a very low negative correlation with the target, so it looks like Sex will not have much value in predicting the output variable. To get a better correlation, the Sex column is dropped and an IsAdult column derived from it is created. The correlation with the new variable IsAdult is shown below.
Now, after adding the IsAdult column and dropping the ‘Sex’ column, we see a better correlation with the target variable ‘Rings’. With this, we can conclude that we have a good set of features to help with our model training. By good, we mean that no correlation coefficient is too close to +1 or -1, which could skew the training.
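The feature swap described above can be sketched as follows. The exact definition of IsAdult is not given in the text, so this assumes it is 1 for Males and Females and 0 for Infants; the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; "I" marks infants, as in the Abalone 'Sex' column.
df = pd.DataFrame({
    "Sex": ["M", "F", "I", "I", "M", "F", "I", "M"],
    "Rings": [10, 12, 5, 6, 11, 9, 7, 13],
})

# Assumption: IsAdult is 1 for M and F, 0 for I.
df["IsAdult"] = (df["Sex"] != "I").astype(int)
df = df.drop(columns=["Sex"])

# Pearson correlation of every remaining column with the target.
corr_with_rings = df.corr()["Rings"]
print(corr_with_rings)
```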
Training the model
To train the model, let’s first split the data using train_test_split from sklearn.model_selection.
To find out which of many algorithms gives the best prediction, we fit several regressors and compare their cross-validation scores. The scoring metric is the negative mean squared error, so the algorithm with the highest score (the one closest to zero) is the best among those chosen.
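The split and the model comparison might look like the sketch below. The three regressors are illustrative choices, since the text does not list the ones actually compared, and the data is synthetic rather than the Abalone features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the features and 'Rings'.
X = rng.uniform(0, 1, (200, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 200)

# Hold out 20% of the rows for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models (illustrative selection).
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation on the training split only.
scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(
        model, X_train, y_train, scoring="neg_mean_squared_error", cv=5
    )
    scores[name] = cv_scores.mean()

# Scores are negative MSE, so the highest (closest to zero) is best.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```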
From the above image of the scores from the various algorithms, we know that the Random Forest Regressor will give us good results. So we now fine-tune its hyperparameters using GridSearchCV. GridSearchCV searches over the parameter values we specify and returns the best parameters and the best estimator. This saves us from tuning manually, which would take a lot of time.
GridSearchCV gives us the best estimator, which we can now use to predict the test values.
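A minimal sketch of the tuning step follows. The parameter grid is an assumption chosen to keep the search small, not the grid used for the article's results, and the data is again synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the features and 'Rings'.
X = rng.uniform(0, 1, (200, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid: every combination is tried with 3-fold cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
best_model = search.best_estimator_   # refit on the full training split
y_pred = best_model.predict(X_test)
```

By default GridSearchCV refits the best configuration on the whole training split, so best_estimator_ is ready to predict.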
It is time to evaluate the model.
We have now trained the model, and we need to find out whether it generalizes well to unseen data. We need to know whether the model works and whether we can trust its predictions. For this, we need to evaluate the model.
There are many ways to evaluate a model. We use the Root Mean Square Error (RMSE) to measure accuracy. A squared-error measure is a natural fit for a regression problem: it penalizes large errors more heavily, and its square root is expressed in the same units as the target (here, Rings).
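As a small worked example, here is the RMSE for four made-up predictions. The errors are -1, 0, -1, and 3, so the squared errors sum to 11, the MSE is 11 / 4 = 2.75, and the RMSE is its square root.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ring counts.
y_true = np.array([9, 10, 7, 15])
y_pred = np.array([10, 10, 8, 12])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors = 2.75
rmse = np.sqrt(mse)                       # back in units of Rings, about 1.66
print(mse, rmse)
```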
Model Validation is the process of checking the behavior of a model before it is used in production. Different organizations and Data Scientists might employ different methods to validate the model that they have built. The steps involved will vary across both horizontals and verticals to ensure that the model performs as expected and suits the specific needs of the team in question. Some of the common steps that are involved during the validation are as follows:
1. Conceptual Design: In this phase, the underlying principle behind the model development is questioned and verified. For a financial institution, this step may involve ensuring that the granting of loans is not subject to racial bias, while for Human Resources tools, it may involve ensuring that the model does not favor men over women.
2. Data Quality Assessment: In this phase, the quality of the data going into the model as training data is checked and analyzed. This will normally involve the following steps:
- Outlier Detection
- Data Imbalance
- Noise Removal
- Data Integrity Checks
- Diversity Checks (checking whether people from all parts of society are represented equally)
3. Model Performance Assessment: This step deals with evaluating the performance of the model by performing various checks. The model might be run again on specific sets of data, and the results it outputs will be compared and analyzed. If the model produces inconsistent results, it won’t go into production and may be sent back so that the issues, if any, can be investigated further.
4. Infrastructure Assessment: This step deals with evaluating the infrastructure that the model will be deployed to. The infrastructure will be evaluated differently based on the business needs and the type of the model. For the uninitiated, there are primarily 2 types of models: online models and batch models. Online models serve real-time requests and provide predictions in real time, while batch models make predictions for a whole ‘batch’ in one go. For batch models, there might be scheduled runs where the data is aggregated and fed into the model in one go.
- Online models: For these models, the infrastructure should be available at all times to serve requests. The latency should be kept to a bare minimum, and any downtime could cause complications from a business standpoint. Hence, for such models, there will usually be back-ups that can take over at any time to reduce downtime.
- Batch models: For these models, the infrastructure should have enough computing power to make predictions in a timely manner. If the infrastructure doesn’t have enough computing power to handle the volume of data that will be fed into it, then either the infrastructure should be upgraded or the model will have to be optimized further to fit the constraints of the infrastructure.
As can be seen, model validation is a series of steps that involve checks from both the business side and the technical side. On the technical side, care should always be taken to ensure that the model is validated only on data that was not used to train it, i.e., held-out data rather than the training data.
Model validation is not a one-time process; it is a series of steps that should be repeated whenever modifications are made to a model. The above steps will vary from organization to organization to adapt to individual needs, but these are the most common steps that need to be included in any pipeline for validating a model.
Software products are deployed to a server when the development process is complete. Similarly, models, once completed and validated, also need to be deployed. While most organizations use cloud services like AWS, GCP, or Azure, some organizations have their own in-house deployment space. There are also certain end-to-end tools that let you develop the model inside the tool and later provide solutions for its deployment.
Once the model is deployed, it is important to keep an eye on its performance. Metrics such as request latency need to be computed continually, along with ITV (In-Time Validation) and OTV (Out-of-Time Validation), to confirm that the model performs as expected at all times.
Another important consideration is dataset shift. Dataset shift occurs when the training and test sets are drawn from different distributions. It is a common occurrence for productionized models, as there could be subtle changes in the characteristics of the data being fed into the model compared to the data that was used to train it.
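One simple way to watch for dataset shift on a single numerical feature is a two-sample Kolmogorov-Smirnov test between the training values and recent production values. This is only one possible check and an assumption on our part, not a method prescribed above; the has_shifted helper, the significance level, and the synthetic values are all hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic 'Height'-like values: training data and shifted production data.
train_values = rng.normal(0.14, 0.04, 500)
live_values = rng.normal(0.18, 0.04, 500)


def has_shifted(reference, current, alpha=0.01):
    """Flag a shift when the KS test rejects 'same distribution' at level alpha."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha)


print(has_shifted(train_values, live_values))
```

In practice a check like this would be run per feature on a schedule, and a flagged shift would prompt a closer look or retraining.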
In this article, we have tried to cover the prediction pipeline, from data evaluation, data visualization, and feature selection through to deployment. Every step in the life cycle is important and none of them can be ignored. If the data is not properly cleaned, the model won’t perform up to expectations. The model needs to be tuned to achieve the best possible predictions. After the model has been developed, it is equally important to validate its performance and then push it to production while providing adequate infrastructure capacity. No matter how carefully the different phases of the life cycle have been worked on, if a single step is skipped or rushed, it could spoil the entire effort.
We predicted the age of abalone with a regression model, since the granularity of age has business value when pricing abalone.
In the ETL pipeline, we used various plots, such as box plots, histograms, and a correlation matrix, to visualize the data and study the features, their relationships, and their balance. Some methods to transform features and handle data imbalance are noted, though not explored. It is critical to understand the data’s limitations (e.g., imbalance) when choosing the right metric to check model performance.
Then we moved on to feature selection, which helps identify the features that give good accuracy; we used Pearson’s correlation. We then trained a Random Forest Regressor after testing several different algorithms, and evaluated it using the root mean squared error. Finally, we discussed model validation and model deployment.