Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.
Traditional credit scoring requires a bank account with regular transactions. Yet some groups of people, especially in developing nations, still do not have a bank account, for a variety of reasons: some do not see the need for one, some cannot produce the necessary documents, for some the cost of opening an account is too high, and others lack the knowledge or awareness needed to open one, do not trust banks, or are unemployed.
Some of these individuals may need loans for essentials, perhaps to start a business or, in the case of farmers, to buy fertilizer or seeds. Many of them may be reliable borrowers, but because they cannot access funding, they are pushed to take out high-cost loans from non-traditional, often predatory lenders.
Low-income individuals have an aptitude for managing their personal finances. We need an ethical AI credit scoring system to help these borrowers and keep them from falling deeper into debt.
Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.
The problem statement
The goal was to determine the creditworthiness of an unbanked customer using both alternative and traditional credit scoring data and methods. The data was focused on Indonesia, but the following approach is applicable in other countries.
It was a challenging project. I believe everyone should be eligible for a loan for essential business ventures, but they should be able to pay it back without having to pay exorbitant interest rates. Finding that balance was crucial for our project.
We were given three datasets:
1) Information on transactions made by different account numbers, the region, mode of transaction, etc.
2) Per capita income per area
3) Job title of the account numbers
All the data is privacy law compliant. All data given to us was anonymous, as privacy was imperative and not an afterthought.
Going through the data, we understood that we had to use unsupervised learning since the data was not labeled.
Some of us compared publicly available datasets with the dataset we had at hand, and others started working on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with the silhouette score, a heuristic for judging whether the chosen parameters produce meaningful clusters. The best value is 1, for well-separated clusters, and the worst is -1, where samples sit closer to a neighboring cluster than to their own; values near 0 indicate overlapping clusters. We got average values close to 0, and these results were not satisfactory to us.
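To make the scale of the silhouette score concrete, here is a minimal sketch using scikit-learn. The data is synthetic and the helper name `kmeans_silhouette` is ours for illustration; the project itself used other clustering setups on the Creedix data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_silhouette(X, n_clusters, seed=0):
    """Fit k-means and return the mean silhouette score of its labels."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X)
    return silhouette_score(X, labels)


# Synthetic, well-separated blobs: the score should be close to 1.
X_separated, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=0,
)

# Structureless uniform noise: any clusters found are arbitrary slices,
# so the score is noticeably lower.
X_noise = np.random.default_rng(0).uniform(size=(300, 2))

score_separated = kmeans_silhouette(X_separated, 3)
score_noise = kmeans_silhouette(X_noise, 3)
print(f"separated: {score_separated:.3f}, noise: {score_noise:.3f}")
```

A score near 0 on real data, as we observed, suggests the clusters overlap in the chosen feature space.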
With the given data we performed feature engineering. We calculated a per capita income score and separated management roles from other roles. The per capita income score lets us bucket accounts by area and identify those in areas likely to hold reliable customers. Separating management roles serves a similar purpose: for example, a management role suggests a better income with which to pay back a loan.
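As a rough illustration of these two features, the sketch below uses pandas on toy data. All column names, the keyword list, the min-max scaling, and the bucket boundaries are assumptions made for the example, not the definitions used in the project notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "job_title": ["Branch Manager", "Farmer", "Head of Sales", "Driver"],
    "region": ["A", "B", "A", "C"],
})
income = pd.DataFrame({
    "region": ["A", "B", "C"],
    "per_capita_income": [60.0, 20.0, 40.0],
})

# Assumed keywords that mark a managerial job title.
MANAGEMENT_KEYWORDS = ("manager", "head", "director", "supervisor")


def add_features(accounts, income):
    """Attach a management flag, an income score, and an income bucket."""
    out = accounts.merge(income, on="region", how="left")
    # Flag titles containing any management keyword (case-insensitive).
    pattern = "|".join(MANAGEMENT_KEYWORDS)
    out["is_management"] = out["job_title"].str.lower().str.contains(pattern)
    # Min-max scale per capita income to a score between 0 and 1.
    lo = income["per_capita_income"].min()
    hi = income["per_capita_income"].max()
    out["income_score"] = (out["per_capita_income"] - lo) / (hi - lo)
    # Place each account into a low/mid/high bucket by income score.
    out["income_bucket"] = pd.cut(
        out["income_score"],
        bins=[-0.01, 1 / 3, 2 / 3, 1.0],
        labels=["low", "mid", "high"],
    )
    return out


features = add_features(accounts, income)
print(features[["account_id", "is_management", "income_score", "income_bucket"]])
```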
But even with all the feature engineering, we were unable to get a signal from the data given for clustering. How did we proceed?
We scraped data online from sites like Indeed and Numbeo. From Numbeo we got the cost of living per area, that is, how much people spend on living; from Indeed we got salary data to assign an average salary to each job. Even so, given these challenges, we were not able to give the customer a single solution and had to improvise a plan for future analysis, so we used dummy data.
With the data scraped online and the features engineered from the given dataset, we tried to find out whether clustering algorithms could yield a prediction.
- Engineered Features & Clusters (from Datasets given)
- Machine Learning Pipelines/Toolkit (for Datasets not provided)
- Unsupervised Learning Pipeline
- Supervised Learning Pipeline (TPOT/auto-sklearn)
1. Engineered features & clusters
- Dataset (with new features): https://github.com/omdena/banking_unbanked/blob/feat-engineering/data/CIF_feat.csv
- Notebook (that generates the features): https://github.com/omdena/banking_unbanked/blob/feat-engineering/Feat_gen_per_customer.ipynb
As mentioned above, using the context we gathered from Creedix, we engineered or aggregated many features from the transaction time series dataset. Although these features describe each customer better, we can only guess the importance of each feature for a customer's credit score from our research. We therefore consolidated the features for each customer based on our research on credit scoring AI, and left the weighting of each feature for credit scoring in Indonesia to the Creedix team to decide, for example as the weights in a formula of the form:
CreditScore = 7*Salary + 0.5*Zakat + 4000*Feature1 + … + 5000*Feature6
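A weighted sum like this is straightforward to express in code. The sketch below is only an illustration: the function name, the feature keys, and the customer values are hypothetical, and the weights are the placeholder numbers from the formula above, not calibrated values.

```python
def credit_score(features, weights):
    """Return the weighted sum of the customer's features.

    Raises KeyError if a weighted feature is missing for the customer.
    """
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    return sum(weights[name] * features[name] for name in weights)


# Placeholder weights; real weights would be set by domain experts.
weights = {"salary": 7.0, "zakat": 0.5, "feature1": 4000.0}

# Hypothetical customer record.
customer = {"salary": 1000.0, "zakat": 200.0, "feature1": 1.0}

score = credit_score(customer, weights)
print(score)
```

Keeping the weights in a separate dictionary means the Creedix team could adjust them without touching the scoring logic.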
The solutions given to Creedix covered both supervised and unsupervised learning. Even after all the feature engineering and the data found online, we were still getting a low silhouette score, signifying overlapping clusters.
So we decided to provide solutions for supervised learning using AutoML and for unsupervised learning, both using dummy variables. The purpose was to support future analysis and modeling by the Creedix team.
The dataset we used for Supervised Learning — https://www.kaggle.com/c/GiveMeSomeCredit/data
With supervised learning, we did modeling with both TPOT and auto-sklearn. This was done so that when the Creedix team has more features available (features accessible to them but possibly not to Omdena collaborators) and target variables to use, they can use that information to build their models.
2. The model pipeline for Supervised Learning
Our idea was to create a script that can take any dataset and automatically search for the best algorithm by iterating through all classifiers/regressors and hyperparameters, based on user-defined metrics.
Our initial approach was to code from scratch iterating individual algorithms from packages (e.g. sklearn, XGBoost and LightGBM) but then we came across Auto ML packages that already do what we wanted to build. Thus, we decided to use those readily available packages instead and not spend time reinventing the wheel.
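For readers curious what the from-scratch approach might look like, here is a minimal sketch using only scikit-learn. It is not the code we wrote and not what TPOT or auto-sklearn do internally; the candidate list, the grids, and the helper name `search_best_model` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumed candidate estimators with small hyperparameter grids.
CANDIDATES = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50], "max_depth": [3, None]},
    ),
]


def search_best_model(X, y, scoring="roc_auc", cv=3):
    """Grid-search every candidate and return the best fitted search."""
    best = None
    for estimator, grid in CANDIDATES:
        search = GridSearchCV(estimator, grid, scoring=scoring, cv=cv)
        search.fit(X, y)
        if best is None or search.best_score_ > best.best_score_:
            best = search
    return best


# Synthetic stand-in for a labeled credit dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
best = search_best_model(X, y)
print(type(best.best_estimator_).__name__, best.best_params_)
```

The `scoring` argument plays the role of the user-defined metric. AutoML packages go further by also searching over preprocessing steps and by using smarter search strategies than an exhaustive grid.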
We used two different AutoML packages: TPOT and auto-sklearn. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.
Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning. It leverages recent advances in:
- Bayesian optimization
- Ensemble construction
TPOT and auto-sklearn are similar, but TPOT stands out for its reproducibility: it can generate both the model and a Python script that reproduces it.
3. Unsupervised Learning
In the beginning, we used agglomerative clustering (a form of hierarchical clustering) since the preprocessed dataset contains a mix of continuous and categorical variables. As we had generated many features from the dataset (some of them very similar, differing only by small variations in their definition), we first had to eliminate most of the correlated ones. Without this, the algorithm would struggle to find the optimal number of groupings. After this step, we were left with the following groups of features:
- count of transactions per month (cpma),
- average increase/decrease in value of specific transactions (delta),
- average monthly specific transaction amount (monthly amount),
and three single specific features:
- Is Management: assumed managerial role,
- Potential overspend: value estimating assumed monthly salary versus expenses from the dataset,
- Spend compare: how a customer's spending (including cash withdrawals) differs from average spending within similar job titles.
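The elimination of correlated features mentioned above can be sketched as follows. This is one common approach, assuming numeric features and an absolute Pearson correlation threshold of 0.9; the threshold, the helper name `drop_correlated`, and the toy column names are illustrative and not taken from the project notebook.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair correlated above the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: the second column is a near-copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "cpma_transfer": base,
    "cpma_transfer_alt": 2.0 * base + rng.normal(scale=0.01, size=200),
    "monthly_amount": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
```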
In a range of potential clusters from 2 to 14, the average silhouette score was best with 8 clusters, at 0.1027. The customer data was sliced into 2 large groups and 6 much smaller ones, which was what we were looking for (smaller groups could be considered anomalous).
This was still not a satisfactory result. On practical grounds, describing clusters 3 to 8 proved challenging, which is consistent with the relatively low clustering score.
It has to be remembered that the prime reason for clustering was to find reasonably small and describable anomalous groupings of customers.
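The sweep over cluster counts described above can be sketched with scikit-learn as below. The data is synthetic, the features are assumed to be numeric and standardized, and the default Ward linkage is used; categorical variables would need encoding or a different distance first, which this sketch does not cover.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def sweep_cluster_counts(X, k_min=2, k_max=14):
    """Return the best cluster count and the silhouette score per count."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the engineered customer features.
X_raw, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=1)
X = StandardScaler().fit_transform(X_raw)

best_k, scores = sweep_cluster_counts(X)
print(best_k, round(scores[best_k], 4))
```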
We therefore decided to apply DBSCAN, an algorithm that handles outliers within a dataset efficiently. Since the silhouette score is well suited to convex clusters and DBSCAN is known to return complex non-convex clusters, we skipped calculating clustering scores and focused on analyzing the clusters returned by the algorithm.
Varying the parameters of DBSCAN, we found the clustering results were stable: the clusters contained similar counts, and customers did not move between non-anomalous and anomalous clusters.
Also, when analyzing and trying to describe the various clusters, we found it easier to describe the qualities of each cluster, for example:
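One way to check this kind of stability is to rerun DBSCAN with slightly different `eps` values and compare which points end up labeled as noise (cluster -1). The sketch below does this on synthetic data; the `eps` values, the Jaccard overlap measure, and the helper names are our assumptions for illustration, not the exact procedure used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def anomalous_indices(X, eps, min_samples=5):
    """Return the row indices DBSCAN labels as noise (cluster -1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return set(np.flatnonzero(labels == -1).tolist())


def jaccard(a, b):
    """Overlap of two index sets: 1.0 means identical membership."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Synthetic data: two dense groups plus a few scattered outliers.
blobs, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6]], cluster_std=0.5, random_state=0
)
scattered = np.random.default_rng(0).uniform(-6, 12, size=(15, 2))
X = StandardScaler().fit_transform(np.vstack([blobs, scattered]))

eps_values = [0.25, 0.30, 0.35]
noise_sets = [anomalous_indices(X, eps) for eps in eps_values]
overlaps = [jaccard(a, b) for a, b in zip(noise_sets, noise_sets[1:])]
print([len(s) for s in noise_sets], overlaps)
```

High overlap between neighboring settings suggests that the anomalous group is not an artifact of one particular parameter choice.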
- one small group, contrary to most groups, had no purchase, no payment transactions, and no cash withdrawals, but very few, relatively high transfers by mobile channel,
- another small group also had no purchase and no payment transactions, however, made cash withdrawals,
- yet another small group had the highest zakat payments (for religious causes) and high amount of mobile transactions per month,
- the group considered anomalous (the cluster coded -1), with over 300 customers, stood out with falling values across most types of transactions (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.
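Descriptions like these usually start from a per-cluster summary table. A minimal pandas sketch follows, with hypothetical column names and toy values; the real profiles were built from the engineered features listed earlier.

```python
import pandas as pd

# Hypothetical toy data: one row per customer with its cluster label.
customers = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, -1],
    "purchases_cpma": [4, 5, 6, 0, 0, 2],
    "cash_withdrawal_amount": [100.0, 120.0, 80.0, 300.0, 500.0, 50.0],
})


def profile_clusters(df, label_col="cluster"):
    """Return cluster sizes and per-cluster feature means."""
    feature_cols = [c for c in df.columns if c != label_col]
    grouped = df.groupby(label_col)
    profile = grouped[feature_cols].mean()
    profile.insert(0, "size", grouped.size())
    return profile


profile = profile_clusters(customers)
print(profile)
```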
It is important to note that, for various sets of features within the data provided here, both the hierarchical and the DBSCAN methods returned even better clustering scores. However, at this level of anonymity (i.e. without ground truth information), one cannot decide on the best split of customers. It might transpire that a rather different set of features best splits customers and provides better entropy scores for these groups, calculated on the creditworthiness category.
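If creditworthiness labels become available, the entropy check mentioned above could be computed along these lines. This is a sketch under the assumption that each customer has one cluster label and one categorical creditworthiness class; the function names and the "good"/"bad" classes are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_cluster_entropy(clusters, classes):
    """Size-weighted average entropy of the classes inside each cluster."""
    groups = defaultdict(list)
    for cluster, cls in zip(clusters, classes):
        groups[cluster].append(cls)
    n = len(classes)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# A split that separates the classes perfectly has zero entropy.
pure = weighted_cluster_entropy(
    [0, 0, 0, 1, 1, 1], ["good", "good", "good", "bad", "bad", "bad"]
)

# A split that mixes the classes evenly has the maximum of 1 bit.
mixed = weighted_cluster_entropy([0, 0, 1, 1], ["good", "bad", "good", "bad"])
```

Lower weighted entropy means the clusters line up better with creditworthiness, which is the criterion for comparing feature sets that the unlabeled data could not provide.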