AI For Financial Inclusion: Credit Scoring for Banking the Unbankable


Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.

 

 

The background

With a traditional credit scoring system, it is essential to have a bank account and regular transactions. But there are groups of people, especially in developing nations, who still do not have a bank account for a variety of reasons: some do not see the need for one, some are unable to produce the necessary documents, for some the cost of opening an account is too high, some lack the knowledge or awareness needed to open one, some have trust issues with banks, and some are unemployed.

Some of these individuals may need loans for essentials, perhaps to start a business, or, like farmers, to buy fertilizer or seeds. Many of them may be reliable borrowers, but because they do not get access to funding, they are pushed to take out high-cost loans from non-traditional, often predatory lenders.

Many low-income individuals are perfectly capable of managing their personal finances. We need an ethical credit scoring AI system to help these borrowers and keep them from falling into deeper debt.

Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.

 

The problem statement

 

The goal was to determine the creditworthiness of an unbanked customer using both alternative and traditional credit scoring data and methods. The data was focused on Indonesia, but the following approach is applicable in other countries.

It was a challenging project. I believe everyone should be eligible for a loan for essential business ventures, but they should be able to pay it back without having to pay exorbitant interest rates. Finding that balance was crucial for our project.

 

The data

We were given three datasets:

1) Transactions

Information on transactions made by different account numbers, the region, mode of transaction, etc.

 

2) Per capita income per area

3) Job titles of the account numbers

All data given to us was anonymized and privacy-law compliant; privacy was imperative and not an afterthought.

Going through the data, we understood we had to use unsupervised learning, since the data was not labeled.

Some of us compared openly available datasets to the dataset we had at hand, while others started working on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with the silhouette score, a heuristic for judging whether the parameters we had would produce significant clusters. The best value is 1, with well-separated clusters; the worst is -1, with strongly overlapping ones. We got average values close to 0, which was not satisfactory.
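As a reference point, this is roughly how the score was computed with scikit-learn; the feature matrix and the KMeans call below are placeholders, not the actual pipeline we ran.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder matrix standing in for the engineered transaction features;
# KMeans is used here only to illustrate how the score is computed.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# silhouette_score ranges from -1 (strongly overlapping) to 1 (well separated);
# structure-less data like this lands near 0, which is roughly what we saw.
print("silhouette:", silhouette_score(X, labels))
```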

 

Feature engineering

With the given data we performed feature engineering. We calculated a per capita income score and segregated management roles from other roles. The per capita income score lets us bucket accounts by area according to how likely those areas are to contain reliable customers. For example, management roles imply a better income and thus a better ability to pay back a loan.
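A minimal sketch of that kind of feature engineering, assuming hypothetical column names and values (the real data differs): join the per-area income table onto the accounts, flag managerial job titles, and scale the area income into a score.

```python
import pandas as pd

# Hypothetical account-level frame: job title and home area per account number.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3],
    "job_title": ["Operations Manager", "Farmer", "Teacher"],
    "area": ["Jakarta", "Sumatra", "Bali"],
})

# Hypothetical per-area income table standing in for dataset 2.
income = pd.DataFrame({
    "area": ["Jakarta", "Sumatra", "Bali"],
    "per_capita_income": [62_000_000, 28_000_000, 35_000_000],
})

features = accounts.merge(income, on="area", how="left")

# Flag assumed managerial roles: higher expected ability to repay.
features["is_management"] = features["job_title"].str.contains("manager", case=False)

# Per capita income score: min-max scale the area income into [0, 1].
inc = features["per_capita_income"]
features["income_score"] = (inc - inc.min()) / (inc.max() - inc.min())
print(features)
```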

 

 

 

But even with all the feature engineering, we were unable to get a signal for clustering from the given data. How did we proceed?

We scraped data online from sites like Indeed and Numbeo. Given these challenges, we were not able to hand the customer a single finished solution and had to improvise to provide a plan for future analysis, so we used dummy data.

From Numbeo we got the cost of living per area, i.e. how much people spend on living expenses. From Indeed we got salary data, so we could assign an average salary to each job title.
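A sketch of how the scraped figures could be attached to the accounts; the tables and numbers below are illustrative stand-ins for the Numbeo and Indeed data, and the "potential overspend" column mirrors one of the features described later.

```python
import pandas as pd

# Illustrative stand-ins for the scraped Numbeo (cost of living per area)
# and Indeed (average salary per job title) data; all values are made up.
cost_of_living = pd.DataFrame({
    "area": ["Jakarta", "Bali"],
    "monthly_cost_of_living": [7_500_000, 6_000_000],
})
salaries = pd.DataFrame({
    "job_title": ["Teacher", "Operations Manager"],
    "avg_monthly_salary": [4_500_000, 12_000_000],
})

accounts = pd.DataFrame({
    "account_id": [10, 11],
    "job_title": ["Teacher", "Operations Manager"],
    "area": ["Bali", "Jakarta"],
})

enriched = (accounts
            .merge(salaries, on="job_title", how="left")
            .merge(cost_of_living, on="area", how="left"))

# A simple affordability signal: estimated living costs minus estimated salary.
enriched["potential_overspend"] = (enriched["monthly_cost_of_living"]
                                   - enriched["avg_monthly_salary"])
print(enriched)
```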

 

 

 

With the data scraped online and the features engineered from the given dataset, we tried to figure out whether we could get predictions using clustering algorithms.

 

The solutions

  • Engineered Features & Clusters (from Datasets given)
  • Machine Learning Pipelines/Toolkit (for Datasets not provided)
  • Unsupervised Learning Pipeline
  • Supervised Learning Pipeline (TPOT/auto-sklearn)

 

1. Engineered features & clusters

 

 

As mentioned above, with the context we gathered from Creedix, we engineered or aggregated many features based on the transaction time series dataset. Although these features describe each customer better, we can only guess at the importance of each feature for a customer's credit score based on our research. So we consolidated features for each customer based on our research on credit scoring AI. The importance of each feature for credit scoring in Indonesia will be up to the Creedix team to decide.

For example,

CreditScore = 7*Salary + 0.5*Zakat + 4000*Feature1 + … + 5000*Feature6
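As a sketch only, such a formula could be expressed as a configurable weighted sum, with the weights left for the Creedix team to set; the names and numbers below are placeholders taken from the illustrative formula above.

```python
# Hypothetical weights mirroring the illustrative formula above; the actual
# importance of each feature is left for the Creedix team to decide.
weights = {"salary": 7.0, "zakat": 0.5, "feature_1": 4000.0, "feature_6": 5000.0}

def credit_score(customer: dict) -> float:
    """Weighted linear score over engineered features (illustrative only)."""
    return sum(weights[name] * customer.get(name, 0.0) for name in weights)

print(credit_score({"salary": 3_000_000, "zakat": 50_000,
                    "feature_1": 0.3, "feature_6": 0.1}))
```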

The solutions given to Creedix covered both supervised learning and unsupervised learning. Even after all the feature engineering and the data found online, we were still getting a low silhouette score, signifying overlapping clusters.

So we decided to provide solutions for both supervised learning (using AutoML) and unsupervised learning, both built on dummy data; the purpose was to serve future analysis and modeling by the Creedix team.

The dataset we used for Supervised Learning — https://www.kaggle.com/c/GiveMeSomeCredit/data

For supervised learning, we built models with both TPOT and auto-sklearn. This was done so that, once Creedix has access to more features and target variables that may not be available to Omdena collaborators, they can use these pipelines to build their own models.

 

2. The model pipeline for Supervised Learning

Our idea was to create a script that can take any dataset and automatically search for the best algorithm by iterating through classifiers/regressors and hyperparameters based on user-defined metrics.

Our initial approach was to code this from scratch, iterating over individual algorithms from packages such as sklearn, XGBoost, and LightGBM, but then we came across AutoML packages that already do what we wanted to build. We therefore decided to use those readily available packages instead of spending time reinventing the wheel.

We used two different AutoML packages: TPOT and auto-sklearn. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.

 

 

Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning. It leverages recent advances:

  • Bayesian optimization
  • Meta-learning
  • Ensemble construction

Both TPOT and auto-sklearn are similar, but TPOT stands out between the two due to its reproducibility: it generates both the model and the Python script needed to reproduce it.
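A hedged sketch of how such a TPOT run could look on the Kaggle dataset linked above (classic, pre-1.0 TPOT API); the file name cs-training.csv and the target column SeriousDlqin2yrs come from that Kaggle competition, and the search parameters here are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Kaggle "Give Me Some Credit" training file; SeriousDlqin2yrs is the target
# (1 = serious delinquency within two years).
df = pd.read_csv("cs-training.csv", index_col=0).dropna()
y = df["SeriousDlqin2yrs"]
X = df.drop(columns=["SeriousDlqin2yrs"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# TPOT searches preprocessing + model + hyperparameter combinations with
# genetic programming, optimising the chosen metric (ROC AUC here).
tpot = TPOTClassifier(generations=5, population_size=50,
                      scoring="roc_auc", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print("hold-out score:", tpot.score(X_test, y_test))

# The winning pipeline can be exported as a standalone Python script,
# which is the reproducibility advantage mentioned above.
tpot.export("best_credit_pipeline.py")
```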

 

3. Unsupervised Learning

In the beginning, we used agglomerative clustering (a form of hierarchical clustering), since the preprocessed dataset contained a mix of continuous and categorical variables. As we had generated many features from the dataset (some of them very similar, based on small variations in their definitions), we first had to eliminate most of the correlated ones; without this, the algorithm would struggle to find the optimal number of groupings. After this step, we were left with the following groups of features:

  • count of transactions per month (cpma),
  • average increase/decrease in value of specific transactions (delta),
  • average monthly specific transaction amount (monthly amount),

and three single specific features:

  • Is Management — assumed managerial role,
  • Potential overspend — value estimating assumed monthly salary versus expenses from the dataset,
  • Spend compare — how customer’s spending (including cash withdrawals) differs from average spending within similar job titles.

 

Across a range of potential cluster counts from 2 to 14, the average silhouette score was best with 8 clusters, at 0.1027. The customer data was split into 2 large groups and 6 much smaller ones, which was what we were looking for (the smaller groups could be considered anomalous).
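A sketch of that cluster-count sweep, assuming a placeholder feature frame standing in for the engineered cpma / delta / monthly-amount features: drop highly correlated columns, then score agglomerative clustering over 2 to 14 clusters with the silhouette score.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Placeholder standing in for the engineered customer feature frame.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(1000, 12)),
                        columns=[f"f{i}" for i in range(12)])

X = StandardScaler().fit_transform(drop_correlated(features))

# Sweep cluster counts 2..14 and report the silhouette score for each.
for k in range(2, 15):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 4))
```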

 

 

This was still not a satisfactory result. On practical grounds, describing clusters 3 to 8 proved challenging, which is consistent with the relatively low clustering score.

 

 

It has to be remembered that the prime reason for clustering was to find reasonably small and describable anomalous groupings of customers.

We therefore decided to apply an algorithm that is efficient at handling outliers within a dataset: DBSCAN. Since the silhouette score is well suited to convex clusters and DBSCAN is known to return complex non-convex clusters, we forwent calculating clustering scores and focused on analyzing the clusters returned by the algorithm.

Manipulating the parameters of DBSCAN, we found the clustering results were stable: the clusters contained similar counts, and customers did not move between non-anomalous and anomalous clusters.
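A sketch of that stability check, assuming the same kind of scaled, decorrelated feature matrix as above: scan a few eps / min_samples settings and compare the resulting cluster sizes, with label -1 treated as the anomalous group.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Placeholder standing in for the scaled, decorrelated customer feature matrix.
rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 8)))

# Vary the neighbourhood radius (eps) and density threshold (min_samples) and
# compare cluster sizes; stable counts across settings suggest robust groupings.
for eps in (0.8, 1.0, 1.2):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        sizes = pd.Series(labels).value_counts().sort_index()
        # Label -1 is DBSCAN's noise/outlier bucket, i.e. the candidate anomalous group.
        print(f"eps={eps}, min_samples={min_samples} -> {sizes.to_dict()}")
```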

Analyzing and trying to describe the various clusters, we also found it easier to characterize the qualities of each one, for example:

  • one small group, contrary to most groups, had no purchases, no payment transactions, and no cash withdrawals, but a very small number of relatively high transfers via the mobile channel,
  • another small group also had no purchases and no payment transactions, but did make cash withdrawals,
  • yet another small group had the highest zakat payments (for religious causes) and a high number of mobile transactions per month,
  • the group considered anomalous (the cluster coded -1), with over 300 customers, differentiated itself with falling values across most types of transactions (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.

 

 

It is important to note that, for various other sets of features within the data provided, both the hierarchical and the DBSCAN methods returned even better clustering scores. However, at this level of anonymity (i.e. without ground truth information), one cannot decide on the best split of customers. It might transpire that a rather different set of features best splits customers and provides better entropy scores for these groups when calculated against the creditworthiness category.

Analysing Sexual Abuse at the Workplace Using Supervised Learning


How oversampling and supervised learning yielded great results for classifying cases of sexual abuse.

 

By Mertcan Coskun

 

 

Nowadays, the severity of sexual abuse is gaining more and more attention, not just in the USA but throughout the whole world.

To help combat the problem, I joined an Omdena project together with the Zero Abuse Project. Among 45 Omdena collaborators from across 6 continents, the goal was to build AI models that identify patterns in the behavior of institutions when they cover up incidents of sexual abuse.

 

My task: Overcoming an imbalanced data set

From a data science perspective, sexual abuse is an imbalanced data problem, meaning there are few (known) instances of harassment in the entire dataset.

An imbalanced problem is defined as a dataset with disproportionate class counts. Oversampling is one way to combat this, by creating synthetic minority samples.

Together with other collaborators, I worked on an AI tool that evaluates the risk factors that suggest potential predatory individuals within an organization and those associated with the cover-up.

Our data consists of instances of sexual abuse at work and their features. The data was provided by UNICEF.

An instance of sexual harassment is a reported case of sexual harassment that has been concluded by law enforcement. The risk factors are our features in the dataset; they include the state, the number of relocations, and the institution the person is connected to.

Since the nature of the data is sensitive and unique, I predicted probabilities rather than classes as the prediction output. In questions like this, predicting a hard 0 or 1 may be too controversial.

To cope with the data imbalance and the sensitivity, I decided to apply oversampling and implement a random forest model (supervised learning) to analyze the sexual abuse patterns.

 

The power of oversampling

SMOTE (Synthetic Minority Over-sampling Technique) is a common oversampling method widely used in machine learning with imbalanced, high-dimensional data. SMOTE creates synthetic minority samples using the popular k-nearest-neighbors algorithm: for each minority sample, it picks one of its nearest minority-class neighbors and generates a new synthetic point along the line joining the two.

In other words, k-nearest-neighbors draws a line between minority points, and new points are generated along that line. Many different versions of SMOTE have since been built on top of the classic formula. Let's visualize how oversampling affects the data in general.

 

Visual representation of data without oversampling

 

 

Visual representation of data with oversampling

 

For visualization's sake, two features are picked, and from their distribution it is clear that the minority samples now match the majority sample count.
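A minimal sketch of the resampling step itself, using imblearn's SMOTE on synthetic data (the project used the ProWSyn variant from the smote_variants package, shown further below); the class-count printout is the before/after picture in tabular form.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the imbalanced harassment dataset: ~5% positive class.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)
print("before:", Counter(y))

# SMOTE interpolates between a minority sample and one of its k nearest
# minority neighbours to create new synthetic minority points.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```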

 

Impact on the predictions

Let's compare the predictive power with and without oversampling. A random forest is used as the predictor in both cases. The ProWSyn variant was selected as the highest-performing oversampling method after all the methods were compared using this[1] Python package.

Let’s check the performance of models pre and post oversampling.

 

ROCAUC without oversampling

 

 

ROCAUC with oversampling

 

With ProWSyn oversampling implemented, we see a 13-percentage-point increase in the ROC AUC score (the area under the receiver operating characteristic curve), from 84% to 97%. I was also able to decrease the Brier score[3], a metric for probability predictions, by 5%.

As you can see from the results, oversampling can significantly boost model performance when you have to deal with an imbalanced dataset. In my case, the ProWSyn version of SMOTE performed best, but this always depends on the data, and you should try different versions to see which one works best for you.

 

What is ProWSyn and why does it work so well?

Most oversampling methods lack a proper process for assigning correct weights to minority samples, which results in a poor distribution of generated synthetic samples. The Proximity Weighted Synthetic Oversampling Technique (ProWSyn) generates effective weight values for the minority samples based on each sample's proximity information, i.e. its distance from the boundary, which results in a proper distribution of the generated synthetic samples across the minority data set.[2]
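A hedged sketch of the before/after comparison, using ProWSyn from the smote_variants package [1] together with a random forest, ROC AUC, and the Brier score; the feature matrix here is a synthetic placeholder, and the exact smote_variants API may differ slightly between versions.

```python
import smote_variants as sv
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the real (sensitive) dataset.
X, y = make_classification(n_samples=3000, n_features=12, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

def fit_and_report(X_tr, y_tr, label):
    """Fit a random forest and report ROC AUC and Brier score on the hold-out set."""
    model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    proba = model.predict_proba(X_test)[:, 1]  # probabilities, not hard classes
    print(label, "ROC AUC:", round(roc_auc_score(y_test, proba), 3),
          "Brier:", round(brier_score_loss(y_test, proba), 3))
    return model

fit_and_report(X_train, y_train, "no oversampling:")

# Oversample only the training split with ProWSyn, then refit and re-evaluate.
X_samp, y_samp = sv.ProWSyn().sample(X_train, y_train)
model = fit_and_report(X_samp, y_samp, "ProWSyn oversampled:")
```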

 

What is the output?

 

x: predicted probability; y: number of instances

 

After the prediction, the histogram of predicted probabilities looks like the image above. The distribution turned out to be the way I imagined. The model has learned from the many features, and it turns out there is a correlation within the feature space that creates a distinct difference between classes 0 and 1. In simpler terms, there is a pattern in the features of the 0 and 1 classes.
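Such a histogram can be produced directly from the classifier's predicted probabilities; this sketch assumes a fitted classifier `model` and a held-out feature matrix `X_test` like the ones in the earlier sketch.

```python
import matplotlib.pyplot as plt

# Assumes a fitted classifier `model` and held-out features `X_test`
# (e.g. from the ProWSyn sketch above).
proba = model.predict_proba(X_test)[:, 1]

plt.hist(proba, bins=20)
plt.xlabel("predicted probability of class 1")
plt.ylabel("number of instances")
plt.title("Distribution of predicted probabilities")
plt.show()
```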

More care has to be put into probabilities very close to 1 (100% probability). From the histogram above, we can see that the number of points near 100% probability is quite high. It is acceptable to dismiss someone as a non-predator but much harder to accuse someone, so that number should be lower.

 

What’ next?

I shared a description of applying supervised learning to sexual abuse data.

I was able to identify the main problem, which was the class imbalance in the target values. Since predicting probabilities on such a sensitive subject requires a well-functioning and well-thought-out model, I wanted to fix the biggest problem by creating synthetic instances of sexual harassment in the dataset and having the model learn from them. As a result, the predicted probabilities, or red flags, showed an improved Brier score and a higher AUC, which means better probability prediction performance.

In plain English, these high scores mean much better predictive performance. But this is a double-edged sword, as the model would flag a large number of entities as highly probable cases of sexual harassment on future data.

Since this machine learning task is much more sensitive than, for example, predicting the price of second-hand cars, these high probabilities may lead to complications. Having more training data and using a very high decision threshold may overcome this problem.

 

About Omdena

Want to become an Omdena Collaborator and join one of our real-world projects? Apply here.

 

Sources

[1] https://github.com/analyticalmindsltd/smote_variants

[2] https://link.springer.com/chapter/10.1007/978-3-642-37456-2_27

[3] https://en.wikipedia.org/wiki/Brier_score
