AI For Financial Inclusion: Credit Scoring for Banking the Unbankable


Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.

 

 

The background

With traditional credit scoring systems, it is essential to have a bank account and regular transactions. But many people, especially in developing nations, still do not have a bank account, for a variety of reasons: some do not see the need for one, some cannot produce the necessary documents, for some the cost of opening an account is too high, some lack the knowledge or awareness to open one, some have trust issues with banks, and some are unemployed.

Some of these individuals may need loans for essentials, for example to start a business, or, like farmers, to buy fertilizer or seeds. Many of them may be reliable borrowers, but because they do not get access to funding, they are pushed to take out high-cost loans from non-traditional, often predatory lenders.

Low-income individuals often have an aptitude for managing their personal finances. What is needed is an ethical credit scoring AI system that helps these borrowers and keeps them from falling into deeper debt.

Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.

 

The problem statement

 

The goal was to determine the creditworthiness of an unbanked customer using alternative as well as traditional credit scoring data and methods. The data was focused on Indonesia, but the following approach is applicable in other countries.

It was a challenging project, and I believe everyone should be eligible for a loan for essential business ventures, but borrowers should also be able to pay it back without having to pay exorbitant interest rates. Finding that balance was crucial for our project.

 

The data

Three datasets were given to us:

1) Transactions

Information on transactions made by different account numbers, the region, mode of transaction, etc.

 

2) Per capita income per area

All the data is privacy law compliant.

 

3) Job titles associated with the account numbers

All data given to us was anonymous as privacy was imperative and not an afterthought.

Going through the data we understood we had to use unsupervised learning since the data was not labeled.

Some of us compared publicly available datasets to the dataset we had at hand, while others worked on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with the silhouette score, a heuristic for judging whether the parameters at hand would produce meaningful clusters. The best value is 1, indicating well-separated clusters, and the worst is -1, indicating strongly overlapping ones. Our average values were close to 0, which was not satisfactory.
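For reference, the scoring step looks roughly like the sketch below; `X` is a random placeholder standing in for whichever scaled feature matrix was being evaluated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder feature matrix; in practice this was built from the transaction data.
X = np.random.rand(500, 6)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # ~1 = well separated, ~0 = overlapping, -1 = misassigned
```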

 

Feature engineering

With the given data we performed feature engineering. We calculated a per capita income score so that we could bucket accounts by whether their area is likely to contain reliable customers, and we segregated management roles from other roles. For example, management roles imply a better income with which to pay back a loan.
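A minimal pandas sketch of this kind of feature engineering is shown below; the column names and values are invented for illustration and will differ from the actual Creedix data.

```python
import pandas as pd

# Invented example rows; the real datasets are anonymized account-level data.
accounts = pd.DataFrame({
    'account_id': [1, 2, 3],
    'area':       ['Jakarta', 'Bandung', 'Surabaya'],
    'job_title':  ['Branch Manager', 'Farmer', 'Teacher'],
})
income = pd.DataFrame({
    'area':              ['Jakarta', 'Bandung', 'Surabaya'],
    'per_capita_income': [250, 120, 150],  # illustrative units
})

df = accounts.merge(income, on='area', how='left')

# Per capita income score: area income relative to the median, used to bucket
# accounts by how likely their area is to contain reliable customers.
df['income_score'] = df['per_capita_income'] / df['per_capita_income'].median()

# Flag managerial roles, assumed to come with a better income to pay back a loan.
df['is_management'] = df['job_title'].str.contains('manager|head|director',
                                                   case=False, na=False)
print(df[['account_id', 'income_score', 'is_management']])
```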

 

 

 

But even with all the feature engineering, we were unable to get a signal from the data given for clustering. How did we proceed?

We scraped data online from sites like Indeed and Numbeo. Because of these challenges we were not able to give one finished solution to the customer and had to improvise a plan for future analysis, so we used dummy data.

From Numbeo we scraped the cost of living per area, i.e., how much people spend on living expenses. From Indeed we got salary data to assign an average salary to each job title.

 

 

 

With the data scraped online and the features engineered from the given dataset, we tried to figure out whether we could get a meaningful prediction using clustering algorithms.

 

The solutions

  • Engineered Features & Clusters (from Datasets given)
  • Machine Learning Pipelines/Toolkit (for Datasets not provided)
  • Unsupervised Learning Pipeline
  • Supervised Learning Pipeline (TPOT/auto-sklearn)

 

1. Engineered features & clusters

 

 

As mentioned above, with the context we gathered from Creedix, we engineered or aggregated many features based on the transaction time series dataset. Although these features describe each customer better, we can only guess the importance of each feature for a customer's credit score based on our research. So we consolidated the features for each customer based on our research on credit scoring AI; the weight of each feature for credit scoring in Indonesia will be up to the Creedix team to decide.

For example,

CreditScore = 7*Salary + 0.5*Zakat + 4000*Feature1 + … + 5000*Feature6

Solutions given to Creedix covered both supervised and unsupervised learning. Even after all the feature engineering and the data found online, we were still getting a low silhouette score, signifying overlapping clusters.

So we decided to provide solutions for supervised learning using AutoML and for unsupervised learning, both using dummy data; the purpose was to serve future analysis and future modeling by the Creedix team.

The dataset we used for Supervised Learning — https://www.kaggle.com/c/GiveMeSomeCredit/data

With supervised learning, we did modeling with both TPOT and auto-sklearn. This was done so that, once Creedix has target variables and additional features that are accessible to them but may not be to Omdena collaborators, they can use these pipelines to build their own models.

 

2. The model pipeline for Supervised Learning

Our idea was to create a script that can take any dataset and automatically search for the best algorithm by iterating through classifiers/regressors and hyperparameters based on user-defined metrics.

Our initial approach was to code this from scratch, iterating over individual algorithms from packages such as sklearn, XGBoost, and LightGBM, but then we came across AutoML packages that already do what we wanted to build. Thus, we decided to use those readily available packages instead and not spend time reinventing the wheel.

We used two different AutoML packages: TPOT and auto-sklearn. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.

 

 

Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning (a minimal usage sketch follows the list below). It leverages recent advances in:

  • Bayesian optimization
  • Meta-learning
  • Ensemble construction
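A minimal auto-sklearn usage sketch, assuming the package is installed; the time budgets are illustrative and a scikit-learn toy dataset stands in for real credit data.

```python
import autosklearn.classification
from sklearn.datasets import load_breast_cancer  # stand-in for real credit data
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Auto-sklearn handles algorithm selection, hyperparameter tuning (Bayesian
# optimization + meta-learning) and ensemble construction within the budget.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300, per_run_time_limit=30)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```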

Both TPOT and auto-sklearn are similar, but TPOT stands out between the two due to its reproducibility: it generates not only the model but also a Python script that reproduces it.
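Below is a minimal TPOT sketch against the Kaggle "Give Me Some Credit" training file linked above (its target column is SeriousDlqin2yrs); the search budgets are illustrative, and the naive fillna(0) is only there to keep the sketch short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# cs-training.csv is the training file from the Kaggle competition linked above.
data = pd.read_csv('cs-training.csv', index_col=0).fillna(0)
X = data.drop(columns='SeriousDlqin2yrs')
y = data['SeriousDlqin2yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20,
                      scoring='roc_auc', verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Reproducibility: TPOT exports the winning pipeline as a plain Python script.
tpot.export('best_credit_pipeline.py')
```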

 

3. Unsupervised Learning

In the beginning, we used agglomerative clustering (a form of hierarchical clustering), since the preprocessed dataset contains a mix of continuous and categorical variables. Because we had generated many features from the dataset (some of them very similar, based on small variations in their definition), we first had to eliminate most of the correlated ones; without this, the algorithm would struggle to find the optimal number of groupings. After this step, we were left with the following groups of features:

  • count of transactions per month (cpma),
  • average increase/decrease in the value of specific transactions (delta),
  • average monthly amount of specific transactions (monthly amount),

and three single specific features:

  • Is Management — assumed managerial role,
  • Potential overspend — value estimating assumed monthly salary versus expenses from the dataset,
  • Spend compare — how customer’s spending (including cash withdrawals) differs from average spending within similar job titles.

 

In a range of potential cluster counts from 2 to 14, the average silhouette score was best with 8 clusters, at 0.1027. The customer data was sliced into 2 large groups and 6 much smaller ones, which was what we were looking for (smaller groups could be considered anomalous).
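A minimal sketch of this scan is shown below; `features` is a random placeholder for the de-correlated customer feature matrix described above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder for the de-correlated, customer-level feature matrix.
features = np.random.rand(1000, 9)
X = StandardScaler().fit_transform(features)

scores = {}
for k in range(2, 15):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # on the real data the best result was 8 clusters (~0.10)
```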

 

 

This was still not a satisfactory result. On practical grounds, describing clusters 3 to 8 proved challenging, which is consistent with the relatively low clustering score.

 

 

It has to be remembered that the prime reason for clustering was to find reasonably small and describable anomalous groupings of customers.

We therefore decided to apply an algorithm that is efficient at handling outliers within a dataset: DBSCAN. Since the silhouette score is well suited to convex clusters and DBSCAN is known to return complex non-convex clusters, we forgo calculating clustering scores and focus on analyzing the clusters returned by the algorithm.
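A minimal DBSCAN sketch is below; `features` is again a placeholder for the customer feature matrix, and the eps / min_samples values are illustrative stand-ins for parameters that in practice were tuned by hand.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Placeholder for the customer-level feature matrix used for clustering.
features = np.random.rand(1000, 9)
X = StandardScaler().fit_transform(features)

# eps and min_samples are illustrative; DBSCAN labels outliers as -1.
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)

# Small positive-labelled clusters and the -1 group are the candidate
# anomalous customer groupings; large clusters are the "normal" customers.
print(pd.Series(labels).value_counts())
```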

Manipulating the parameters of DBSCAN, we found the clustering results were stable: the clusters contained similar counts, and customers did not move between non-anomalous and anomalous clusters.

When analyzing and trying to describe the various clusters, we also found it easier to characterize each cluster, for example:

  • one small group, contrary to most groups, had no purchases, no payment transactions, and no cash withdrawals, but a few relatively high transfers through the mobile channel,
  • another small group also had no purchase and no payment transactions, but did make cash withdrawals,
  • yet another small group had the highest zakat payments (for religious causes) and a high number of mobile transactions per month,
  • the group considered anomalous (the cluster coded -1), with over 300 customers, differentiated itself with falling values across most transaction types (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.

 

 

It is important to note that, for various other sets of features within the data provided, both the hierarchical and the DBSCAN methods returned even better clustering scores. However, at this level of anonymity (i.e., without ground-truth information), one cannot decide on the best split of customers. It might transpire that a rather different set of features best splits customers and provides better entropy scores for these groups when calculated against the creditworthiness category.

Using Unsupervised Learning on Satellite Images to Identify Climate Anomalies


 

This work is a part of Omdena’s AI project with the United Nations High Commissioner for Refugees. The objective was to predict forced displacements and violent conflicts as a result of climate change and natural disasters in Somalia.

By Animesh Seemendra

We used unsupervised learning techniques on satellite images to capture sudden environmental changes (after-effects of natural disasters or conflicts) so that immediate relief can be provided to the people affected. The solution functions as an alert system.

 

The problem

Somalia is a country in the Horn of Africa. It experiences many natural disasters and terrorist attacks, as a result of which the people of Somalia go through mass displacements, leading to a lack of food and shelter.

This article shows how to build an anomaly detection system using Machine Learning. The system is capable of capturing sudden vegetation changes, which can be used as an alert mechanism to provide immediate relief to the people and communities in need.

 

 

What is Anomaly Detection?

Anomaly detection using satellite images is an area where a lot of research is happening to discover new and better methods.

We approached the problem with unsupervised learning techniques, namely Principal Component Analysis and K-Means. In the case of anomaly detection, unsupervised learning takes multi-temporal images and finds the changes between them. The output map highlights regions of change, which could be used to alert representatives at UNHCR if any major deviation occurs between two consecutive temporal images.

 


Fig 2: In 2017 Bomb Attack in Mogadishu (Somalia) Kills 276

 

The approach

First try: Convolutional Neural Networks

The first approach that I came up with was to use deep learning, namely CNN+LSTM, where the CNN could extract relevant features from the images and the LSTM could learn the sequential changes. This way our model could learn the changes that occur gradually, and if any major change such as a natural disaster or conflict occurred in an area, the difference between the predicted value and the actual value would be much greater than normal. This would signify that something major had happened and that an alert should be sent to UNHCR.
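For illustration only (this approach was ultimately dropped for lack of data), a minimal Keras sketch of the CNN+LSTM idea might look like the following; the input shape and layer sizes are assumptions.

```python
from tensorflow.keras import layers, models

# A stack of monthly image tiles is encoded frame by frame by a small CNN,
# then an LSTM models the temporal sequence and predicts the next value.
timesteps, h, w, c = 12, 64, 64, 3  # assumed tile sequence shape

model = models.Sequential([
    layers.TimeDistributed(layers.Conv2D(16, 3, activation='relu'),
                           input_shape=(timesteps, h, w, c)),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(32),
    layers.Dense(1),  # e.g. a predicted vegetation/change score for the next step
])
model.compile(optimizer='adam', loss='mse')
model.summary()
```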

As is often the case in the real world, there was not enough data to apply deep learning. Therefore we looked for an alternative.

The solution: Less shiny algorithms

The problem of anomaly detection can be solved with both supervised and unsupervised learning techniques. Since the data was not labeled, we went with unsupervised techniques. Change detection can be approached using NDVI values, PCA analysis, image differencing methods, etc.

We went through some great methods for anomaly detection, including a split-based approach to unsupervised change detection [1], and comparing two images of the same geographical area at two different times pixel by pixel and then using algorithms such as thresholding or Bayes theory to generate a change map [2]. After doing some research, I finally went with the PCA + K-Means technique [3], as some of the previous methods either made a lot of assumptions or were applied directly to raw data, which could bring in a lot of noise.

 

The data

For this project we needed satellite data of regions of Somalia. The images can be downloaded either from the EarthExplorer website or via the Google Earth Engine API. You must ensure that the downloaded data has as little cloud coverage as possible; this is a common problem when working with satellite images.
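As an example, fetching a low-cloud Sentinel-2 scene with the Google Earth Engine Python API could look roughly like this; the point (near Mogadishu), the date range, and the cloud threshold are illustrative.

```python
import ee

ee.Initialize()  # assumes Earth Engine access is already set up

aoi = ee.Geometry.Point([45.32, 2.05]).buffer(10000)  # ~10 km around Mogadishu

collection = (ee.ImageCollection('COPERNICUS/S2')
              .filterBounds(aoi)
              .filterDate('2017-01-01', '2017-12-31')
              .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
              .sort('CLOUDY_PIXEL_PERCENTAGE'))

least_cloudy = collection.first()
print(least_cloudy.get('CLOUDY_PIXEL_PERCENTAGE').getInfo())
```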


Fig 3: EarthExplorer Image

 

 

The solution: Unsupervised Learning

 


Fig 4: Satellite Image of an area from Somalia. Here you can see a lot of vegetation and greenery

 


Fig 5: Satellite image of the same area at a different time. Here you can see that there is less vegetation than in the previous image (Fig 4).

 

Calculating the difference between both images

The difference between the two greyscale images was calculated through pixel-by-pixel subtraction. The computed values are such that pixels in areas associated with change have a much larger difference than those in unchanged areas.

Xd = |X1 – X2| where Xd is the absolute difference of the two image intensities.
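In code, the difference image is a one-liner; the file names below are hypothetical, and the two images must be co-registered and of equal size.

```python
import numpy as np
from PIL import Image

# Hypothetical file names for the two co-registered greyscale images.
x1 = np.asarray(Image.open('somalia_t1.png').convert('L'), dtype=np.float32)
x2 = np.asarray(Image.open('somalia_t2.png').convert('L'), dtype=np.float32)

diff = np.abs(x1 - x2)  # Xd = |X1 - X2|: large values where change occurred
```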


Fig 6: The difference image of the bi-temporal images shown earlier.

 

Principal Component Analysis

The next step was to create an eigenvector space using PCA. First, the difference image is split into h x h non-overlapping blocks, where h can be anything greater than 2; call this set of vectors Y. Applying PCA to these blocks builds an eigenvector space that helps correct for decorrelation caused by atmospheric noise or striping, and the resulting principal components can then be used for classification.

 

Creating a feature vector space

The next step was to create a feature vector space. A feature vector was constructed for each pixel of the difference image by projecting that pixel's neighborhood onto the eigenvector space. This was done by taking h x h overlapping blocks around each pixel, which preserves contextual information. We now have a clean, high-variance set of vectors that can be used for classification.

Clustering

This step involves generating two clusters from the feature vector space by applying K-Means: one cluster represents changed pixels and the other represents unchanged pixels. Each feature vector already carries the information of whether its pixel has changed: when there is a change between the two images in a region, the assumption is that the values of the difference vectors over that region will be higher than in other regions. K-Means therefore partitions the data into two clusters based on the distance between each pixel vector and the cluster means. Finally, the change map is constructed with higher pixel values over regions of change.
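A minimal end-to-end sketch of the difference-image → PCA → K-Means steps described above is given below; the block size h, the number of principal components, and the placeholder input are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def change_map(diff, h=4, n_components=3):
    """PCA + K-Means change detection on an absolute difference image."""
    H, W = diff.shape
    H, W = H - H % h, W - W % h          # crop so the image tiles evenly
    diff = diff[:H, :W]

    # 1. Non-overlapping h x h blocks define the eigenvector space.
    blocks = (diff.reshape(H // h, h, W // h, h)
                  .swapaxes(1, 2).reshape(-1, h * h))
    pca = PCA(n_components=n_components).fit(blocks)

    # 2. Overlapping h x h neighbourhood of every pixel, projected onto the
    #    eigenvectors, gives one contextual feature vector per pixel.
    pad = h // 2
    padded = np.pad(diff, pad, mode='reflect')
    feats = np.stack([padded[i:i + h, j:j + h].ravel()
                      for i in range(H) for j in range(W)])
    feats = pca.transform(feats)

    # 3. K-Means with two clusters; the cluster whose pixels have the larger
    #    mean difference is labelled as "changed".
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    labels = labels.reshape(H, W)
    changed = 0 if diff[labels == 0].mean() > diff[labels == 1].mean() else 1
    return (labels == changed).astype(np.uint8) * 255

# Placeholder input; in practice pass the Xd computed from the two images above.
demo_diff = np.abs(np.random.rand(128, 128) - np.random.rand(128, 128))
print(change_map(demo_diff).mean())
```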

 

Fig 7: The highlighted part depicts the difference between the two images. The image is flooded with white spots because there was a lot of vegetation loss between the two images.

 

The highlighted areas could be used to examine the extent of change over a continuous sequence of time and could therefore help UNHCR take necessary action. Loss of vegetation to the extent shown in Fig 7 would happen only when a sudden large conflict or natural disaster occurs, thus triggering an alarm.

 

Conclusion

In this project, we were able to develop an anomaly detection model using PCA and K-Means that highlights areas of change between satellite images of the same region, giving UNHCR an alert mechanism for examining where and how much the environment has changed.

Cloud coverage is a common problem when working with satellite images (see the bottom-left region of the image above), so some human intervention is still required; this remains an area for improvement.

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 
