Modeling Economic Well-being through AI, Satellite Imagery, and Census Data


This article was written by Harshita Chopra, along with collaborators Arpan Mishra, Precioso Gabrillo, and Raghunath Pal.

 

 

Economic well-being is a broad concept that goes beyond statistical metrics. When you plan on moving to another place, do you primarily check complex economic measures like the GDP of that region? When making such decisions, what matters to most people, in layman's terms, is the standard of living. The standard of living refers to the necessities, comforts, and luxuries a person is likely to enjoy, that is, the quantity and quality of their consumption. The fundamental reason for differences in the standard of living between regions is the difference in their levels of economic productivity.

Hence, it is important for nations to collect primary data that provides valuable information for planning and policy-making by governments, international agencies, scholars, business people, industrialists, and many others.

This data is usually collected through on-site surveys performed across vast areas. Families and individuals answer a long list of questions, which results in a huge database.

 

The Problem

These surveys are conducted over a period of a few years and involve huge manpower and expenditure.

The Indian Census 2011 cost INR 2,200 crores (USD 295 million).

There are also associated risks of data abuse and corruption. Moreover, the temporal variation of the factors affecting economic well-being makes it all the more difficult to compare the progress of regions.

Instead, we train AI models to learn features related to the changing agricultural and urban landscape, providing a better understanding of economic well-being.

World Resources Institute (WRI) is a global research organization that spans more than 60 countries and works towards turning big ideas into action at the nexus of environment, economic opportunity, and human well-being. 

 


 

WRI brought up an enlightening problem statement: creating a machine learning algorithm that can be used as a proxy for socio-economic well-being in India, using a remote sensing approach through satellite images.

To make this possible, Omdena brought together 40 AI engineers from 20+ countries to collaborate on this project. The aim was to create a prototype that can predict variables representing the standard of living of a place, particularly in data-poor regions. This remote approach would use AI and Computer Vision to extract latent features from satellite images that can help build a baseline model.

 

How we solved it

In this article, we're going to highlight one of the final delivered models based on Indian Census data. The aim of this model is to use satellite images to classify each region as having high, medium, or low economic well-being.

 

1. Preparing the Ground Truth

In anticipation of the upcoming Census 2021, WRI expressed a strong interest in working with census data, so that the model would be ready for the upcoming release.

The census is an official survey of the population that gathers socio-economic information about households in a specific region and time frame.

Our data team quickly scraped the Census 2011 website for district-level household data, which contains a number of features representing the condition of houses and the assets owned. We wrote a script that extracted these features for each of the 640 districts of India into a single CSV file.

We followed the methodology described in this research paper as a guide for our workflow. The census data were subdivided into groups and the features were reduced. We formulated these six variables according to existing research:

  • Fuel for Cooking
  • Main Source of Water
  • Main Source of Light
  • Condition of Household
  • Material of Roof
  • Assets Owned

All of these six variables had three categories —

  • RUD (rudimentary): Features that represent primitive methods such as using firewood, river water, poor house condition, grass roof, etc.
  • INT (intermediate): Features that represent medium-grade methods such as using kerosene for lighting, tubewell water, owning a liveable house, etc.
  • ADV (advanced): Features that represent better household conditions, such as owning a car, using electricity, tap water, etc.

After this division, we applied K-means clustering to identify three clusters corresponding to the above categories. Each cluster was visualized using a box plot to associate it with a level of economic well-being.
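For illustration, here is a minimal sketch of this clustering step with scikit-learn, assuming a district-level DataFrame with normalized columns such as water_rud, water_int, and water_adv (the file and column names are placeholders, not the project's actual schema):

```python
# Minimal sketch of the clustering step with scikit-learn.
import pandas as pd
from sklearn.cluster import KMeans

census = pd.read_csv("census_2011_districts.csv")        # hypothetical file
water_cols = ["water_rud", "water_int", "water_adv"]

km = KMeans(n_clusters=3, random_state=42, n_init=10)
census["water_cluster"] = km.fit_predict(census[water_cols])

# Inspect each cluster's mean profile to map clusters to Low/Medium/High,
# analogous to the box-plot inspection described above.
print(census.groupby("water_cluster")[water_cols].mean())
```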

 

Example: Three Clusters for Variable — Main Source of Water / Source: Omdena

 

In the above image, we see three clusters represented by three plots. Each one indicates whether it belongs to the Low, Medium, or High economic well-being class. Cluster 1 depicts 'High' (since water_adv is the highest), cluster 2 depicts 'Medium' (since water_int is the highest), and cluster 3 depicts 'Low' (since water_rud is the highest).

This was done for all 6 variables. After this pre-processing of Census data, our dataset looked like this:

 

 

Source: Omdena

 

 

2. Satellite Image Acquisition

After the ground truth was set up, we needed satellite images corresponding to those 640 districts. We had to rely on openly available satellite imagery, so we used Google Earth Engine to download these images.

Google Earth Engine provides datasets from various satellites. Since we used census data from 2011, we required district images corresponding to that year. Of the two popular satellites, Sentinel-2 (with data available from 2015) and Landsat 7 (with data available from 1999), we selected the Landsat 7 Tier 1 TOA Reflectance collection in order to acquire images from 2011.

Landsat 7 images have a 30 m/pixel resolution, which means that every pixel of the image covers 30 meters on Earth!

 

Jalgaon, Maharashtra / Source: Omdena

 

Next, we decided on the bands we would need in our satellite image. The images stored in our devices contain 3 bands (Red, Green, and Blue). However, satellite images are multi-banded in nature and may contain up to 12 bands. Not all of them would be useful for us, so we settled on Red, Green, Blue, NIR (Near InfraRed), and SWIR-1 (Shortwave InfraRed) bands.

The image of any particular region will vary depending on when it was taken, the cloud cover, the angle of the satellite, etc. Google Earth Engine allows us to filter the best images available for a region over our time period and then aggregate them into a single image composite. We downloaded 640 median-aggregated image composites, one for each district.
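As a rough illustration of this step, the sketch below builds a 2011 median Landsat 7 TOA composite with the Earth Engine Python API; the district asset name, filters, and band choices are assumptions for the example, not the project's exact export script:

```python
# Hedged sketch of a 2011 median composite export with the Earth Engine Python API.
import ee

ee.Initialize()

district = (ee.FeatureCollection("users/example/india_districts")   # hypothetical asset
            .filter(ee.Filter.eq("DISTRICT", "Jalgaon"))
            .geometry())

composite = (ee.ImageCollection("LANDSAT/LE07/C01/T1_TOA")
             .filterBounds(district)
             .filterDate("2011-01-01", "2011-12-31")
             .filter(ee.Filter.lt("CLOUD_COVER", 20))
             .select(["B3", "B2", "B1", "B4", "B5"])     # Red, Green, Blue, NIR, SWIR1
             .median()
             .clip(district))

task = ee.batch.Export.image.toDrive(image=composite, description="Jalgaon_2011",
                                     region=district, scale=30, maxPixels=1e9)
task.start()
```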

 

3. Creating Features from Images

Every raster that we downloaded contains 5 bands. We merged these bands in different ways to analyze the geographical features of that region.

For example, if we combine the Near Infrared and Red bands using the formula (NIR − RED) / (NIR + RED), we get a single-band image. We call this the NDVI of our image, which stands for Normalized Difference Vegetation Index.

The special thing about this image is that all the pixels with any shade of green are highlighted. This basically tells us where in the district there is a higher concentration of vegetation. Vegetation is also an indicator of economic well-being, so we can use the NDVI image as an input to our AI model.

 

Different indices highlight different features on the map. See below for more details on each index. Source: Omdena

 

Similarly, we can combine the SWIR band with the other bands to get other indices. Apart from the NDVI, we also calculated the NDBI (Normalized Difference Built-Up Index) and the NDWI (Normalized Difference Water Index). As the names suggest, NDBI highlights the concentration of built-up areas and NDWI highlights the water content, both of which could be indicators of social well-being.
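The indices themselves are simple band arithmetic. Below is a minimal sketch with rasterio and NumPy, assuming the downloaded composite stores its bands in the order Red, Green, Blue, NIR, SWIR1 (the file name and band order are assumptions):

```python
# Minimal sketch of computing NDVI, NDBI, and NDWI from a 5-band composite.
import numpy as np
import rasterio

def normalized_difference(a, b):
    return (a - b) / (a + b + 1e-10)   # small epsilon avoids division by zero

with rasterio.open("district_0001.tif") as src:          # hypothetical file
    red, green, blue, nir, swir1 = src.read().astype("float32")

ndvi = normalized_difference(nir, red)     # vegetation
ndbi = normalized_difference(swir1, nir)   # built-up area
ndwi = normalized_difference(green, nir)   # water content

index_stack = np.stack([ndvi, ndbi, ndwi])  # the 3-band "index image" used later
```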

 

4. Model Architecture

The inputs and expected outputs were ready! The task was to build an image classification model. Most of the available pre-trained models are for RGB images. However, satellite images contain multiple bands. So we decided to generate 3-band images using the extracted indices.

To be able to use transfer learning, as well as utilize all the features in our images, we came up with the following architecture:

A Multi-modal Multi-task Deep Learning Model — It inputs two images and outputs values for multiple variables.

 

The Multi-modal Multi-task Deep Learning Model. Source: Omdena

 

We take two inputs:

  • The first is an RGB image, which is just like any normal image we look at, containing the Red, Green, Blue bands.
  • The second is a combination of the NDVI, NDBI, and NDWI of our image.
    NDVI — Normalized Difference Vegetation Index
    NDBI — Normalized Difference Built-up Index
    NDWI — Normalized Difference Water Index
    This new image highlights a mixture of features corresponding to vegetation cover, built-up area, and water bodies of that region.

We then rescale the pixels to the 0–255 range so that the pre-trained models can be used with them.

Our data was divided such that all states were well represented: 80% of the districts of each state went to the training set and 20% to the test set. The model was evaluated with 10-fold cross-validation, so every district eventually received an out-of-fold prediction of its economic well-being.

Images are passed through a popular deep learning architecture, ResNet-18, combined with fully connected layers to get our desired outputs. The model predicts one of three classes (high, medium, low) for each of the indicators mentioned above. Hence, we solved a multi-modal, multi-task learning problem. The model achieved an overall accuracy close to 70%.
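The exact architecture is not reproduced here, but a hedged PyTorch sketch of a two-branch (multi-modal), multi-head (multi-task) ResNet-18 classifier along the lines described above could look like this; the layer sizes, fusion by concatenation, and the torchvision weights argument are assumptions:

```python
# Hedged sketch of a multi-modal, multi-task classifier with two ResNet-18 branches.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiModalMultiTask(nn.Module):
    def __init__(self, n_tasks=6, n_classes=3):
        super().__init__()
        self.rgb_branch = models.resnet18(weights="IMAGENET1K_V1")
        self.idx_branch = models.resnet18(weights="IMAGENET1K_V1")
        feat = self.rgb_branch.fc.in_features            # 512 for ResNet-18
        self.rgb_branch.fc = nn.Identity()
        self.idx_branch.fc = nn.Identity()
        # One 3-class head per census variable (fuel, water, light, ...).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * feat, n_classes) for _ in range(n_tasks)]
        )

    def forward(self, rgb, indices):
        fused = torch.cat([self.rgb_branch(rgb), self.idx_branch(indices)], dim=1)
        return [head(fused) for head in self.heads]       # one logit vector per task

model = MultiModalMultiTask()
rgb = torch.randn(4, 3, 224, 224)    # RGB composite batch
idx = torch.randn(4, 3, 224, 224)    # NDVI/NDBI/NDWI stack batch
outputs = model(rgb, idx)            # six tensors of shape (4, 3)
```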

 

Results and Insights

To analyze the overall development of a region based on the six different indicators, we defined an Overall Development Index (ODI) to judge the economic well-being of a region as a whole. The index score for each district ranged from 6 to 18 and was computed as follows:

Overall Development Index (ODI) = A1 + A2 + A3 + A4 + A5 + A6, where

Ax = 1 if indicator x is predicted "Low"
   = 2 if indicator x is predicted "Medium"
   = 3 if indicator x is predicted "High"
x: one of the six indicators above
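Computed in code, the ODI is a simple map-and-sum; the sketch below uses hypothetical indicator column names:

```python
# Sketch of the ODI computation; the indicator column names are hypothetical.
import pandas as pd

score = {"Low": 1, "Medium": 2, "High": 3}
indicators = ["fuel", "water", "light", "condition", "roof", "assets"]

preds = pd.DataFrame(
    [["High", "Medium", "High", "Low", "Medium", "High"]],
    columns=indicators, index=["Jalgaon"],
)
preds["ODI"] = preds[indicators].apply(lambda col: col.map(score)).sum(axis=1)
print(preds["ODI"])   # 3 + 2 + 3 + 1 + 2 + 3 = 14, inside the 6-18 range
```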

Ground Truth vs Model Predictions of Overall Development Index for Census 2011 / Source: Omdena

 

Driven by our curiosity and project interests, we also decided to dive deep into the data to uncover hidden statistics and actionable insights. Through exploratory data analysis and with the help of BI tools like Tableau and Google Data Studio, we created dashboards to visualize the data in different customizable views.

We discovered that the districts of India were almost evenly distributed in terms of High, Medium, and Low Overall Development.

 

Distribution of Districts by the Overall Development Index for each indicator of economic well-being. Source: Omdena

 

 

Conclusion

Satellite images can act as a great proxy for existing data collection techniques such as surveys and censuses when predicting the economic well-being of a region. They also make it possible to assess areas that are hard for surveyors to reach, for example, the rocky terrain of northeastern India, the Himalayas, or villages in the deserts.

The model is highly scalable and adaptable and can be trained on existing satellite imagery and surveys of other countries as well. It can help save much of the manpower and time that are a major challenge in existing development assessment initiatives. The prototype developed by our team in this eight-week challenge can be a springboard to a wider, more in-depth expansion of this machine learning tool for predicting economic well-being.

With rapid advancements in technology, possible future work can include using high-resolution images or other popular datasets such as the Demographic and Health Survey (DHS) or Living Standards Measurement Surveys (LSMS) as the ground truth. Future applications also include tracking urbanization along with vegetation cover over a period of time. This can reflect on how the socio-economic conditions of regions evolve along with changing environmental factors.

AI for Malaria Prevention: Identifying Water Bodies Through Satellite Imagery


By Tanmay Laud

 

We combined satellite images, topography data, population density, and other data sources to build an algorithm that identifies the areas in which stagnant water bodies (malaria mosquito breeding sites) are likely to exist. The model helps identify breeding sites more quickly and accurately.

You must have read the famous quote by Andrew Ng which highlights the importance of data in today's world. He says, "It's not who has the best algorithm that wins. It's who has the most data." The statement holds true in most data-driven applications; however, the required amount of data is not always available.

“Data is the new oil”- Clive Humby

If you come from the Kaggle world, then the problems of data sourcing might not be known to you. The online competitions begin with a rich corpus of data (that has been annotated and verified by large teams). But, at Omdena, our journey begins with the task of data acquisition. It is a challenging task, especially in the Social Good space, since not many tech giants are focusing on these problems, and as a result, data is not readily available for analysis and churning.

Thus, data collection from the right sources becomes a critical exercise in a machine learning project. That brings us to the Zzapp Malaria project, tackled by 50 collaborators from across the world with a common objective: to provide AI-driven mechanisms that detect water bodies prone to mosquito breeding in order to prevent malaria. Let me talk about how my team eventually built a productive dataset from an initially minimal one.

 

The Problem Statement

 


High-Level Objective (Omdena.com)

 

The project falls under the UN's Sustainable Development Goal 3, which aims to "end the epidemics of AIDS, tuberculosis, malaria and neglected tropical diseases" by the year 2030. Given a region, our task was to automatically identify areas containing water bodies. We achieved this by pre-surveying areas for potential mosquito-breeding water bodies using AI tools such as satellite imagery, topography analysis, and geo-referenced data, which allows for more cost-effective surveys in new areas.

As you might have realized by now, money has a big impact on such a project. To cover a large area like Ghana or Kenya, you need to direct your resources to the most susceptible regions in the most cost-efficient manner, and within a stipulated amount of time. Time is limited because the water bodies have to be treated before the wet season arrives and mosquito breeding rises.

The dataset that we received covered the Ghana and Amhara (Ethiopia) regions of Africa.

What’s interesting? The data did not come all at once.

Zzapp Malaria was still surveying these regions during this phase, so the data came in periodic batches. The majority of the data was sourced during the project itself, as it was the wet season in the above-mentioned areas. As an AI engineer or data scientist, you need to align your game plan to this flow of incoming information.

 


Highlighted Grids have a higher risk of containing water bodies (Omdena.com)

 

The dataset consisted of 3-meter resolution satellite images (each covering 100×100 meters) with labels indicating the number of natural and artificial water bodies in these regions, based on a survey conducted by Zzapp field workers. Each image corresponded to a 100×100 m MGRS (Military Grid Reference System) grid, so we had approximately 200 grids to start with and around 1,500 images by the end of the project.

 

Overcoming data challenges

We faced the following challenges in this project:

  • Lack of enough data. As explained earlier, the Zzapp data arrived in periodic batches. Also, since not many organizations are working on data collection or using artificial intelligence for malaria elimination, there was no pre-annotated dataset (at the time of this project) that one could download and get started with.
  • Lack of high-resolution data. The dataset had a 3-meter resolution, which is better than most satellite image sources but still not as detailed as a Google Maps image.
  • Imagery cannot always convey the presence (or probability) of water accumulation in an area. Consider, for example, water collected in a canal covered by a roof; this would be impossible to detect with satellite imagery.

 

How we did it

To solve the lack-of-data issue, we devised a two-step approach. First, we detected the presence of large water bodies (lakes, rivers, streams, ponds). This was achieved using state-of-the-art vision models like DeepWaterMap, which produces a probability map for a given grid. This was, in itself, a useful way to trace the surrounding regions of interest (humans tend to settle around large water bodies).

Next, we used the output of the above model as a proxy variable to further detect the risk of water accumulating in smaller cross-sections.

 


Provided images v/s Google Hybrid images (Omdena.com)

 

To compensate for the lack of resolution, we created a pipeline that extracts richer images from the Google Satellite Hybrid service for the grids given to us. You can see the difference in detail in the reference images. You might wonder: why not use super-resolution instead? Because super-resolution could introduce variations that deviate from the original ground truth.

Further, since these images alone cannot comprehensively convey water presence, we created more features using population density, vegetation indices, topography, and landcover classifications. Let’s look at each of these factors briefly.

 

1. Population Density 

 


Population density v/s Land distance to water (Omdena.com)

 

Research suggests that as the population density in a given region rises, the land distance to water decreases. We leveraged this relationship to interpret the risk of mosquito breeding grounds based on how densely populated a region is. The graph roughly highlights this inverse proportionality.

 

2. Vegetation Indices and Height Above Nearest Drainage (HAND)

 

 


Vegetation masks calculated over the region of Ghana and Amhara (Omdena.com)

 

A Vegetation Index (VI) is a spectral transformation of two or more bands designed to enhance the contribution of vegetation properties and allow reliable spatial and temporal inter-comparisons of terrestrial photosynthetic activity and canopy structural variations. Dual polarised (VV and VH) Sentinel-1 Ground Range Detected (GRD) scenes were acquired from Google Earth Engine (https://earthengine.google.com/). All scenes were pre-processed using the following steps:

  • Thermal noise removal
  • Radiometric calibration
  • Terrain correction

The HAND data was also exported using Earth Engine. This was used to help eliminate false positives located above the drainage line.
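A hedged sketch of how such data can be pulled with the Earth Engine Python API is shown below; the bounding box, date range, and dataset IDs (including the MERIT Hydro HAND band) are assumptions for illustration, not the project's exact sources:

```python
# Hedged sketch of acquiring Sentinel-1 GRD scenes and a HAND layer from Earth Engine.
import ee

ee.Initialize()

region = ee.Geometry.Rectangle([-0.5, 5.4, 0.2, 6.0])    # rough box over southern Ghana

s1 = (ee.ImageCollection("COPERNICUS/S1_GRD")
      .filterBounds(region)
      .filterDate("2020-04-01", "2020-09-30")
      .filter(ee.Filter.eq("instrumentMode", "IW"))
      .filter(ee.Filter.listContains("transmitterReceiverPolarisation", "VV"))
      .filter(ee.Filter.listContains("transmitterReceiverPolarisation", "VH"))
      .select(["VV", "VH"])
      .median())

hand = ee.Image("MERIT/Hydro/v1_0_1").select("hnd")       # height above nearest drainage

ee.batch.Export.image.toDrive(image=s1.addBands(hand), description="s1_hand_ghana",
                              region=region, scale=30, maxPixels=1e9).start()
```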

 

3. Landcover Classification

 


Landcover classification labels sourced from the LandCoverNet dataset (Omdena.com)

 

In order to gain information about the terrain, we used a labeled dataset that was specifically released for the African subcontinent. LandCoverNet is a labeled global land cover classification dataset based on Sentinel-2 data. Version 1 of the dataset contains data across the entire African continent. The dataset is labeled on a pixel-by-pixel basis where each pixel is identified as one of the 10 different land cover classes: “trees cover areas”, “shrubs cover areas”, “grassland”, “cropland”, “vegetation aquatic or regularly flooded”, “lichen and mosses / sparse vegetation”, “bare areas”, “built-up areas”, “snow and/or ice or clouds” and “open water”.

 

4. Topography

 


Digital Elevation Model data for Ghana and Amhara region (Omdena.com)

 

All the topographic features were calculated using SRTM v3 DEM (Digital Elevation Model) data. We used the SAGA API to pre-process the DEM dataset and generate topographic features. The DEMs were smoothed to fill in isolated elevation pits (or spikes), which typically represent errors or areas of internal drainage that interrupt the estimation of water flow. Then 17 topographic features were generated from the pre-processed elevation TIFF, including:

  • Relative Slope Position
  • Topographic Wetness Index
  • Topographic Position Index (tpi500)
  • Channel Network Distance
  • Convergence Index
  • LS Factor

After generating raster datasets for the above topographic features, these features were projected onto the polygons of interest (positive and negative scan chunks) in Ghana and Amhara. The mean, max, and min of all the pixel values within a given grid were calculated for all of the above features to aggregate them at the MGRS grid level.
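One common way to do this kind of per-grid aggregation is with rasterstats zonal statistics; the sketch below is illustrative, the file names are placeholders, and the Topographic Wetness Index stands in for any of the topographic rasters:

```python
# Illustrative per-grid aggregation with rasterstats zonal statistics.
import geopandas as gpd
from rasterstats import zonal_stats

grids = gpd.read_file("mgrs_grids_ghana.geojson")         # 100 m x 100 m polygons
stats = zonal_stats(grids, "twi_ghana.tif",
                    stats=["mean", "max", "min"], nodata=-9999)

for name in ("mean", "max", "min"):
    grids[f"twi_{name}"] = [s[name] for s in stats]

grids.to_file("mgrs_grids_with_twi.geojson", driver="GeoJSON")
```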

The topographic features were instrumental in detecting natural sources of water (both large and small) with a high AUC, as is evident below:

 


Actual v/s Predicted Labels for water bodies in Ghana Region (Omdena.com)

 

Bringing it all together

Using the aforementioned data sources, we ended up generating 81 features, and after a round of exploratory data analysis, we narrowed these down to the 20 most relevant ones. We then set out to build and validate ensemble models that could best capture the information in each of the data sources. This allowed us to detect both natural and artificial sources of water with a high degree of recall. Higher recall was preferred since capturing all water sources was more important than occasionally labeling some regions as having water when they did not. The data flow diagram below highlights this effort.

 


Data Flow Diagram (Omdena.com)

 

 

AI For Financial Inclusion: Credit Scoring for Banking the Unbankable


Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.

 

 

The background

With traditional credit scoring systems, it is essential to have a bank account and regular transactions. But there are groups of people, especially in developing nations, who still do not have a bank account for a variety of reasons: some do not see the need for it, some are unable to produce the necessary documents, for some the cost of opening an account is too high, some lack awareness or knowledge about opening accounts, some have trust issues, and some are unemployed.

Some of these individuals may need loans for essentials, perhaps to start a business, or, like farmers, to buy fertilizers or seeds. While many of them may be reliable borrowers, because they do not get access to funding they are pushed to take out high-cost loans from non-traditional, often predatory lenders.

Low-income individuals often have an aptitude for managing their personal finances, and we need an ethical AI credit scoring system to help these borrowers and keep them from falling into deeper debt.

Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.

 

The problem statement

 

The goal was to determine the creditworthiness of an unbanked customer using alternative and traditional credit scoring data and methods. The data was focused on Indonesia, but the following approach is applicable to other countries.

It was a challenging project. I believe everyone should be eligible for a loan for essential business ventures, but they should be able to pay it back without facing exorbitant interest rates. Finding that balance was crucial for our project.

 

The data

Three datasets were given to us:

1) Transactions

Information on transactions made by different account numbers, the region, mode of transaction, etc.

 

2) Per capita income per area

All the data is privacy law compliant.

 

3) Job title of the account numbers

All data given to us was anonymous as privacy was imperative and not an afterthought.

Going through the data we understood we had to use unsupervised learning since the data was not labeled.

Some of us compared openly available datasets to the dataset we had at hand, and some of us started working on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with the silhouette score, a heuristic for judging whether the features we had would produce well-separated clusters. The best value is 1 (well-separated clusters) and the worst is -1 (strongly overlapping ones). We got average values close to 0, which was not satisfactory.
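For reference, this is roughly how such a silhouette check looks with scikit-learn; the random matrix here simply stands in for the engineered customer features:

```python
# Rough illustration of the silhouette check with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))

for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # values near 0 => overlapping clusters
```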

 

Feature engineering

With the given data, we performed feature engineering. We calculated a per capita income score and segregated management roles from other roles. The per capita income score lets us bucket accounts by area so we can identify areas likely to contain reliable customers; for example, management roles imply a better income and therefore a better ability to pay back a loan.

 

 

 

But even with all the feature engineering, we were unable to get a signal from the data given for clustering. How did we proceed?

We scraped data online from sites like Indeed and Numbeo. Because of these challenges, we were not able to give the customer a single solution and had to improvise a plan for future analysis, so we used dummy data.

From Numbeo we obtained the cost of living per area, i.e. how much people spend on living. From Indeed we obtained salary data to assign an average salary to each job title.

 

 

 

With the data scraped online and the features engineered from the given dataset, we tried to figure out whether we could get a useful prediction from clustering algorithms.

 

The solutions

  • Engineered Features & Clusters (from Datasets given)
  • Machine Learning Pipelines/Toolkit (for Datasets not provided)
  • Unsupervised Learning Pipeline
  • Supervised Learning Pipeline (TPOT/auto-sklearn)

 

1. Engineered features & clusters

 

 

As mentioned above, with the context gathered from Creedix, we engineered and aggregated many features from the transaction time series dataset. Although these features describe each customer better, we can only guess the importance of each feature with respect to a customer's credit score based on our research. So we consolidated features for each customer based on our research on credit scoring. The importance of each feature for credit scoring in Indonesia is for the Creedix team to decide.

For example,

CreditScore = 7*Salary + 0.5*Zakat + 4000*Feature1 + … + 5000*Feature6

The solutions given to Creedix covered both supervised and unsupervised learning. Even after all the feature engineering and the data found online, we were still getting a low silhouette score, signifying overlapping clusters.

So we decided to provide solutions for supervised learning using AutoML as well as unsupervised learning, both using dummy data; the purpose was to serve future analysis and modeling by the Creedix team.

The dataset we used for Supervised Learning — https://www.kaggle.com/c/GiveMeSomeCredit/data

For supervised learning, we built models with both TPOT and auto-sklearn. This was done so that, once Creedix has more features and target variables available (data accessible to them but not to Omdena collaborators), they can reuse these pipelines to build their own models.

 

2. The model pipeline for Supervised Learning

Our idea was to create a script that can take any dataset and automatically search for the best algorithm by iterating through classifiers/regressors and hyperparameters based on user-defined metrics.

Our initial approach was to code this from scratch, iterating over individual algorithms from packages such as scikit-learn, XGBoost, and LightGBM, but then we came across AutoML packages that already do what we wanted to build. Thus, we decided to use those readily available packages instead and not spend time reinventing the wheel.

We used two different AutoML packages: TPOT and auto-sklearn. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.

 

 

Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning. It leverages recent advances:

  • Bayesian optimization
  • Meta-learning
  • Ensemble construction

TPOT and auto-sklearn are similar, but TPOT stands out due to its reproducibility: it generates both the model and the Python script needed to reproduce it.
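A minimal sketch of such a TPOT run on the public Give Me Some Credit data is shown below; the tiny search budget and the dropna() preprocessing are simplifications for illustration:

```python
# Minimal sketch of a TPOT run on the Give Me Some Credit dataset
# (https://www.kaggle.com/c/GiveMeSomeCredit/data).
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

df = pd.read_csv("cs-training.csv", index_col=0).dropna()
y = df.pop("SeriousDlqin2yrs")
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, scoring="roc_auc",
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_credit_pipeline.py")   # writes a reproducible scikit-learn pipeline script
```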

 

3. Unsupervised Learning

In the beginning, we used agglomerative clustering (a form of hierarchical clustering), since the preprocessed dataset contained a mix of continuous and categorical variables. As we had generated many features from the dataset (some of them very similar, based on small variations in their definitions), we first had to eliminate most of the correlated ones; without this, the algorithm would struggle to find the optimal number of groupings. After this task, we were left with the following groups of features:

  • count of transactions per month (cpma),
  • average increase/decrease in value of specific transactions (delta),
  • average monthly specific transaction amount (monthly amount),

and three single specific features:

  • Is Management — assumed managerial role,
  • Potential overspend — a value estimating the assumed monthly salary versus the expenses found in the dataset,
  • Spend compare — how a customer's spending (including cash withdrawals) differs from the average spending of similar job titles.

 

In a range of potential clusters from 2 to 14, the average silhouette score was best with 8 clusters — 0.1027. The customer data was sliced into 2 large groups and 6 much smaller ones, which was what we were looking for (smaller groups could be considered anomalous):

 

 

Still, this was not a satisfactory result. On practical grounds, describing clusters 3 to 8 proved challenging, which is consistent with the relatively low clustering score.

 

 

It has to be remembered that the prime reason for clustering was to find reasonably small and describable anomalous groupings of customers.

We therefore decided to apply an algorithm that handles outliers within a dataset efficiently: DBSCAN. Since the silhouette score is best suited to convex clusters and DBSCAN is known to return complex non-convex clusters, we forwent calculating any clustering scores and focused on analyzing the clusters returned by the algorithm.

Manipulating the parameters of DBSCAN, we found the clustering effects were stable — the clusters contained similar counts, and customers did not traverse between non-anomalous and anomalous clusters.
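For illustration, a minimal DBSCAN run with scikit-learn looks like the sketch below; the eps and min_samples values are placeholders, and the random matrix stands in for the engineered customer features:

```python
# Illustrative DBSCAN run with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 9)))

labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique, counts)))   # label -1 marks the points DBSCAN treats as outliers
```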

Analyzing the resulting clusters, we also found it easier to describe the qualities of each one, for example:

  • one small group, contrary to most, had no purchase or payment transactions and no cash withdrawals, but a few relatively high transfers through the mobile channel;
  • another small group also had no purchase or payment transactions but did make cash withdrawals;
  • yet another small group had the highest zakat payments (for religious causes) and a high number of mobile transactions per month;
  • the group considered anomalous (the cluster coded -1), with over 300 customers, differentiated itself by falling values across most types of transactions (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.

 

 

It is important to note that for various other sets of features within the data provided, both the hierarchical and DBSCAN methods returned even better clustering scores. However, at this level of anonymity (i.e. without ground truth information), one cannot decide on the best split of customers. It might transpire that a different set of features best splits the customers and yields better entropy scores for these groups when calculated against the creditworthiness category.

Ethical AI Building Blocks: The Interdependence of Emotional & Artificial Intelligence


By Jake Carey-Rand

 

One of my favorite quotes at the moment is from Max Tegmark, MIT professor and author of ‘Life 3.0: Being Human in the Age of Artificial Intelligence’. Tegmark talks about avoiding “this silly, carbon-chauvinism idea that you can only be smart if you’re made of meat” in reference to a more inclusive definition of intelligence to include artificial as well as biological intelligence. I’d like to double down on the requirement for an even more inclusive definition of intelligence – or rather, a more inclusive approach to artificial intelligence (AI). An approach where the emphasis is on diversity and collaboration, for meat lovers, vegans, and robots alike.

Outside the tech biosphere, reservations are often expressed about AI. These moral questions can run even deeper for some of us within the AI sector. Fear that AI will put humans out of a job or learn to wage war against humanity is bounced around the social interwebs at will. But ask a machine learning engineer how the AI she’s been developing actually does what it does, and most often you are met by a bit of a shrug of the shoulders beyond a certain point in the process. The truth is, advanced AI is still a bit of a mystery to us mere humans – even the really smart machine learning humans.

Armed with this context, I won’t argue there aren’t potential downsides. AI is built by people. People decide what data goes into the model. People build models. People train the models and ultimately people decide how to productionize the models and integrate them into a broader workflow or business.

Because all of this is (for the moment) directed by people, it means we have choices. Up to a point – we have a choice about how we create AI, what its tasks are, and ultimately the path we direct it to take. The implications of these choices are crystal clear now more than ever. The power of AI to create a better, healthier and arguably more equitable world is tangible and occurring at a very rapid pace. But so is the dark alternative – people have a choice to create models which spread Fear, Uncertainty, and Doubt to hack an election or to steal money.

AI is a tool like any other… well, almost.

 

Beyond The Tech

The pursuit of ‘AI nirvana’ is thought by some to be a pipedream cluttered with wasted money and resources along the path to mediocre success. Others share a view that AI at-scale is something reserved only for the FAANG companies (plus Microsoft, Uber, etc.). Without diving into the technicalities of data science and machine learning too deeply, the reality is that organizations are still struggling to capture the value of their data with any corresponding models they build. In fact, 87% of data science projects fail to deliver anything of value in production to the business. Challenges I hear time and again from customers, friends and colleagues include:

  • Competing or out of sync business silos
  • Lack of cohesion around a data strategy
  • Data in various formats and locations
  • Lack of clear objectives within the context of broader business transformation

 

The Importance of Soft Skills and Collaboration

Critically, some of the most important characteristics of data science success relate to soft skill development – those which make us uniquely human. Yes, we need great programmers, data wranglers, architects, and analysts for everything from data archeology to model training. But it is just as important (I would argue now more important) to curate emotional intelligence if you want to succeed with artificial intelligence. The success of an organization is now judged more heavily based on its ability to build and maintain Cultural Empathy, Critical Thinking, Problem Solving, and Agile Initiatives. Importantly, these skills also lead to a more natural ability to link data science investment directly to organizational (and social) value.

In other words, instilling a culture of diversity, inclusion, and collaboration is integral to AI and ultimately business success. As an organizational psychologist and professor, Tomas Chamorro-Premuzic said in a 2017 Harvard Business Review article, “No matter how diverse the workforce is, and regardless of what type of diversity we examine, diversity will not enhance creativity unless there is a culture of sharing knowledge.” Collaboration is key.

 

Remove Bias and Enhance Creativity

Out of all the soft skills, the need for an unbiased and collaborative approach to AI is probably the most important thing we can do to more positively impact AI development. Omdena has quickly become the world leader in Collaborative AI, demonstrating rapid success in solving some of the world's toughest problems. Experts discuss AI bias at length, but remember that humans create AI. We are not perfect and we certainly are not all-knowing. Imagine if all AI were produced by programmers in Silicon Valley. Even they would agree that a model to predict landslides based on drought patterns from satellite imagery in Southeast Asia would be better built in collaboration with people local to the problem who also understand the farming and economics relevant to the region. Likewise, a model built to analyze mortgage default risk based on social sentiment analysis and financial data mining needs to be built by a diverse, collaborative team. As recent history is teaching us, decisions made by the few expand to elevate systemic division and privilege.

Jack Ma, the world’s wealthiest teacher, said in an address to Hong Kong graduates, ‘Everything we taught our kids over the past 200 years, machines will do better in the future. Educators should teach what machines are not capable of, such as creativity and independent thinking.’

My hope is that schools are adapting to this change, along with all the other changes they must now manage. But most corporate teams have some catching up to do to ensure AI adoption is not only successful but considered a success for all. Let's start by encouraging a broad, diverse, and collaborative approach to AI. As Tegmark says, "Let's Build AI that Empowers Us".

 

Jake Carey-Rand is a technology executive with nearly 20 years of experience across AI, big data, Internet delivery, and web security. Jake recently joined Omdena as an advisor, to help scale the AI social enterprise.

Omdena is the company “Building Real-World AI Solutions, Collaboratively.” I’ve been watching the impact Omdena and its community of 1,200+ data scientists, from more than 82 countries (we call them Changemakers) have been doing over the last 12 months. Their ability to solve absolutely critical issues around the world has been inspiring. It has also led to some questions about how these Changemakers have been able to do what so many organizations fail to do time and time again – create real-world AI solutions in such a short amount of time. This has inspired us to explore how we could scale this engine of AIForGood even faster. The Omdena platform can be leveraged by enterprises who, especially during these challenging times, have to accelerate, adapt, and transform their approach to “business as usual” through a more collaborative approach to AI.

Estimating Street Safeness after an Earthquake with Computer Vision And Route Planning


Is it possible to estimate with minimum expert knowledge if your street will be safer than others when an earthquake occurs?

 

We answered how to estimate the safest route after an earthquake with computer vision and route management.

 

The problem

The last devastating earthquake in Turkey occurred in 1999 (>7 on the Richter scale), around 150–200 kilometers from Istanbul. Scientists believe that this time the earthquake will strike directly in the city and that the magnitude will be similar.

The main motivation behind this AI project, hosted by Impacthub Istanbul, is to optimize earthquake aftermath management with AI and route planning.

 

Children need their parents!

After kicking off the project and brainstorming with the hosts, collaborators, and the Omdena team about how to better prepare the city of Istanbul for an upcoming disaster, we spotted a problem that is quite simple but really important for families: getting reunited as soon as possible in the aftermath of an earthquake!

Our target was to provide safe and fast route planning for families, considering not only travel time but also broken bridges, falling debris, and other obstacles usually found in these scenarios.

 

Fatih, one of the most popular and crowded districts in Istanbul. Source: Mapbox API

 

 

We settled on two tasks: creating a risk heatmap that would depict how dangerous a particular area on the map is, and a path-finding algorithm providing the safest and shortest path from A to B. The latter would rely on the heatmap to estimate safeness.

The challenge started: deep learning for earthquake management using computer vision and route planning.

 

Source: Unsplash @loic

 

At this point, we optimistically trusted open data to address our problem. However, we soon realized that data describing building quality, soil composition, and pre- and post-disaster imagery were complex to model and integrate, when they could be found at all.

Bridges over streets, building heights, a thousand types of soil, and, eventually, the interactions among all of them... too many factors to control! So we focused on delivering an approximation instead.

 

Computer Vision and Deep Learning is the answer for Earthquake management

The question was: how do we estimate street safeness during an earthquake in Istanbul without such a myriad of data? What if we could roughly estimate path safeness by using distance-to-buildings as a safety proxy? The farther away the buildings, the safer the pathway.

For that crazy idea, we first needed building footprints laid out on the map. Some people suggested borrowing building footprints from OpenStreetMap, one of the most popular open-source map providers. However, we soon noticed that OpenStreetMap, though quite complete, has some blank areas in terms of the building metadata relevant to our task. Footprints were also sometimes inaccurately placed on the map.

 

Haznedar area (Istanbul). Source: Satellite image from Google Maps.

 

Haznedar area too, but few footprints are shown. Blue boxes depict building footprints. Source: OpenStreetMap.

 

Missing building footprints were a big problem, and computer vision came to the rescue! Using deep learning, we could rely on satellite imagery to detect buildings and then estimate how close pathways are to them.

The next stone on the road was obtaining high-resolution imagery of Istanbul, with enough resolution to allow an ML model to locate building footprints on the map the way a visually capable human would. Likewise, we also needed annotated footprints on these images so that our model could train properly.

 

 

First step: Building a detection model with PyTorch and fast.ai

 

SpaceNet dataset covering the area for Rio de Janeiro. Source: https://spacenetchallenge.github.io/

 

Instead of labeling hundreds of square meters manually, we relied on SpaceNet (in particular, the images for Rio de Janeiro) as our annotated data provider. This dataset contains high-resolution satellite images and building footprints, nicely pre-processed and organized, which were used in a recent competition.

The modeling phase was really smooth thanks to fast.ai software.

We used a Dynamic U-Net model with an ImageNet pre-trained ResNet-34 encoder as a starting point. This setup uses many advanced deep learning techniques by default, such as the one-cycle learning rate schedule and the AdamW optimizer.

All these fancy advances in just a few lines of code.

 

fastai fancy plot advising you about learning rates.

 

We set up a balanced combination of Focal Loss and Dice Loss, with accuracy and the Dice coefficient as evaluation metrics. After several training rounds with the encoder frozen and then unfrozen, we came up with good-enough predictions for the next step.
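A hedged sketch of this training setup, assuming a recent fastai release and a folder of SpaceNet image/mask tiles (the paths and mask-naming convention are assumptions), might look like this:

```python
# Hedged sketch of segmentation training with fastai; paths are hypothetical.
from fastai.vision.all import *

path = Path("spacenet_rio")                     # hypothetical data folder
codes = ["background", "building"]

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path / "images"),
    label_func=lambda f: path / "masks" / f.name,   # mask naming is an assumption
    codes=codes, item_tfms=Resize(256),
)

# Rough combination of Focal and Dice losses, as described above.
class FocalDiceLoss:
    def __init__(self, axis=1):
        self.focal = FocalLossFlat(axis=axis)
        self.dice = DiceLoss(axis=axis)
    def __call__(self, pred, targ):
        return self.focal(pred, targ) + self.dice(pred, targ)

learn = unet_learner(dls, resnet34, loss_func=FocalDiceLoss(), metrics=[Dice()])
learn.fit_one_cycle(5, 1e-3)                    # encoder frozen
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-5, 1e-4))       # fine-tune the whole network
```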

For more information about working with geospatial data and tools with fast.ai, please refer to [1].

 

 

Where is my high-res imagery? Collecting Istanbul imagery for prediction.

Finding high-resolution imagery was the key to our model and at the same time a humongous stone hindering our path to victory.

For the training stage, it was easy to avoid the manual annotation and data collection process thanks to SpaceNet, yet for prediction, obtaining high-resolution imagery of Istanbul was the only way.

 

Mapbox sexy logo

 

Thankfully, we stumbled upon Mapbox and its easy, almost-free download API, which provides high-resolution slippy map tiles all over the world at different zoom levels. Slippy map tiles are 256 × 256 pixel files described by x, y, z coordinates, where x and y represent 2D coordinates in the Mercator projection and z is the zoom level applied to the globe. We chose zoom level 18, where each pixel corresponds to roughly 0.596 meters on the ground.
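The zoom-level choice follows from the standard Web Mercator resolution formula for 256-pixel tiles; the short sketch below reproduces the ~0.596 m/pixel figure quoted above:

```python
# Web Mercator ground resolution: ~156543 m / 2^zoom at the equator (256-px tiles).
import math

def metres_per_pixel(zoom: int, lat_deg: float = 0.0) -> float:
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

print(round(metres_per_pixel(18), 3))        # ~0.597 m, matching the figure above
print(round(metres_per_pixel(18, 41.0), 3))  # slightly finer at Istanbul's latitude
```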

 

Slippy map tiles on the Mercator projection (zoom level 2). Source: http://troybrant.net/blog/2010/01/mkmapview-and-zoom-levels-a-visual-guide/

 

As they mentioned on their webpage, they have a generous free tier that allows you to download up to 750,000 raster tiles a month for free. Enough for us as we wanted to grab tiles for a couple of districts.

 

Slippy raster tile at zoom level 18 (Fatih, Istanbul).

 

 

Time to predict: Create a mosaic like your favorite painter

Once all required tiles were stealing space from my Google Drive, it was time to switch on our deep learning model and generate prediction footprints for each tile.

 

Model’s prediction for some tile in Rio: sometimes predictions looked better than actual footprints.

 

Then, we geo-referenced the tiles by translating the Mercator tile coordinates into latitude-longitude pairs (the ones used by mighty explorers). Geo-referencing the tiles was a required step to create our prediction piece of art with the GDAL software.

 
Python snippet to translate from Mercator coordinates to latitude and longitude.
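The snippet itself did not survive extraction, so here is a hedged reconstruction of the standard conversion from slippy-tile (x, y, zoom) indices to the latitude/longitude of the tile's north-west corner (the example tile indices are approximate):

```python
# Hedged reconstruction of the slippy-tile to latitude/longitude conversion.
import math

def tile_to_latlon(x: int, y: int, zoom: int):
    n = 2 ** zoom
    lon_deg = x / n * 360.0 - 180.0
    lat_rad = math.atan(math.sinh(math.pi * (1 - 2 * y / n)))
    return math.degrees(lat_rad), lon_deg

print(tile_to_latlon(152146, 98262, 18))   # roughly over Fatih, Istanbul
```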

Concretely, the gdal_merge.py command allows us to glue tiles together using the geo-coordinates embedded in the TIFF images. After some math and computing time... voilà! Our high-resolution prediction map for the district is ready.

 

Raw predictions overlaid on Fatih. From a lower degree of building presence confidence (blue) to higher (yellow).

 

 

Inverse distance heatmap

Ok, I see my house but should I go through this street?

Building detection was not enough for our task. We needed to determine the distance from a given position on the map to the closest building, so that a person at that position could know how safe it would be to cross that street. The larger the distance, the safer, remember?

The path-finding team would overlay the heatmap below on their graph-based schema, and by intersecting graph edges (streets) with heatmap pixels (user positions), they could calculate the average distance over the pixels on each edge, thus obtaining a safeness estimate for each street. This would be our input when finding the best A-B path.

 

Distance-to-buildings heatmap in meters. Each pixel represents the distance from each point to the closest building predicted by our model. Blue means danger, yellow-green safeness.

 

But how do we produce this picture from the raw prediction map? Clue: computing pixel-to-building distances for each tile independently is sub-optimal (too narrow a view), whereas the same computation on the entire mosaic is extremely expensive (3.5M pixels multiplied by thousands of buildings).

Working directly on the mosaic with a sliding window was the answer. For each pixel (x, y), a square window spanning (x-pad, y-pad) to (x+pad, y+pad) in the original raster is created, where pad defines the window size in pixels.

 

Pixel-wise distance computation. Orange is the point, blue is the closest building around. Side length = 100 pixels.

 

If a pixel belongs to a building, the distance is zero. If not, we return the minimum Euclidean distance from the center pixel to the building pixels within the window. This process, along with NumPy optimizations, was the key to mitigating the quadratic complexity of this computation.

Repeat the process for each pixel and the safeness map comes up.
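A simplified NumPy sketch of this sliding-window computation is shown below; the window handling and the distance cap for empty windows are assumptions made for brevity:

```python
# Simplified sketch: `mask` is a 2D boolean array where True marks building pixels.
import numpy as np

def distance_to_nearest_building(mask: np.ndarray, pad: int = 50) -> np.ndarray:
    h, w = mask.shape
    dist = np.zeros((h, w), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            if mask[y, x]:
                continue                      # building pixel: distance stays 0
            y0, y1 = max(0, y - pad), min(h, y + pad + 1)
            x0, x1 = max(0, x - pad), min(w, x + pad + 1)
            ys, xs = np.nonzero(mask[y0:y1, x0:x1])
            if len(ys) == 0:
                dist[y, x] = pad              # no building inside the window
            else:
                dist[y, x] = np.hypot(ys + y0 - y, xs + x0 - x).min()
    return dist

demo = np.zeros((64, 64), dtype=bool)
demo[30:34, 30:34] = True                     # a single toy "building"
print(distance_to_nearest_building(demo)[0, 0])
```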

 

Distance heatmap overlaid on the satellite image. Blue means danger, yellow-green safeness.

 

 

More about Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

 
