A step-by-step guide to building a machine learning model that supports startup investment strategies, using SHAP (SHapley Additive exPlanations) values to understand how each feature contributes to the model and how important each factor is to a startup's success.

 

Author: Bima Putra Pratama

 

Imagine you have already created a robust machine learning model. Then someone asks you how each input feature contributes to the model output. This is a common scenario in which we need to explain our model. However, it can be challenging, especially with more complex machine learning models. So, how can we explain our machine learning model easily?

This project was part of an Omdena challenge in which we had the opportunity to work with many startups. Our partners' main focus was to find a way of maximizing job and economic impact through startup investment, and we worked together toward that goal in this project.

 

Problem and Objectives

In this project, we identified two main problems related to startup investment that we wanted to focus on: the lack of research, and the question of how to predict the success of a startup and boost investors' confidence to start investing.

  • Research: While there is research on the purely financial factors in startup success, research on early-stage startups (especially in the first two years after founding) is sparse. This is because startups generally won't have much financial information available, or have not generated large enough sums of money for financial data to be of much value in evaluating success. While there is some qualitative research data available, this data is not particularly suitable for investors looking to make wise decisions.
  • Startup Success Prediction & Investor Confidence: The market for environmentally-friendly investments is rapidly growing. Unfortunately, many of the wealthiest investors are often reluctant to invest in startups, especially impact startups, due to the lack of relevant research available, the higher risks involved, and the lack of business history that can predict future success. Investing in startups is widely understood to be risky and not an optimal asset management strategy. Yet plenty of startups with tremendous potential may lose out on funding and success because of this bias and the lack of insight, trust, and understanding of startup investing. If a project doesn't already have funding, it becomes difficult to raise new funding.

 

To address those problems, we need to find a way to help investors decide which startups have a high potential for success and can make a significant impact. Thus, our goal is to help our partners create a machine learning model that identifies high-potential startups and to understand the contribution of each input factor to the model results.

 

The Solution

In this two-month challenge, we collaborated to build and deploy a machine learning model based on historical investments.

Startup investment predictions

Solution Approach. Source: Omdena

 

We began by understanding what the problems were and how we could address them. Then our team collected data from various sources. From that data, the collaborators performed data exploration and data transformation to prepare for modeling. Several machine learning techniques were tried, and the best-performing model was deployed using the Mia platform. Read more here about the data preparation and dataset building process and how the end app was deployed.

This article focuses on one part of the process, the modeling. Here we walk through a step-by-step guide to creating the machine learning model itself and using SHAP (SHapley Additive exPlanations) to understand the contribution of each feature to the model. To learn more about SHAP values, you can refer to this GitHub page.

 

Step 1 – Importing Libraries

First, let's import the required libraries. In this project, we use various popular libraries like pandas, NumPy, scikit-learn, xgboost, and shap.

import pandas as pd
import numpy as np
import xgboost
import shap
import seaborn as sns
import matplotlib.pyplot as plt
import pycaret
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import model_selection
# SMOTE (from the imbalanced-learn package) is used in Step 3 to balance the classes
from imblearn.over_sampling import SMOTE

 

Step 2 – Preparing the Dataset

The dataset we worked on contains information about various startups from 113 countries, across industries and in different business stages.

This is the list of data points available in the dataset:

  • Company_id: Unique identifier of the startup
  • Country: Country of the startup
  • Industry: Company industry type
  • Business_stage: Stage of the business
  • Incorporated: Whether the startup is incorporated
  • Incorporation_date: Year of incorporation
  • Number of team members: Total number of team members
  • Accelerator: Whether the startup is part of an accelerator
  • Amount_to_raise: Amount required to raise
  • funding at the time of application: Total successful funding at the time of application
  • Number_employees: Total number of employees
  • Revenue_1month: Revenue in the first month
  • Revenue_2month: Revenue in the second month
  • Revenue_3month: Revenue in the third month
  • Users_1month: Total number of users in the first month
  • Users_2month: Total number of users in the second month
  • Users_3month: Total number of users in the third month
  • Paying_users_1month: Total number of paying users in the first month
  • Paying_users_2month: Total number of paying users in the second month
  • Paying_users_3month: Total number of paying users in the third month
  • Number_transactions_1month: Total number of transactions in the first month
  • Number_transactions_2month: Total number of transactions in the second month
  • Number_transactions_3month: Total number of transactions in the third month
  • Burn_rate: Rate at which dollars are burned
  • number of competitors: Total number of competitors with the same business model
  • Role_0: First founder role
  • Gender_0: First founder gender
  • Country_0: First founder country of origin
  • Role_1: Second founder role
  • Gender_1: Second founder gender
  • Company_logo: Whether the company has a logo
  • Amount of funding raised
  • Total Funding: Total funds raised
  • FUNDED: Flag indicating whether the company is funded
  • status_of_funding_AMOUNT: Amount of funding
  • status_of_funding_STATUS: Funding status
  • Website_active: Whether the website is still active
  • Lat: Latitude
  • Lon: Longitude
  • Year_incorporated: Year of incorporation
  • Year_segment: Segment of the incorporation year
  • revenue_model_commission_Imputed: Whether the company has a commission revenue model
  • revenue_model_product_Imputed: Whether the company has a product revenue model
  • revenue_model_on-demand_Imputed: Whether the company has an on-demand revenue model
  • revenue_model_subscription_Imputed: Whether the company has a subscription revenue model
  • revenue_model_freemium_Imputed: Whether the company has a freemium revenue model
  • revenue_model_advertising_Imputed: Whether the company has an advertising revenue model
  • revenue_model_licensing_Imputed: Whether the company has a licensing revenue model
  • customer_type – B-to-B-to-C_Imputed: Whether the customer type is B-to-B-to-C
  • customer_type – B-to-B_Imputed: Whether the customer type is B-to-B
  • customer_type – B-to-C_Imputed: Whether the customer type is B-to-C
  • customer_type – B-to-G_Imputed: Whether the customer type is B-to-G
  • Percent_increase_in_revenue: Percent increase in revenue
  • Percent_increase_in_users: Percent increase in the number of users

 

Before using this dataset for modeling, some preprocessing tasks need to be done to prepare the input feature and the target variables.

 

Preparing Input Features

Although the data had already been prepared, we still need some further preparation before modeling. These are the tasks we performed to prepare the input features:

  • Clean the industry column. After examining this column, we found that some industries are related to each other, so we grouped similar categories into new ones. This reduces the number of categories from 39 to 30. Then we perform one-hot encoding on this column.
  • Clean the business stage. We drop two business stages from this column, Dead and Acquired, because only one startup is in that stage. Then we perform label encoding to convert the strings to integer values that follow the business stage order.
  • Clean the incorporated year column. Using this column, we remove startups that were incorporated before 2012.
  • Clean funding status. Here we group the funding statuses that have similar names. Then we perform label encoding based on the order of these statuses.
  • Calculating team size. We calculate the team size by adding the number of employees and the number of team members.
  • Calculating company growth. Here we calculate the growth of the company by averaging the last three months of company performance. The metrics include total revenue, number of users, number of paying users, and transactions.
  • Dropping unused columns. Lastly, we drop the columns that will not be used for modeling.
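
As a small illustration of the label encoding, team size, and averaging steps listed above, here is a minimal sketch on a toy DataFrame. The rows are made-up values, and `Series.map` is used for the encoding as one possible approach; only the column names and the stage order are taken from this article.

```python
import pandas as pd

# Toy rows with made-up values; column names mirror the article's dataset.
toy = pd.DataFrame({
    "business_stage": ["Idea Stage", "Revenue Stage", "Other"],
    "number_employees": [2, 10, 0],
    "Number of team members": [3, 4, 1],
    "revenue_1month": [0.0, 900.0, 30.0],
    "revenue_2month": [0.0, 1200.0, 60.0],
    "revenue_3month": [0.0, 1500.0, 90.0],
})

# Label encoding that follows the business stage order.
stage_order = {
    "Other": 0, "Idea Stage": 1, "Development Stage": 2,
    "Beta Testing Stage": 3, "Pre-Revenue Stage": 4,
    "Revenue Stage": 5, "Expansion Stage": 6,
}
toy["business_stage"] = toy["business_stage"].map(stage_order)

# Team size is the sum of employees and team members.
toy["team_size"] = toy["number_employees"] + toy["Number of team members"]

# Growth metric: row-wise mean over the three monthly columns.
revenue_cols = ["revenue_1month", "revenue_2month", "revenue_3month"]
toy["avg_revenue"] = toy[revenue_cols].mean(axis=1)

print(toy[["business_stage", "team_size", "avg_revenue"]])
```

Taking the row-wise mean over an explicit list of columns makes it easy to check that all three months are included in each average.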

 

We can use the code below to clean the dataset.

def data_prep(df): 
  # 1. Clean Industry column
  rules = {

    'Financial': ((df['industry'] == 'Financial Services: Fin Tech') |
                 (df['industry'] == 'Financial Services') |
                 (df['industry'] == 'Finance')),
    'Consumer': ((df['industry'] == 'Consumer Services') |
                 (df['industry'] == 'Consumer Products') |
                 (df['industry'] == 'Consumer Goods')),
    'Health': ((df['industry'] == 'Health: Health Tech') |
               (df['industry'] == 'MedTech / BioTech') |
               (df['industry'] == 'Health Tech') |
               (df['industry'] == 'Health / Wellness')),
    'Enterprise': ((df['industry'] == 'Enterprise Services') |
                   (df['industry'] == 'Enterprise Products')),
    'Education': ((df['industry'] == 'Education: Ed Tech') |
               (df['industry'] == 'Education'))    
     }
  

  df['industry'] = np.select(rules.values(), rules.keys(), default=df['industry'])  
  df = pd.get_dummies(df, columns=["industry"], prefix=["industry_type"] )
  

  # 2. Clean Business Stage Columns
  df.drop(df[(df['business_stage'] =='Dead') | 
           (df['business_stage'] =='Acquired')].index, inplace = True)  
  rules = {

    0: (df['business_stage'] == 'Other'),
    1: (df['business_stage'] == 'Idea Stage'),
    2: (df['business_stage'] == 'Development Stage'),
    3: (df['business_stage'] == 'Beta Testing Stage'),
    4: (df['business_stage'] == 'Pre-Revenue Stage'),
    5: (df['business_stage'] == 'Revenue Stage'),
    6: (df['business_stage'] == 'Expansion Stage'),
  } 

  df['business_stage'] = np.select(rules.values(), rules.keys(), default=df['business_stage']) 

   # 3. Clean Incorporation Year
  df.drop(df[df['year_incorporated'] < 2012].index, inplace = True)
  

  # 4. Clean Funding Status
  rules = {
    'Closed': ((df['status_of_funding_STATUS'] == 'Closed') |
                 (df['status_of_funding_STATUS'] == 'Round is Closed') |
                 (df['status_of_funding_STATUS'] == 'Round closed')),
    'Started': ((df['status_of_funding_STATUS'] == 'Round started') |
               (df['status_of_funding_STATUS'] == 'Just started') |
               (df['status_of_funding_STATUS'] == 'Just started raising')),
    'About to start': (df['status_of_funding_STATUS'] == 'Round about to start')
  }
 

  df['status_of_funding_STATUS'] = np.select(rules.values(), rules.keys(), default=df['status_of_funding_STATUS'])
  
  rules = {
    0: (df['status_of_funding_STATUS'] == 'No funding'),
    1: (df['status_of_funding_STATUS'] == 'Good leads'),
    2: (df['status_of_funding_STATUS'] == 'FUNDED'),
    3: (df['status_of_funding_STATUS'] == 'Will start in next 6-12 months'),
    4: (df['status_of_funding_STATUS'] == 'Will start in next 3-6 months'),
    5: (df['status_of_funding_STATUS'] == 'About to start'),
    6: (df['status_of_funding_STATUS'] == 'Started'),
    7: (df['status_of_funding_STATUS'] == 'Committed less than 50% of round'),
    8: (df['status_of_funding_STATUS'] == 'Committed more than 50% of round'),
    9: (df['status_of_funding_STATUS'] == 'Closing the round in the next 2-4 weeks'),
    10: (df['status_of_funding_STATUS'] == 'About to close'),
    11: (df['status_of_funding_STATUS'] == 'Closed'),
    12: (df['status_of_funding_STATUS'] == 'Not applicable')
  } 

  df['status_of_funding_STATUS'] = np.select(rules.values(), rules.keys(), default=df['status_of_funding_STATUS'])  

  # 5. Calculate Team Size

  df['team_size'] = df['number_employees'] + df['Number of team members']

 
  # 6. Calculate Company Growth

  df['avg_revenue'] = (df['revenue_1month']+df['revenue_2month']+df['revenue_3month'])/3
  df['avg_users'] = (df['users_1month']+df['users_2month']+df['users_3month'])/3
  df['avg_paying_users'] = (df['paying_users_1month']+df['paying_users_2month']+df['paying_users_3month'])/3
  df['avg_number_transaction'] = (df['number_transactions_1month']+df['number_transactions_2month']+df['number_transactions_3month'])/3
  

  # 7. Drop unused columns

  df = df.drop(['role_0','gender_0','country_0','role_1','gender_1'
                ,'country', 'Number of team members', 'number_employees'
                ,'revenue_1month','revenue_2month','revenue_3month'
                ,'users_1month','users_2month','users_3month'
                ,'paying_users_1month','paying_users_2month','paying_users_3month'
                ,'number_transactions_1month','number_transactions_2month','number_transactions_3month'
                ,'lat','lon','percent_increase_in_revenue','percent_increase_in_users'],axis = 1)

  return df
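
To see how the rule-based grouping with np.select followed by one-hot encoding behaves in isolation, here is a minimal sketch on a toy column. The four rows are made-up values (including the "Gaming" category), and `isin` is used as a shorter way of writing the chained comparisons from the function above.

```python
import numpy as np
import pandas as pd

# Toy industry column with made-up rows.
toy = pd.DataFrame({
    "industry": ["Finance", "Financial Services", "Education: Ed Tech", "Gaming"]
})

# Each key is the new category; each value is a boolean mask of rows to regroup.
rules = {
    "Financial": toy["industry"].isin(
        ["Financial Services: Fin Tech", "Financial Services", "Finance"]),
    "Education": toy["industry"].isin(["Education: Ed Tech", "Education"]),
}

# Rows matching no rule keep their original value through `default`.
toy["industry"] = np.select(
    list(rules.values()), list(rules.keys()), default=toy["industry"])

# One-hot encode the regrouped column.
encoded = pd.get_dummies(toy, columns=["industry"], prefix=["industry_type"])

print(toy["industry"].tolist())
print(sorted(encoded.columns))
```

Categories that match no rule pass through unchanged, so only the related industries are merged.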

 

Preparing Target Variable

We use two definitions of a successful startup. The first is a startup that has total funding of more than $750,000. The second is a startup that has operated for more than five years. We convert these metrics into label columns that serve as our target variables, then create a model for each and explain how the input variables contribute to predicting it.

This is the code that we used to calculate the target variables.

def calculate_target(df):

  # Label startups that have raised more than $750,000 and save the label as raised
  ## Create a new label column

  df.loc[df['Total Funding']>750000,'raised'] = 1
  df.loc[df['Total Funding']<=750000,'raised'] = 0

  ## Delete total funding column
  df.drop(['Total Funding'], axis =1, inplace = True)

  # Label startups by operating age (1 when the age is five years or more)
  ## Calculate age, with 2021 as the reference year

  df['Age_today'] = 2021-df['year_incorporated']

  ## Label startups

  df["is_more_than_5_years"] = np.where(df['Age_today'] >= 5, 1, 0)

  ## Drop the column used to build the label

  df = df.drop(['Age_today'], axis = 1)

  return df
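
As a quick sanity check of the labeling logic, here is a minimal sketch that applies the same thresholds to three made-up startups. The boolean comparison cast to int is an equivalent shorthand for the two df.loc assignments above; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Three made-up startups: above, exactly at, and below the funding threshold.
toy = pd.DataFrame({
    "Total Funding": [1_000_000, 750_000, 0],
    "year_incorporated": [2014, 2016, 2019],
})

# 1 when total funding is strictly greater than $750,000, else 0.
toy["raised"] = (toy["Total Funding"] > 750_000).astype(int)

# 1 when the age in 2021 is five years or more, else 0.
age = 2021 - toy["year_incorporated"]
toy["is_more_than_5_years"] = np.where(age >= 5, 1, 0)

print(toy)
```

A startup sitting exactly at $750,000 is labeled 0, while one incorporated in 2016 (age 5 in 2021) is labeled 1, which shows how the two boundary conditions are handled.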

 

Step 3 – Model Building 

As mentioned in Step 2, we have two target variables. Since we want to understand the contribution of each predictor to each of them, we create two machine learning models, one per target. We first experimented with various machine learning models and, after comparing their performance, chose XGBoost as our final model. To learn more about XGBoost, you can check the documentation here.

Model for Total Raised

First, we created a prediction model with total funding as the target variable, stored in the column named raised. We use all the prepared input variables except the information related to funding. We also handle the class imbalance with the SMOTE technique. As a result, we have a model with a test set accuracy of 95.3% and the following confusion matrix.

Confusion matrix of the total raised model. Source: Omdena

You can check the code below:

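def modeling_total_raised(df):

  # Drop the identifier, the funding information, and the other target.
  # year_incorporated is kept because it is one of the features
  # interpreted for this model in Step 4.
  df = df.drop(['funding at the time of application',
                'company_id',
                'is_more_than_5_years'], axis = 1)

  # Set target and predictor
  y = df['raised']
  X = df.drop(['raised'], axis = 1)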

 
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

 
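  # Handling Imbalance with SMOTE
  over = SMOTE(sampling_strategy=1)

  X_train, y_train = over.fit_resample(X_train, y_train)
  # The test set is oversampled as well, so the metrics below are
  # computed on a balanced test set rather than the original class mix.
  X_test, y_test = over.fit_resample(X_test, y_test)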
 
  y_train.value_counts()
 
  model_xgb = xgboost.XGBClassifier() 
  model_xgb.fit(X_train,y_train)
 
  # Evaluation

  y_pred_xgb = model_xgb.predict(X_test)
  y_prob_xgb = model_xgb.predict_proba(X_test)
 
  matrix_xgb = confusion_matrix(y_test,y_pred_xgb)
  report_xgb = classification_report(y_test,y_pred_xgb)
 

  # Evaluation Summary

  print('Training Set Accuracy: {}%'.format(round(model_xgb.score(X_train, y_train)*100,2)))
  print('Testing Set Accuracy: {}%'.format(round(model_xgb.score(X_test, y_test)*100,2)))
 

  print('\nConfusion Matrix: \n',matrix_xgb)
  print('\nModel Report: \n', report_xgb)
  
  return model_xgb
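
To make the printed confusion matrix and accuracy easier to read, here is a minimal sketch that builds a confusion matrix from made-up labels and derives accuracy, precision, and recall from it. It uses pd.crosstab with actual labels as rows and predicted labels as columns, which is the same orientation as scikit-learn's confusion_matrix; the labels are illustrative and are not results from the project.

```python
import pandas as pd

# Made-up ground truth and predictions for eight startups.
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
matrix = pd.crosstab(y_true, y_pred)

tn, fp = matrix.loc[0, 0], matrix.loc[0, 1]
fn, tp = matrix.loc[1, 0], matrix.loc[1, 1]

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted successes that are real
recall = tp / (tp + fn)              # share of real successes that are found

print(matrix)
print(accuracy, precision, recall)
```

Accuracy alone can hide how the errors are split between the two classes, which is why the function above also prints the matrix and the classification report.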

 

Model for Operating Years

Second, we build a classification model whose target is how long the company can operate. In our case, a company that can run for more than five years is considered successful. We follow the same approach as for the total raised model. The resulting model reaches 83% accuracy on the test set.

Confusion matrix of the operating year model. Source: Omdena

You can check the code below:

def modeling_operating_years(df):

  df = df.drop(['year_incorporated',
                      'funding at the time of application',
                      'company_id',
                      'raised'], axis = 1)
  

  # Set target and predictor
  y = df['is_more_than_5_years']
  X = df.drop(['is_more_than_5_years'], axis = 1)
 
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
 
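  # Handling Imbalance with SMOTE
  over = SMOTE(sampling_strategy=1)
  X_train, y_train = over.fit_resample(X_train, y_train)
  # The test set is oversampled as well, so the metrics below are
  # computed on a balanced test set rather than the original class mix.
  X_test, y_test = over.fit_resample(X_test, y_test)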
 
  y_train.value_counts()
 
  model_xgb = xgboost.XGBClassifier()
 
  model_xgb.fit(X_train,y_train) 

  # Evaluation
  y_pred_xgb = model_xgb.predict(X_test)
  y_prob_xgb = model_xgb.predict_proba(X_test)
 
  matrix_xgb = confusion_matrix(y_test,y_pred_xgb)
  report_xgb = classification_report(y_test,y_pred_xgb)
 
  # Evaluation Summary

  print('Training Set Accuracy: {}%'.format(round(model_xgb.score(X_train, y_train)*100,2)))
  print('Testing Set Accuracy: {}%'.format(round(model_xgb.score(X_test, y_test)*100,2)))
 
  print('\nConfusion Matrix: \n',matrix_xgb)
  print('\nModel Report: \n', report_xgb)  

  return model_xgb

 

Step 4 – Explaining Model

Now, let's get into the most exciting part, which is explaining the model. We will be using SHAP values to interpret our model. To simplify, each SHAP value is the share of the difference between the model's expected output and its output for a given startup that is attributed to one feature; summed over all features, the SHAP values equal that difference. Note that the SHAP implementation explains the margin output of the model, not the transformed output. This means that the units of the SHAP values for this model are log-odds. Large positive values mean a startup is likely to succeed, while large negative values mean the opposite. To help with the interpretation, the graph below shows how log-odds convert to the probability of success.

This graph shows the relation of log odds of success with the probability of success. Source: Omdena
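
To make the additive property and the log-odds units concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny hypothetical model and then converts the result to a probability. The `margin` function, the feature names `a` and `b`, and the background rows are all made up for illustration; this enumerates feature subsets directly from the definition and is not the algorithm that shap.TreeExplainer uses.

```python
from itertools import combinations
from math import exp, factorial

# Hypothetical log-odds model with an interaction term (not the XGBoost model).
def margin(x):
    return -1.0 + 2.0 * x["a"] + 1.0 * x["b"] + 0.5 * x["a"] * x["b"]

# Made-up background data and the startup we want to explain.
background = [{"a": 0, "b": 0}, {"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
startup = {"a": 1, "b": 1}
features = list(startup)

def value(subset):
    """Average margin when features in `subset` take the startup's values
    and the remaining features come from each background row."""
    total = 0.0
    for row in background:
        mixed = {f: (startup[f] if f in subset else row[f]) for f in features}
        total += margin(mixed)
    return total / len(background)

def shapley(feature):
    """Weighted average of the feature's marginal contribution over all subsets."""
    others = [f for f in features if f != feature]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += weight * (value(set(subset) | {feature}) - value(set(subset)))
    return phi

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + exp(-z))

expected = value(set())                      # expected model output (log-odds)
phis = {f: shapley(f) for f in features}     # one contribution per feature
prediction = margin(startup)                 # model output for this startup

print("expected:", expected, "contributions:", phis)
print("expected + contributions:", expected + sum(phis.values()))
print("prediction (log-odds):", prediction, "probability:", sigmoid(prediction))
```

The contributions add up in log-odds space, and only the final sum is passed through the sigmoid, so an equal SHAP value can correspond to different changes in probability depending on where the prediction starts.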

 

Model for Total Raised

The shap Python package makes this process relatively easy. We first call shap.TreeExplainer(model_xgb) to build an explainer, then call explainer(X) to compute the SHAP values for every prediction, and finally call shap.summary_plot(shap_values, X). You can run the code below:

explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)


As a result, we will get a summary plot of our SHAP values for each predictor. Every company has one dot on each row. The x position of the dot is the impact of that feature on the model’s prediction for the company, and the color of the dot represents the value of that feature for the company. Dots that don’t fit on the row pile up to show density.

The SHAP value summary for total raised model. Source: Omdena

By doing this for all features, we can see which features strongly affect the prediction results and which affect them only a little.

In summary, these are some interesting factors that influence whether our model predicts that a company will secure total funding of more than $750,000.

  • Year incorporated: The later the year of incorporation, the less likely the startup is to be considered successful. This makes sense, since a newer startup will typically have secured less funding than one that has been operating longer.
  • Industry type other: If the industry category is Other, the startup is less likely to be successful. This might be because the industry in which the startup operates is not popular and has minimal market potential, which makes it less attractive to investors.
  • Burn rate: A higher burn rate is a good indicator of success in securing total funding. One possible explanation is that, with a high burn rate, the startup can grow faster and then attract more investors.

 

Those are examples of how we can interpret the summary plot. You can continue to explain the other variables.

In addition, we can break down the contribution further by creating a scatter plot of each feature's value against its SHAP value. Here are some examples:

The SHAP value example for individual features of the total raised model. Source: Omdena

You can achieve this by running these lines of code.

# One scatter plot per feature: feature value against its SHAP value
for col in X.columns:
  shap.plots.scatter(shap_values[:, col])

 

Model for Operating Years

The same approach was applied for our second model as well. As a result, we got a SHAP value summary to understand the impact of each feature on the prediction of which startup is able to operate for more than five years.

The SHAP value summary for total operating years. Source: Omdena

The average number of transactions was the most influential factor in the model output. Most of the higher values push the prediction toward the positive class, although some low values also contributed positively. We can also see that being incorporated pushes the prediction toward a positive result.

For this target variable, let's look at the revenue model in more detail. How does the revenue model affect the success of the company?

The SHAP value example for individual features of the total operating years. Source: Omdena

Looking at the plot above, we can see that the revenue models contributing to positive results were product, advertising, licensing, and subscription. On the other hand, startups with an on-demand, commission, or advertising revenue model show the opposite effect.

 

Conclusion

We have arrived at the end of this article. We went through a step-by-step guide to building a machine learning model that predicts which startups have a high potential for success, and to explaining that model. We began by understanding the goal, preparing the dataset, and creating the model. Then we used SHAP values to understand the contribution of each input feature to the prediction results. This step-by-step method should give you an idea of how to create a model and explain its results. The shap package is easy to install through pip, and we hope it helps you explore your models with confidence.

 
