Predicting the Important Factors of a Successful Startup Using SHAP Values
July 12, 2021
A step-by-step guide on creating a machine learning model to support startup investment strategies, using SHAP (SHapley Additive exPlanations) values to understand the contribution of each feature to the model and the importance of each factor in a startup's success.
Imagine you have already created a robust machine learning model. Then someone asks you how each input feature contributes to the model output. This is a common scenario where we need to explain our model. However, it can be challenging, especially with more complex machine learning models. So, how can we explain our machine learning model easily?
This project was part of the Omdena challenge, where we had an opportunity to work together with many startups. The main focus of our partners was to find a way of maximizing job and economic impact through startup investment. Hence, we worked together to achieve those goals in this project.
Problem and Objectives
In this project, we identified two main problems related to startup investment that we wanted to focus on: the lack of research, and how to predict the success of a startup and boost investor confidence to start investing.
- Research: While there is research on the purely financial factors in startup success, research on early-stage startups is sparse (especially in the first two years after founding). This is because early startups generally won't have much financial information available or have not generated large enough sums of money for financial data to be of much value in evaluating success. While some qualitative research data is available, it is not particularly suitable for investors looking to make wise decisions.
- Startup Success Prediction & Investor Confidence: The market for environmentally-friendly investments is rapidly growing. Unfortunately, many of the wealthiest investors are often reluctant to invest in startups – especially impact startups – due to the lack of relevant research available, the higher risks involved, and the lack of business history that can predict future success. Investing in startups is globally understood to be risky and not an optimal asset management strategy. Yet plenty of startups with tremendous potential may lose out on funding and success because of this bias and the lack of insight, trust, and understanding of startup investing. If a project doesn't already have funding, it becomes difficult to raise new funding.
To address those problems, we needed to find a way to help investors decide which startups have a high potential for success and can make a significant impact. Thus, our goal was to help our partners create a machine learning model that predicts high-potential startups and explains the contribution of each input factor to the model's results.
The Solution
In this two-month challenge, we collaborated to build and deploy a machine learning model based on historical investment data.
We began by understanding what the problems were and how we could address them. Then our team collected data from various sources. From that data, the collaborators performed data exploration and data transformation to prepare for modeling. Several machine learning techniques were tried, and the best-performing model was deployed using the Mia platform. Read more here about the data preparation and dataset-building process and how the end app was deployed.
This article focuses on one part of the process: the modeling. Here we will walk through a step-by-step guide on creating the machine learning model itself and using SHAP (SHapley Additive exPlanations) values to understand the contribution of each feature to the model. To learn more about SHAP values, you can refer to this GitHub page.
Step 1 – Importing Library
First, let's import the required libraries. In this project, we use various popular libraries such as pandas, NumPy, scikit-learn, imbalanced-learn, XGBoost, and SHAP.
import pandas as pd
import numpy as np
import xgboost
import shap
import seaborn as sns
import matplotlib.pyplot as plt
import pycaret
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import model_selection
from imblearn.over_sampling import SMOTE  # used later to handle class imbalance
Step 2 – Preparing the Dataset
The dataset we worked on contains information about startups from 113 countries, across industries, and in different business stages.
This is the list of data points available in the dataset:
- Company_id: Unique identifier of the startup
- Country: Startup's country of origin
- Industry: Company industry type
- Business_stage: Stage of business
- Incorporated: Flag indicating whether the startup is incorporated
- Incorporation_date: Year of incorporation
- Number of team members: Total number of team members
- Accelerator: Whether the startup is part of an accelerator
- Amount_to_raise: Amount required to raise
- funding at the time of application: Total successful funding at the time of application
- Number_employees: Total number of employees
- Revenue_1month: Revenue in the first month
- Revenue_2month: Revenue in the second month
- Revenue_3month: Revenue in the third month
- Users_1month: Total number of users in the first month
- Users_2month: Total number of users in the second month
- Users_3month: Total number of users in the third month
- Paying_users_1month: Total number of users that pay in the first month
- Paying_users_2month: Total number of users that pay in the second month
- Paying_users_3month: Total number of users that pay in the third month
- Number_transactions_1month: Total number of transactions in the first month
- Number_transactions_2month: Total number of transactions in the second month
- Number_transactions_3month: Total number of transactions in the third month
- Burn_rate: Rate at which money is burned
- number of competitors: Total number of competitors with the same business model
- Role_0: First founder's role
- Gender_0: First founder's gender
- Country_0: First founder's country of origin
- Role_1: Second founder's role
- Gender_1: Second founder's gender
- Company_logo: Whether the company has a logo
- Amount of funding raised
- Total Funding: Total funding raised
- FUNDED: Flag indicating whether the company is funded
- status_of_funding_AMOUNT: Amount of funding
- status_of_funding_STATUS: Funding status
- Website_active: Whether the website is still active
- Lat: Latitude
- Lon: Longitude
- Year_incorporated: Year of incorporation
- Year_segment: Segment of the incorporation year
- revenue_model_commission_Imputed: Whether the company has a commission revenue model
- revenue_model_product_Imputed: Whether the company has a product revenue model
- revenue_model_on-demand_Imputed: Whether the company has an on-demand revenue model
- revenue_model_subscription_Imputed: Whether the company has a subscription revenue model
- revenue_model_freemium_Imputed: Whether the company has a freemium revenue model
- revenue_model_advertising_Imputed: Whether the company has an advertising revenue model
- revenue_model_licensing_Imputed: Whether the company has a licensing revenue model
- customer_type – B-to-B-to-C_Imputed: Whether the customer type is B-to-B-to-C
- customer_type – B-to-B_Imputed: Whether the customer type is B-to-B
- customer_type – B-to-C_Imputed: Whether the customer type is B-to-C
- customer_type – B-to-G_Imputed: Whether the customer type is B-to-G
- Percent_increase_in_revenue: Percent increase in revenue
- Percent_increase_in_users: Percent increase in the number of users
Before using this dataset for modeling, some preprocessing needs to be done to prepare the input features and the target variables.
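The article assumes the dataset has already been loaded into a pandas DataFrame. As a minimal sketch (the file name startup_data.csv is a placeholder, not the actual source):

# Load the raw dataset; the file name here is a placeholder.
df = pd.read_csv("startup_data.csv")

# Quick sanity checks before preprocessing
print(df.shape)                                               # number of startups and columns
print(df.isna().sum().sort_values(ascending=False).head(10))  # fields with the most missing values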
Preparing Input Features
Although the data had been prepared before, we still needed to perform some data preparation for modeling. Here is the list of tasks we performed to prepare the input features:
- Clean the industry column. After examining this column, we found that some industries relate to each other, so we grouped similar categories into new ones, reducing the number of categories from 39 to 30. We then one-hot encoded this column.
- Clean the business stage. We dropped two business stages, Dead and Acquired, because only one startup was in each of those stages. We then label-encoded the column, converting the strings to integer values based on the business-stage order.
- Clean the incorporated year column. Using the information in this column, we removed startups incorporated before 2012.
- Clean the funding status. We grouped funding statuses with similar names, then label-encoded the column based on the order of the statuses.
- Calculate team size. We calculated the team size by adding the number of employees and the number of team members.
- Calculate company growth. We calculated the company's growth by averaging the last three months of company performance. The metrics include total revenue, number of users, number of paying users, and number of transactions.
- Drop unused columns. Lastly, we dropped the columns that would not be used for modeling.
We can use the code below to clean the dataset.
def data_prep(df):
    # 1. Clean the industry column: group related industries into broader categories
    rules = {
        'Financial': ((df['industry'] == 'Financial Services: Fin Tech') |
                      (df['industry'] == 'Financial Services') |
                      (df['industry'] == 'Finance')),
        'Consumer': ((df['industry'] == 'Consumer Services') |
                     (df['industry'] == 'Consumer Products') |
                     (df['industry'] == 'Consumer Goods')),
        'Health': ((df['industry'] == 'Health: Health Tech') |
                   (df['industry'] == 'MedTech / BioTech') |
                   (df['industry'] == 'Health Tech') |
                   (df['industry'] == 'Health / Wellness')),
        'Enterprise': ((df['industry'] == 'Enterprise Services') |
                       (df['industry'] == 'Enterprise Products')),
        'Education': ((df['industry'] == 'Education: Ed Tech') |
                      (df['industry'] == 'Education'))
    }
    df['industry'] = np.select(list(rules.values()), list(rules.keys()), default=df['industry'])
    df = pd.get_dummies(df, columns=['industry'], prefix=['industry_type'])

    # 2. Clean the business stage column: drop rare stages, then label-encode by stage order
    df.drop(df[(df['business_stage'] == 'Dead') |
               (df['business_stage'] == 'Acquired')].index, inplace=True)
    rules = {
        0: (df['business_stage'] == 'Other'),
        1: (df['business_stage'] == 'Idea Stage'),
        2: (df['business_stage'] == 'Development Stage'),
        3: (df['business_stage'] == 'Beta Testing Stage'),
        4: (df['business_stage'] == 'Pre-Revenue Stage'),
        5: (df['business_stage'] == 'Revenue Stage'),
        6: (df['business_stage'] == 'Expansion Stage'),
    }
    df['business_stage'] = np.select(list(rules.values()), list(rules.keys()), default=df['business_stage'])

    # 3. Clean the incorporation year: remove startups incorporated before 2012
    df.drop(df[df['year_incorporated'] < 2012].index, inplace=True)

    # 4. Clean the funding status: merge statuses with similar names, then label-encode by order
    rules = {
        'Closed': ((df['status_of_funding_STATUS'] == 'Closed') |
                   (df['status_of_funding_STATUS'] == 'Round is Closed') |
                   (df['status_of_funding_STATUS'] == 'Round closed')),
        'Started': ((df['status_of_funding_STATUS'] == 'Round started') |
                    (df['status_of_funding_STATUS'] == 'Just started') |
                    (df['status_of_funding_STATUS'] == 'Just started raising')),
        'About to start': (df['status_of_funding_STATUS'] == 'Round about to start')
    }
    df['status_of_funding_STATUS'] = np.select(list(rules.values()), list(rules.keys()),
                                               default=df['status_of_funding_STATUS'])
    rules = {
        0: (df['status_of_funding_STATUS'] == 'No funding'),
        1: (df['status_of_funding_STATUS'] == 'Good leads'),
        2: (df['status_of_funding_STATUS'] == 'FUNDED'),
        3: (df['status_of_funding_STATUS'] == 'Will start in next 6-12 months'),
        4: (df['status_of_funding_STATUS'] == 'Will start in next 3-6 months'),
        5: (df['status_of_funding_STATUS'] == 'About to start'),
        6: (df['status_of_funding_STATUS'] == 'Started'),
        7: (df['status_of_funding_STATUS'] == 'Committed less than 50% of round'),
        8: (df['status_of_funding_STATUS'] == 'Committed more than 50% of round'),
        9: (df['status_of_funding_STATUS'] == 'Closing the round in the next 2-4 weeks'),
        10: (df['status_of_funding_STATUS'] == 'About to close'),
        11: (df['status_of_funding_STATUS'] == 'Closed'),
        12: (df['status_of_funding_STATUS'] == 'Not applicable')
    }
    df['status_of_funding_STATUS'] = np.select(list(rules.values()), list(rules.keys()),
                                               default=df['status_of_funding_STATUS'])

    # 5. Calculate team size
    df['team_size'] = df['number_employees'] + df['Number of team members']

    # 6. Calculate company growth: average the last three months of each metric
    df['avg_revenue'] = (df['revenue_1month'] + df['revenue_2month'] + df['revenue_3month']) / 3
    df['avg_users'] = (df['users_1month'] + df['users_2month'] + df['users_3month']) / 3
    df['avg_paying_users'] = (df['paying_users_1month'] + df['paying_users_2month'] + df['paying_users_3month']) / 3
    df['avg_number_transaction'] = (df['number_transactions_1month'] + df['number_transactions_2month'] + df['number_transactions_3month']) / 3

    # 7. Drop columns that are no longer needed for modeling
    df = df.drop(['role_0', 'gender_0', 'country_0', 'role_1', 'gender_1',
                  'country', 'Number of team members', 'number_employees',
                  'revenue_1month', 'revenue_2month', 'revenue_3month',
                  'users_1month', 'users_2month', 'users_3month',
                  'paying_users_1month', 'paying_users_2month', 'paying_users_3month',
                  'number_transactions_1month', 'number_transactions_2month', 'number_transactions_3month',
                  'lat', 'lon', 'percent_increase_in_revenue', 'percent_increase_in_users'], axis=1)
    return df
Preparing Target Variable
We use two definitions of a successful startup. The first is a startup that has raised total funding of more than $750,000. The second is a startup that has operated for more than five years. Thus, we convert these metrics into label columns that will serve as our target variables. We will create a model for each target and explain the input variables' contributions to its predictions. This is the code we used to calculate the target variables.
def calculate_target(df):
    # Label startups that have raised more than $750,000 and save it as 'raised'
    ## Create a new label column
    df.loc[df['Total Funding'] > 750000, 'raised'] = 1
    df.loc[df['Total Funding'] <= 750000, 'raised'] = 0
    ## Delete the total funding column
    df.drop(['Total Funding'], axis=1, inplace=True)

    # Label startups that can operate for more than five years
    ## Calculate age
    df['Age_today'] = 2021 - df['year_incorporated']
    ## Label the startup
    df['is_more_than_5_years'] = np.where(df['Age_today'] >= 5, 1, 0)
    ## Drop the column used for the label
    df = df.drop(['Age_today'], axis=1)
    return df
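With both helper functions defined, the preparation steps can be chained before modeling. A minimal usage sketch, assuming the raw data has already been loaded into a DataFrame named raw_df (a hypothetical name):

df = data_prep(raw_df.copy())   # clean and engineer the input features
df = calculate_target(df)       # add the 'raised' and 'is_more_than_5_years' labels
print(df[['raised', 'is_more_than_5_years']].mean())  # check how imbalanced each target is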
Step 3 – Model Building
As mentioned in Step 2, we have two target variables. Since we want to understand the contribution of each predictor, we create two machine learning models, one for each target. Initially, we experimented with various machine learning models; after comparing their performances, we settled on XGBoost as our final model. To learn more about XGBoost, you can check the documentation here.
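The article does not show the comparison code (the pycaret import in Step 1 suggests its model-comparison utilities may have been used). As an illustration only, here is a minimal scikit-learn sketch of how candidate models could be benchmarked with cross-validation; the candidate list and scoring metric are assumptions, and X_train/y_train come from the train/test split shown below:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical set of candidate models to compare on the training split
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(random_state=123),
    'xgboost': xgboost.XGBClassifier(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='f1')
    print('{}: mean f1 = {:.3f} (+/- {:.3f})'.format(name, scores.mean(), scores.std()))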
Model for Total Raised
First, we created a prediction model with total funding as the target variable, stored in the column named raised. We use all the prepared input variables except the information related to funding. We also handle the imbalanced dataset with the SMOTE (Synthetic Minority Over-sampling Technique). As a result, we have a model with a test-set accuracy of 95.3% and the following confusion matrix.
You can check the code below:
def modeling_total_raised(df):
    # Drop identifiers, funding information, and the other target variable
    df = df.drop(['year_incorporated', 'funding at the time of application',
                  'company_id', 'is_more_than_5_years'], axis=1)

    # Set target and predictors
    y = df['raised']
    X = df.drop(['raised'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

    # Handle class imbalance with SMOTE
    over = SMOTE(sampling_strategy=1)
    X_train, y_train = over.fit_resample(X_train, y_train)
    X_test, y_test = over.fit_resample(X_test, y_test)

    # Fit the XGBoost classifier
    model_xgb = xgboost.XGBClassifier()
    model_xgb.fit(X_train, y_train)

    # Evaluation
    y_pred_xgb = model_xgb.predict(X_test)
    y_prob_xgb = model_xgb.predict_proba(X_test)
    matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
    report_xgb = classification_report(y_test, y_pred_xgb)

    # Evaluation summary
    print('Training Set Accuracy: {}%'.format(round(model_xgb.score(X_train, y_train) * 100, 2)))
    print('Testing Set Accuracy: {}%'.format(round(model_xgb.score(X_test, y_test) * 100, 2)))
    print('\nConfusion Matrix: \n', matrix_xgb)
    print('\nModel Report: \n', report_xgb)
    return model_xgb
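The confusion matrix above is printed as plain text. Since seaborn and matplotlib are already imported in Step 1, a heatmap is one possible way to visualize it; this sketch assumes matrix_xgb is made available outside the function (for example, also returned from it), and the class labels are illustrative:

def plot_confusion_matrix(matrix, labels=('not raised', 'raised')):
    # Draw the confusion matrix as an annotated heatmap
    sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()

plot_confusion_matrix(matrix_xgb)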
Model for Operating Years
Second, we build a classification model whose target variable reflects how long the company has operated; in our case, a company that has run for more than five years is considered successful. We follow the same approach as in the total-raised model. This model achieves 83% accuracy on the test set.
You can check the code below:
def modeling_operating_years(df):
    # Drop identifiers, funding information, and the other target variable
    df = df.drop(['year_incorporated', 'funding at the time of application',
                  'company_id', 'raised'], axis=1)

    # Set target and predictors
    y = df['is_more_than_5_years']
    X = df.drop(['is_more_than_5_years'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

    # Handle class imbalance with SMOTE
    over = SMOTE(sampling_strategy=1)
    X_train, y_train = over.fit_resample(X_train, y_train)
    X_test, y_test = over.fit_resample(X_test, y_test)

    # Fit the XGBoost classifier
    model_xgb = xgboost.XGBClassifier()
    model_xgb.fit(X_train, y_train)

    # Evaluation
    y_pred_xgb = model_xgb.predict(X_test)
    y_prob_xgb = model_xgb.predict_proba(X_test)
    matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
    report_xgb = classification_report(y_test, y_pred_xgb)

    # Evaluation summary
    print('Training Set Accuracy: {}%'.format(round(model_xgb.score(X_train, y_train) * 100, 2)))
    print('Testing Set Accuracy: {}%'.format(round(model_xgb.score(X_test, y_test) * 100, 2)))
    print('\nConfusion Matrix: \n', matrix_xgb)
    print('\nModel Report: \n', report_xgb)
    return model_xgb
Step 4 – Explaining Model
Now, let's get into the most exciting part: explaining the model. We will use SHAP values to interpret our model. Put simply, the SHAP values for a startup sum to the difference between the model's expected output and its actual output for that startup. Note that this SHAP implementation explains the margin output of the model, not the transformed output, which means the units of the SHAP values for this model are log-odds ratios. Large positive values mean a startup is likely to succeed, while large negative values mean the opposite. To help with interpretation, the graph below shows how changes in log odds convert to the probability of success.
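For reference, the conversion follows the sigmoid function: a log-odds value x maps to the probability 1 / (1 + e^(-x)). A minimal sketch that reproduces such a curve (the plot is illustrative, not the original figure):

# Map log odds to probability with the sigmoid function
log_odds = np.linspace(-6, 6, 200)
prob = 1 / (1 + np.exp(-log_odds))

plt.plot(log_odds, prob)
plt.xlabel('Model output (log odds)')
plt.ylabel('Probability of success')
plt.show()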
Model for Total Raised
The shap Python package makes this process relatively easy. We first call shap.TreeExplainer(model), then call explainer(X) to explain every prediction, and finally call shap.summary_plot(shap_values, X). You can run the code below:
explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)
As a result, we get a summary plot of the SHAP values for each predictor. Every company has one dot on each row. The x position of the dot is the impact of that feature on the model's prediction for that company, and the color of the dot represents the value of that feature for that company. Dots that don't fit on the row pile up vertically to show density.
By looking at this for all features, we can see which features strongly affect the prediction results and which barely affect them. In summary, these are the most interesting factors influencing our model's prediction of whether a company will secure total funding of more than $750,000:
- Year incorporated: The more recent the incorporation year, the less likely the startup is to be labeled successful. This makes sense, since a newer startup will have secured less funding than one that has been operating longer.
- Industry type other: If the industry category is Other, the startup is less likely to be successful. This might be because the industry in which it operates is niche and has limited market potential, which is not very attractive to investors.
- Burn rate: A higher burn rate was a good indicator of success in securing total funding. One possible explanation is that a high burn rate lets a startup grow faster, which in turn attracts more investors.
Those are examples of how we can interpret the summary plot; you can continue with the other variables. In addition, we can break down the contributions by creating a scatter plot of each feature against its SHAP values to get more detailed information. Here is an example.
You can achieve this by running these lines of code:
for i in X_train.columns:
shap.plots.scatter(shap_values[:,i])
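As a variation, shap.plots.scatter can also color each dot by another feature's value, which helps reveal interactions. A sketch for a single feature, where the column name burn_rate is an assumption based on the data dictionary:

# Color the burn-rate scatter by the feature shap picks as its strongest interaction
shap.plots.scatter(shap_values[:, 'burn_rate'], color=shap_values)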
Model for Operating Years
The same approach was applied to our second model. As a result, we got a SHAP value summary showing the impact of each feature on the prediction of which startups can operate for more than five years.
The average number of transactions was the most influential factor affecting the model output. We can see that most of the higher values push the prediction towards a positive result, although some low values also contributed positively. We can also see the incorporated flag: if the startup is incorporated, the prediction is pushed towards a positive result. For this target variable, let's look in more detail at the revenue model. How does the revenue model affect the success of the company?
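A summary plot restricted to the revenue-model indicators can answer this. A sketch, assuming those columns share the revenue_model prefix from the data dictionary:

# Select only the revenue-model indicator columns and plot their SHAP values
revenue_cols = [c for c in X.columns if c.startswith('revenue_model')]
idx = [X.columns.get_loc(c) for c in revenue_cols]
shap.summary_plot(shap_values.values[:, idx], X[revenue_cols])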
Looking at the plot above, we can see that the revenue models contributing to positive results were product, advertising, licensing, and subscription, while the on-demand and commission revenue models tended towards negative results.
Conclusion
Finally, we arrive at the end of this article. We went through a step-by-step guide to building a machine learning model that predicts high-potential startups and to explaining the model's results. We began by understanding the goal, preparing the dataset, and creating the model. Then we used SHAP values to understand the contribution of each input feature to the predictions. This step-by-step method should give you an idea of how to create and explain your own model results. The shap package is easy to install through pip, and we hope it helps you explore your models with confidence.
References
This article was written by Bima Putra Pratama.