A predictive modeling approach, from web scraping, data preparation, and dataset building to deployment in an easy-to-use app, for identifying the best startup investment strategies.
Impact investors aim to deploy capital in a way that creates positive social and/or environmental effects in addition to a financial return. There tend to be more opportunities for impact investments in developing and emerging markets than in developed markets. This is because far more capital is required in developing markets to achieve a real social impact, compared to developed markets.
One of the challenges of impact investing is how to measure success. The main reason is the difficulty of measuring consistently across different investments, given the vast array of impacts, a lack of commonality, and the difficulty of quantifying and comparing impact. Another obstacle is striking the right balance between achieving return targets and ensuring that the community issues investors are attempting to resolve receive adequate funding. This Omdena challenge aimed to build a tool that helps impact investors understand which startups are highly impactful.
Understanding the problem – What, Why, and How
Data science projects are an amalgamation of domain knowledge and engineering. To begin with, the team had to understand impact investing thoroughly so that the data and target could be defined accordingly.
Research suggests that financial factors can be leveraged to determine startup success. However, because early-stage startups typically lack sufficient financial information in their initial years, our goal was to identify the non-financial factors that are crucial for startup success.
The first challenge was to define success as a quantitative measure. We curated an extensive list of non-financial factors from previous research, historical data analysis, guided studies, and articles by trusted sources. The factors considered included social media presence, industry type, innovation, diversity and age of the founder, location, number of employees, and start date.
Thereafter, we identified the factors that could potentially be used to define the success of an early-stage startup and classified them as quantifiable or non-quantifiable. Examples include operating years, number of employees, being acquired or going public, average annual revenue growth, website status (active or not), and latest funding round.
Finally, after gaining a solid understanding of the know-how of our challenge, we were all set to start collecting the relevant data!
Data collection and dataset preparation
Data collection was a crucial part of this challenge, as the quantity and quality of the data determine how accurate the results of the model can be.
The initial approach to collect data was divided into the following subtasks:
- Developing a web scraping tool that will scrape a startup’s information given the URL of a startup’s website.
- Building a questionnaire that covers data that is not available by scraping based on the success factors curated.
Our first idea was to scrape a website that holds information about many startups in one place, such as AngelList or Dealsroom. We aimed to scrape details like:
- State, Location
- Diversity of founder
- Age of founder
- Years of industry/working experience of the founder
- Number of previous jobs of founder
- Proactive customer approach
- Relevant social network
- Startup network/Graph
- Mode of the startup operations, and many more
Building a scraping tool for a website that already holds information on the majority of early-stage startups would make the scraper easier to develop, as the code would not have to handle the differing page structures of individual startup websites. Another approach was to develop a tool based on Named Entity Recognition that could identify similarities between the HTML code of different websites and scrape the required information.
Web scraping and difficulties faced in data available
Scraping data from AngelList or Dealsroom turned out to be a challenge because their legal terms and conditions forbid scraping. Therefore, instead of developing a single web-scraper tool, we scraped different websites such as LinkedIn, EU-startups, Startuptracker, 500.co, track.in and others using Selenium and Beautiful Soup, and stored the results in separate CSV files. However, the websites we scraped lacked the information necessary for determining the success of a startup, and merging the CSV files was not feasible because they had too few features in common to form a standardized dataset for building a machine learning model.
For example, we scraped EU-startups, which returned 9k+ European startups with 9 features, including:
- articles (articles about the startup on the internet)
Similar features were scraped from other websites as well, but they are not enough to determine the probability of success of a startup.
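To make the per-site scraping step concrete, here is a minimal sketch of turning a listing page into CSV rows. The team used Selenium and Beautiful Soup; this sketch swaps in the `html.parser` module from the Python standard library so that it runs without extra dependencies. The markup (`div class="startup"`, `span` fields), the class `StartupListingParser`, and the function `scrape_to_csv` are hypothetical and do not reflect any of the real sites.

```python
import csv
import io
from html.parser import HTMLParser


class StartupListingParser(HTMLParser):
    """Collect one dict per <div class="startup"> block (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished records
        self._current = None    # record currently being filled
        self._field = None      # field whose text is being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "div" and css_class == "startup":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class:
            # Assumption: the span's class attribute names the field.
            self._field = css_class

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def scrape_to_csv(html, fieldnames):
    """Parse the page and return (records, CSV text); absent fields stay blank."""
    parser = StartupListingParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows, buffer.getvalue()


SAMPLE_HTML = """
<div class="startup"><span class="name">Acme Solar</span><span class="country">Kenya</span></div>
<div class="startup"><span class="name">GreenLoop</span></div>
"""

rows, csv_text = scrape_to_csv(SAMPLE_HTML, ["name", "country"])
print(csv_text)
```

The second record has no country, which mirrors the problem described above: each site exposes a different subset of fields, so the resulting CSV files end up sparse and hard to merge.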
Apart from web scraping, we focused on building a questionnaire that would cover information that cannot be collected by a scraping bot. Our aim was to circulate this questionnaire among the founding members of the startups to get the required information.
After having gathered the necessary data through all feasible options, we had our ultimate list of collectible and quantifiable non-financial success factors and metrics!
Data Cleaning and misleading values
No matter how many charts we create or how sophisticated the algorithms are, the results are always misleading without data cleaning.
“I kind of have to be a master of cleaning, extracting, and trusting my data before I do anything with it.” ~ Scott Nicholson
Our goal was to turn data into information and information into insight by cleaning the collected data and merging common factors from different datasets for further model building and training.
The sub-tasks were as follows:
- Exploring and inspecting the collected datasets.
- Dealing with missing/NaN values.
- Merging/standardization of the different available datasets.
First, know your data! We explored it and gathered basic insights.
Missing values, represented by ‘?’, ‘N/A’, 0, or just a blank cell, can affect model building and cause misleading results. So how did we deal with these missing values?
- First, we checked with the source of the data whether it was possible to find out what the actual value should be.
- Where values were still missing and few in number, we dropped them. We also made sure that the data we removed had little to no impact on the success-defining variable.
- Lastly, we filled the remaining NaN values with the mode, the mean, or “Others”, depending on the type of column. Numerical columns were filled with the mode, whereas textual columns such as Industry were filled with “Others” when a value was null.
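The fill rules in the list above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project code: the helper names `normalise_missing` and `fill_column` and the sample columns are hypothetical, and a value of 0 is deliberately left alone here because whether 0 means "missing" has to be decided per column.

```python
from statistics import mode

# Placeholder markers treated as missing. Assumption: 0 is NOT included,
# since it can be a legitimate value in many columns.
MISSING_MARKERS = {"?", "N/A", "", None}


def normalise_missing(value):
    """Map the placeholder markers to None; leave real values untouched."""
    if isinstance(value, str):
        value = value.strip()
    return None if value in MISSING_MARKERS else value


def fill_column(values, kind):
    """Fill gaps with the mode (numeric columns) or "Others" (text columns)."""
    cleaned = [normalise_missing(v) for v in values]
    observed = [v for v in cleaned if v is not None]
    # statistics.mode raises StatisticsError if a numeric column is entirely empty.
    fill = mode(observed) if kind == "numeric" else "Others"
    return [fill if v is None else v for v in cleaned]


employees = fill_column([10, "?", 25, 10, "N/A"], kind="numeric")
industry = fill_column(["Health", "", "Fintech", None], kind="text")
print(employees)
print(industry)
```

Normalising every placeholder to a single missing marker first keeps the fill logic independent of how each source happened to encode its gaps.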
Data is usually collected from different datasets by different people and stored in different formats. Converting it to a common, consistent standard of expression with common variables enables users to process and analyze it.
We also did some data cleaning using Tableau Prep, including:
- Removing Fields that did not contain values
- Replacing NULL with 0 for a column with Boolean data type
- Fixing the ‘Amount to Raise’ and ‘Status Funding’ columns, where the data from the two columns had been mixed together.
The data still needed some work!
So, we removed variables that carried little information, judged by their ratio of missing values, and then imputed the remaining missing values by sampling from the data distribution. Based on the topological structure of the data, we used t-SNE to reduce it to a 2D dataset so as to work only with significant synthetic features. Each company now had one row instead of multiple rows.
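The first two steps of this paragraph, removing variables by their share of missing values and filling the remaining gaps by sampling from the observed values, might look like the sketch below. The 0.5 threshold, the function names, and the toy table are assumptions for illustration; the article does not state the cutoff the team used, and the t-SNE step is not shown.

```python
import random


def missing_ratio(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def drop_sparse_columns(table, max_missing=0.5):
    """Keep only columns whose missing ratio is at most max_missing."""
    return {name: col for name, col in table.items()
            if missing_ratio(col) <= max_missing}


def impute_by_sampling(column, rng):
    """Replace each missing value with a random draw from the observed values."""
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]


table = {
    "employees": [5, None, 12, 40],      # 25% missing: kept
    "patents": [None, None, None, 2],    # 75% missing: dropped
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
kept = drop_sparse_columns(table, max_missing=0.5)
imputed = {name: impute_by_sampling(col, rng) for name, col in kept.items()}
print(imputed)
```

Sampling from the observed values preserves the shape of each column's distribution, whereas filling every gap with a single value such as the mean would concentrate mass at one point.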
We also added a column with the number of founders and another showing how long after finishing their studies the founders started their companies.
Moving forward with the idea of merging the datasets, and having gathered basic insights and the list of features dropped from each dataset, we created a summary of the datasets we had, based on:
- The success factors for which information was available in at least one dataset, and the definitions of success
- Datasets that have at least 5 features in common, even though those shared features are not necessarily the success factors we were looking at.
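Finding which datasets share enough columns to be merged can be sketched as a pairwise set intersection. The dataset names, the column lists, and the function `shared_features` are hypothetical; only the threshold of 5 common features comes from the text.

```python
from itertools import combinations


def shared_features(datasets, min_shared=5):
    """Return {(name_a, name_b): sorted shared columns} for pairs meeting the threshold."""
    result = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(datasets.items(), 2):
        common = sorted(set(cols_a) & set(cols_b))
        if len(common) >= min_shared:
            result[(name_a, name_b)] = common
    return result


# Hypothetical column inventories for three scraped sources.
datasets = {
    "site_a": ["name", "country", "industry", "founded", "employees", "funding"],
    "site_b": ["name", "country", "industry", "founded", "employees", "website"],
    "site_c": ["name", "articles", "tags"],
}
mergeable = shared_features(datasets)
print(mergeable)
```

In practice the column names would first need to be standardized (case, spelling, units), otherwise equivalent features from different sources would not match.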
We finally created a dataset comprising all the ideas from different startups in different datasets, along with the success metrics that were available in each dataset. Eventually, the data was ready for further use!
Exploratory Data Analysis
Exploratory Data Analysis (EDA) was a critical process in our project, as it defines the story that can be told with our data. This step is used to perform an initial investigation, summarise the findings, create an overview of feature statistics, find patterns, and detect anomalies in the data. One of the most important functions of this step is to find correlations between features and proposed target functions.
Feature engineering was performed to consolidate the output of 2-3 features into a single feature. For example, revenue and product usage data were available to us, so we created percentage-increase features for revenue and users to consolidate them into meaningful results. The funding feature was converted into a categorical variable, because the status of funding mattered more than the amount: it is difficult to set thresholds for successful funding across startup size, age, and industry. The incorporation date was transformed into a year for the sake of relevance, with NaN values filled using other dates and modes. New features year_incorporated and year_segment were introduced. It was important to classify the year segments into before 2010 and after 2010 due to drastic changes in startup dynamics.
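As a rough sketch of these transformations, the snippet below derives a percentage increase, a funding status, and the year_incorporated and year_segment features from one record. Apart from those two feature names, everything is an assumption: the input keys, the `funding_status` labels, deriving the status from the amount, and the choice to group 2010 itself with the later segment.

```python
from datetime import date


def pct_increase(old, new):
    """Percentage change from old to new; None when old is missing or zero."""
    if not old:
        return None
    return (new - old) / old * 100


def year_segment(year, cutoff=2010):
    # Assumption: the cutoff year itself belongs to the later segment.
    return f"before_{cutoff}" if year < cutoff else f"{cutoff}_or_later"


def engineer(record):
    """Build the consolidated features for one startup record (hypothetical keys)."""
    year = record["incorporated"].year
    return {
        "revenue_growth_pct": pct_increase(record["revenue_prev"], record["revenue_now"]),
        # Categorical status instead of the raw amount.
        "funding_status": "funded" if record["funding_amount"] else "not_funded",
        "year_incorporated": year,
        "year_segment": year_segment(year),
    }


features = engineer({
    "incorporated": date(2008, 6, 1),
    "revenue_prev": 200_000,
    "revenue_now": 250_000,
    "funding_amount": 0,
})
print(features)
```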
The visualization stage of this task yielded many meaningful insights. A map visualization revealed that Latin American countries, along with South Africa, are home to the largest number of startups. Few startups are located in other African countries, which is alarming given the drastic community development required in these regions.
Another visualization revealed that Consumer Services, Health and Wellness, and Financial Services were the top 3 industries among startups. Other important insights covered revenue growth, user growth, paying user growth, and growth in the number of transactions, segmented by country and industry.
The model building and deployment process led to the identification of three (3) high-performing models. Four (4) metrics were used to assess the performance of those models:
- Accuracy: Ideal models are expected to give accuracy over 80%.
- Less Sensitivity to Overfitting: This metric expresses the ability of the model to generalize, independently of some patterns inside the data. This metric is assessed by the difference between in-train and in-test accuracy. Ideal models are expected to have mostly similar in-train and in-test accuracies.
- Class Disentanglement Flexibility: This metric expresses the ability of the model to identify each class, that is, to assign observations to the corresponding class. It’s assessed by the harmonic mean of precision and recall (F1-score).
- Ease and Speed in Deployment: This metric expresses the algorithmic complexity of the model and the size of its weights.
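The first three metrics in the list above can be computed directly from labels and predictions. The sketch below is a dependency-free illustration with made-up labels; the function names are hypothetical, and in practice a library implementation would normally be used.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def overfitting_gap(train_acc, test_acc):
    """In-train minus in-test accuracy; values near zero suggest good generalization."""
    return train_acc - test_acc


# Toy labels, purely for illustration.
y_train, pred_train = [1, 0, 1, 0], [1, 0, 1, 0]
y_test, pred_test = [1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 0, 0, 1, 1, 0]

train_acc = accuracy(y_train, pred_train)
test_acc = accuracy(y_test, pred_test)
gap = overfitting_gap(train_acc, test_acc)
f1 = f1_score(y_test, pred_test)
print(test_acc, gap, f1)
```

A model that clears the 80% accuracy bar on training data but shows a large gap on test data would fail the second criterion even though it passes the first.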
Of the machine learning models trained on the data (Logistic Regression, KNN, Random Forest Classifier, Gaussian Naive Bayes, SVM, and XGBoost), XGBoost gave the best results based on its confusion matrix, log loss, AUC, precision, recall, and F1-score.
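For readers unfamiliar with the probability-based scores named above, here is a small sketch of log loss, AUC, and a binary confusion matrix written in plain Python. The labels and probabilities are invented, the 0.5 decision threshold is an assumption, and this is not the evaluation code used in the project.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold

loss = log_loss(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred)
print(loss, auc, cm)
```

Log loss and AUC look at the predicted probabilities rather than the thresholded labels, which is why they can separate two models that have the same accuracy.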
We also trained a Keras neural network with 9 dense layers on the same data for 1000 epochs, reaching an accuracy of 0.92097.
Deployment on Mia Marketplace
Mia Marketplace is a no-code, easy-to-use app builder for machine learning models.
The models that gave the best results were finally deployed on Mia Marketplace.
To learn more about Mia and how it works, head over to this article.
Given the ever-increasing market of environment-friendly investments, now more than ever, the success of impact startups needs to be predicted by factors other than financial so that investors can provide the right funding and help them grow to their true potential.
Following a data science pipeline and using non-financial features including but not limited to Number of Employees, Age of Company, Revenue, Business Stage, Industry, Location, and Number of Competitors, the team was able to build models that generate a prediction score (%) for a particular startup, indicating whether it is likely to be successful. Total Funding was the target variable used in the chosen models for this project.
Scraping relevant information via URL is currently not possible in Mia, but it could become possible in the future to aid the data collection process. This could eventually lead to a scenario in which one could input thousands of URLs into the model and obtain a set number of startups with the highest probability of success.