AI Insights

A Guide to Tree-based Algorithms in Machine Learning (Including Real Examples)

March 15, 2022



Introduction

Imagine a set of machine learning algorithms that can achieve results competitive with their more sophisticated counterparts without relying on a large cluster of GPU-accelerated machines, that can be match-winners at major data science competitions – and all of this without losing out on interpretability. Sounds ideal, doesn't it?

Tree-based algorithms can empower predictive models with all the above-mentioned properties and are also widely used in various industries to provide productionised solutions.

What are tree-based machine learning algorithms?

Tree-based algorithms are supervised learning models that address classification or regression problems by constructing a tree-like structure to make predictions. 

The underlying idea of these algorithms is simple: come up with a series of if-else conditions to create decision boundaries, then use an aggregation method (like mean or mode) on the values in a decision region to predict the outcome – a target value in the case of regression and a target class in the case of classification. Tree-based algorithms can produce either a single tree or multiple trees, depending on the specific algorithm used.
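
To make the "series of if-else conditions" idea concrete, here is a toy, hand-written tree for a hypothetical house-price regression problem; the feature names, thresholds and prices are invented purely for illustration and are not learned from data.

```python
# A hand-written "tree" with two splits for a toy regression task.
# Features, thresholds and predicted values are illustrative only.
def predict_house_price(size_sqm: float, has_garden: bool) -> float:
    if size_sqm <= 80:          # first split: house size
        return 150_000          # mean price of the "small house" region
    else:
        if has_garden:          # second split: boolean feature
            return 320_000      # mean price of large houses with a garden
        else:
            return 260_000      # mean price of large houses without one

print(predict_house_price(70, True))    # 150000
print(predict_house_price(120, False))  # 260000
```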

The following are some of the key properties of tree-based learning that make it an important approach:

  • Flexible: Tree-based algorithms can capture non-linear relationships between the data features and output
  • Non-parametric: These methods do not make any assumptions regarding the distribution, independence, or constant variance of the underlying data they process. This is vital in applications where very little is known about the data or about which features to use to make predictions. 
  • Fewer data preprocessing requirements: Unlike distance-based methods, these algorithms do not require feature scaling, i.e. normalization or standardization of the data before it is fed to the model. 
  • High interpretability: Since the decision regions are produced based on boolean relations, these methods can be graphically visualized to gather an intuitive understanding of what the algorithm does.

Some applications of tree-based algorithms are listed below:

  • Highly utilized in fields where modeling and interpreting human behavior is the primary focus. These include marketing use cases and customer retention.
  • Go-to algorithms for medical and finance-based applications such as diagnosing diseases and ailments and fraud detection due to non-parametric approach and high interpretability. 

How do tree-based methods work?


Fig 1: Tree branches (Source)

As mentioned above, tree-based methods create decision regions by making orthogonal splits based on if-else conditions on features. But how are these splitting conditions devised? And since the process of creating decision regions can be carried out many times, how many times should we split our decision space?

These are the key questions that need to be addressed while creating trees: which features to split on and at what values, and when to stop splitting.

Features and values to split on

To find the best split, different metrics are evaluated in a top-down, greedy manner: splitting begins from a state in which all points belong to a single region, and successive splits are made such that each resulting tree has a better metric value than the previous tree. The following are some commonly used cost functions for different tasks:

1. For classification:

  • Entropy: Measures the amount of uncertainty or randomness in the data. The objective is to minimize entropy in order to achieve homogeneous regions, i.e. regions whose data points largely belong to the same class. 
  • Gini index: Measures the likelihood that a randomly selected data point would be misclassified by a particular node. 
  • Information Gain: Measures the reduction in entropy/Gini index that results from a split. Tree-based algorithms use either entropy or the Gini index as the criterion to make the most “informative” split, i.e. the split that reduces the criterion by the largest amount. 

2. For regression:

  • Residual Sum of Squares: Measures the sum of squared differences between the target value and the mean response of the decision region, over all data points in the region.

An orthogonal split is made on the feature and corresponding feature value that increase the information gain or reduce the residual sum of squares the most compared to other potential splits. This process is then repeated to perform the next best split, and so on.
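
To make these criteria concrete, the sketch below computes entropy, the Gini index, the information gain of a candidate split, and the residual sum of squares of a regression region; the labels, targets and the split point are invented for the example.

```python
import numpy as np

def entropy(labels):
    """Uncertainty in a region: -sum(p * log2(p)) over class proportions p."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Probability of misclassifying a randomly drawn point: 1 - sum(p^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, criterion=entropy):
    """Reduction in the chosen impurity criterion achieved by a split."""
    n = len(parent)
    weighted = (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)
    return criterion(parent) - weighted

def rss(targets):
    """Regression criterion: residual sum of squares around the region mean."""
    targets = np.asarray(targets, dtype=float)
    return np.sum((targets - targets.mean()) ** 2)

# A toy parent region and one candidate split of it into two child regions.
parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(information_gain(parent, left, right))   # larger means a better split
print(rss(np.array([3.0, 4.0, 5.0, 10.0])))    # smaller means a purer region
```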

Fig 2 shows the decision regions generated for a sample dataset.


Fig 2: Decision Tree and Decision Regions (source)

How many splits?

As this splitting process is repeated, the tree keeps growing and becomes more complex. At some point, the algorithm starts learning noise along with the signal present in the dataset. This results in overfitting, i.e. the model becomes so specific to the dataset it was trained on that it cannot generalize well to unseen data. To avoid this situation, a technique called pruning is used.

Pruning aims at getting rid of sections of the tree that have low predictive power. It can be done by limiting the maximum depth of the tree or by limiting the minimum number of samples per region. 

Other pruning methods such as cost complexity pruning get rid of subtrees by updating the cost function with an additional term to penalise complex trees. A large tree is first grown using recursive splitting and cost complexity pruning is then applied to the large tree to find the best sequence of subtrees.  This idea is similar to Lasso Regression, which regularizes the complexity of the model by penalizing weights. 
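
As a rough sketch of how pruning can look in practice with scikit-learn (assuming it is installed), the example below applies pre-pruning via a depth limit and a minimum leaf size, and post-pruning via the ccp_alpha cost complexity parameter; the dataset is synthetic and the chosen alpha is illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pre-pruning: cap the depth and require a minimum number of samples per leaf.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20).fit(X, y)

# Cost complexity pruning: inspect the pruning path of a fully grown tree,
# then refit with a penalty strength (ccp_alpha) chosen from that path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # illustrative choice, not tuned
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```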

Types of tree-based methods

Tree-based approaches can be classified based on the number of trees used for prediction and the order in which they are produced. Some of the important methods are listed below:

1. Decision trees

2. Ensemble methods

  • Bagging
  • Random forests
  • Boosting

Let's take a deep dive into each of these methods.

1. Decision trees

What are Decision trees?


Fig 3: Decision Tree (source)

Decision trees are tree-based structures that work with a single tree, using boolean conditions to form decision boundaries until a stopping condition is reached. They can be used for both classification and regression tasks and hence are popularly termed Classification and Regression Trees (CART). 

Fig 3 showcases an example of a decision tree that determines the type of contact lens a person should wear.

How do they work?

Decision trees work on the principle described in the previous section. The brief algorithm below describes the steps to grow a decision tree and is agnostic to the type of problem at hand (classification or regression):

  • All training instances are assigned to the root node, i.e. to a single predictor space. 
  • For each feature in the dataset, consider dividing the predictor space into decision regions at each candidate feature value. 
  • Calculate the cost function (e.g. for classification: information gain, for regression: residual sum of squares) for each split performed.
  • Identify the feature and the corresponding feature value which leads to the best split (e.g. for classification: maximum information gain, for regression: minimum residual sum of squares). This feature-feature value combination constitutes the splitting condition.
  • Partition all instances into decision regions based on the splitting condition.
  • For each decision region, continue this process until a stopping condition is reached.

Once the decision tree has been constructed, it can be used to evaluate new instances. A new instance is first placed in the corresponding decision region based on the tree logic, and the aggregated measure of the training target values in that region (e.g. for classification: the mode of the class values, for regression: the mean of the outcome values) is predicted for the instance.
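
Putting these steps together, here is a minimal decision tree sketch using scikit-learn (assuming it is available); the dataset is synthetic and the hyperparameters are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for a real classification problem.
X, y = make_classification(n_samples=300, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grow the tree using entropy as the splitting criterion, with a depth
# limit acting as the stopping condition.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# New instances are routed to a decision region and assigned its majority class.
print(tree.predict(X_test[:5]))
print(tree.score(X_test, y_test))

# The learned if-else structure can be printed, reflecting interpretability.
print(export_text(tree, feature_names=[f"feature_{i}" for i in range(4)]))
```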

Advantages and Disadvantages

Advantages:

  • Easy to understand, interpret and visualize as they mimic a human decision-making process
  • Can handle non-linear relationships between features and target unlike other easy-to-interpret models like linear regression
  • Do not require data to be scaled or normalized before being fed to the model
  • Can work with categorical and numerical data unlike regression models where categorical data needs to be one-hot encoded

Disadvantages:

  • Generally have lower predictive accuracy than ensemble methods and are hence called weak learners
  • Prone to overfitting
  • A slight change in the dataset can lead to a significant change in the tree structure

Examples

Due to their interpretability, decision trees have been utilised in a wide range of applications:

  • Customer relationship management
  • Fraud detection
  • Fault diagnosis in engineering
  • Energy consumption analysis 
  • Healthcare management

Fig 4: Decision tree for Hepatitis B prediction (source)

2. Ensemble Methods

Though pruning helps to avoid overfitting, a single decision tree has limited predictive power. To address this, multiple decision trees can be built and their predictions combined to improve predictive power. This method of combining multiple trees to make predictions is called ensembling, and it is based on the idea of the wisdom of crowds – a crowd is wiser than an individual.

We will now examine some popular ensembling methods: Bagging (Bootstrap Aggregation), Random Forests and Boosting.

2.1. Bagging

What is Bagging?

Bagging, or Bootstrap Aggregation, is a technique that constructs multiple decision trees in parallel, each trained on a subset of the data. Each subset is obtained by randomly sampling instances with replacement, a process known as bootstrapping. Finally, the predictions obtained from all decision trees are aggregated into a single prediction. 

Bagging helps in reducing the high variance observed when training a single decision tree, because aggregating models trained on multiple bootstrapped training datasets reduces variance.

How does it work?

Fig 5: Bagging Algorithm (source)

The following are the steps to perform bagging:

  • Construct M data subsets of instances bootstrapped from the training dataset
  • Train decision tree models on each of the M data subsets
  • Aggregate the results obtained from each of the decision tree models using an aggregation method (for classification: majority voting, for regression: averaging)

To evaluate a new instance, the decision region in which the instance lies is identified for each of the decision trees, and an aggregation method over the training target values in that region is used to obtain a prediction from each tree. A second level of aggregation then combines these individual predictions into a collective prediction.
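
A minimal bagging sketch with scikit-learn is shown below (assuming it is installed); BaggingClassifier bootstraps the training set, fits one tree per sample and aggregates their votes, and all parameter values are only illustrative. Note that older scikit-learn versions name the estimator argument base_estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# M = 50 trees, each trained on a bootstrap sample drawn with replacement.
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator in older versions
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagger.fit(X_train, y_train)

# Each tree votes and the majority class is returned for classification.
print(bagger.score(X_test, y_test))
```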

Advantages and Disadvantages

Advantages:

  • Reduces variance observed while using a single decision tree and prevents overfitting
  • Extends to the advantages observed for decision trees
  • Provides information on variable importance. It does so by recording the total reduction in the cost function due to splits on a given feature, averaged across all the decision trees
  • Helps in feature selection
  • Improves predictive accuracy as compared to decision trees

Disadvantages:

  • Not so easy to understand or interpret due to aggregation of individual predictions
  • Does not address model bias or underfitting
  • Correlation still exists between decision trees
  • More computational resources are required to train this model
Examples

The following are some of the prominent applications where bagging has been used:

  • Predicting onset of diabetes (source)
  • Network intrusion detection systems (source)
  • Assess loan default risk (source)
  • Mapping wetlands type (source)

2.2. Random Forest

What is Random Forest?

Random Forest builds upon the Bootstrap Aggregation algorithm by adding a step to decorrelate the decision trees. Along with bootstrapping instances for each decision tree, Random Forest also chooses a random subset of features from the feature set to train each decision tree. 

Doing this lowers the correlation between the decision trees. If every decision tree uses all the features, a strong predictor will influence the way splits occur in most of the trees, making them look similar, and aggregating correlated trees leads to a smaller reduction in variance. This situation is avoided when a different subset of features is considered for training each decision tree.

How do they work?

Fig 6: Random Forest Algorithm

The following are the steps to implement Random Forest:

  • Construct M data subsets using instances bootstrapped from the training dataset
  • Train a decision tree model on each of the M data subsets, using N randomly sampled features from the feature set
  • Aggregate the results obtained from each of the decision tree models using an aggregation method (for classification: majority voting, for regression: averaging)
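
The steps above can be sketched with scikit-learn's RandomForestClassifier (assuming scikit-learn is installed); max_features controls the size of the random feature subset considered when splitting, and the values shown are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 bootstrapped trees; each split considers only a random subset of features,
# which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))
print(forest.feature_importances_)  # the variable-importance measure discussed earlier
```
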
Advantages and Disadvantages

Advantages:

  • Extends to the advantages of Bagging algorithm
  • Reduces variance further as compared to Bagging 

Disadvantages:

  • Takes longer to train
  • Less interpretable as compared to decision trees
  • Does not reduce bias
Examples

Random Forest has been utilised in the following applications:

  • Predicting climate change and forced displacement (source)
  • Gene expression classification (source)
  • Mapping crop types (source)

Fig 7: Mapping crop types using Random Forest

2.3. Boosting

What is Boosting?

Another ensemble approach to improving the performance of a decision tree works on the idea of growing trees sequentially instead of in parallel, as we saw for Random Forests. This method is known as Boosting.

The fundamental idea behind boosting is to make each tree learn from the errors committed by its predecessors, and it does so by stacking trees in a sequential manner. 

This overcomes one of the drawbacks observed in Random Forest, i.e. that it does not reduce bias, and ultimately improves predictive accuracy. 

There are different types of boosting approaches. Some of the popular ones include AdaBoost (Adaptive Boosting), Gradient Boosting and XGBoost (Extreme Gradient Boosting). These methods differ in the way the errors from the predecessor trees are addressed. 

How does it work?

Fig 8: AdaBoost Boosting Algorithm (source)

AdaBoost

AdaBoost is mainly used for classification tasks. The method assigns weights to the training instances. In each iteration of training a decision tree, it assesses the performance of the weak learner, identifies misclassified training instances and adjusts their weights so that the next decision trees pay more attention to classifying them correctly. This process is repeated until the stopping condition is reached.
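
A hedged AdaBoost sketch using scikit-learn's AdaBoostClassifier with shallow trees ("stumps") as the weak learners; the settings are illustrative, and older scikit-learn versions name the estimator argument base_estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Depth-1 trees are boosted sequentially; misclassified instances are
# re-weighted so that later trees focus on getting them right.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator in older versions
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X, y)
print(ada.score(X, y))
```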

Gradient Boosting

Gradient Boosting also works on the same principle as AdaBoost but differs in the way it operates on the “weakness” of the previous learners. Instead of adjusting the weights of training instances, it trains the successive decision trees on the residual error of the previous trees. This process is repeated until a stopping condition is reached. This method can be used for both classification and regression tasks.
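
A brief gradient boosting sketch for regression with scikit-learn's GradientBoostingRegressor; each new tree is fit to the residual errors of the ensemble built so far, and the hyperparameters are purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Each successive shallow tree is fit to the current residuals and its
# contribution is scaled by the learning rate before being added.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3,
                                random_state=0)
gbr.fit(X, y)
print(gbr.predict(X[:3]))
```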

XGBoost

XGBoost is an optimised version of Gradient Boosting that leverages distributed training across multiple CPU cores. This allows training to occur in parallel and speeds up computation. It also includes regularization terms in the cost function, which improves model generalization and addresses overfitting.
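
A comparable sketch with the third-party xgboost package (assuming it is installed); reg_lambda adds the L2 regularization term mentioned above and n_jobs spreads training across CPU cores, with all values chosen only for illustration.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier  # requires the separate xgboost package

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=1.0,   # L2 regularization term in the objective
    n_jobs=-1,        # use all available CPU cores
    random_state=0,
)
xgb.fit(X, y)
print(xgb.predict(X[:5]))
```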

Advantages and Disadvantages

Advantages:

  • Requires as few data preprocessing steps as decision trees
  • Better predictive accuracy than Random Forest in many cases
  • Deals with model bias or underfitting
  • XGBoost improves computational efficiency by parallelising training task

Disadvantages:

  • Can lose out on reducing variance when a large ensemble is used, which leads to poor generalization.
  • Boosting methods other than XGBoost are very computationally expensive.
Examples

Boosting has some major commercial applications, few of which are listed below:

  • Rainfall prediction using weather balloon data (source)
  • Search engine page ranking (source)
  • Image retrieval (source)

Fig 9: Race Car images retrieved using a Boosting model (source)

Conclusion

Tree-based algorithms are an important addition to the data science toolkit. With their competitive predictive power, interpretability and ease of deployment, these methods have a wide application base across a variety of industries.
