Gradient Boosted Machines have gained a lot of traction in recent years due to their accuracy and their flexibility (in terms of hyperparameters and loss function choices). They have also gained popularity in online competitions because they perform very well on tabular data, require less data preprocessing, and can handle missing values. There are many articles on the web about XGBoost, somewhat fewer about LightGBM and CatBoost, and even some that compare them in pairs, but few posts compare the three together in a comprehensive way. This article aims to do exactly that. We will go over a quick introduction to Gradient Boosted Machines and the main pros and cons of the models mentioned, point out the fundamental differences between XGBoost, CatBoost, and LightGBM, and give a summary chart to make it easy for our readers to decide which model adapts better to their project.
When you are looking for a model for your project, you may want to focus on the quality and speed of the algorithm, whether it handles categorical data, its performance on your type of data, and so on. In statistics and machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. The idea is to average out different models’ mistakes to reduce the risk of overfitting while maintaining strong prediction performance. In particular, tree ensemble models consist of a set of classification and regression trees (CART).
In regression, the overall prediction is typically the mean of the individual tree predictions, whereas in classification, probabilities are averaged across all trees and the class with the highest average probability is the final predicted class. There are two main classes of ensemble learning methods: *bagging* and *boosting*.
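The two aggregation rules above can be sketched in a few lines of numpy. All the numbers below are made-up per-tree outputs, purely for illustration:

```python
import numpy as np

# Hypothetical outputs of a 3-tree ensemble on 4 samples.
# Regression: each tree predicts a real value; the ensemble averages them.
reg_tree_preds = np.array([
    [2.0, 3.0, 1.0, 4.0],   # tree 1
    [2.2, 2.8, 1.4, 3.8],   # tree 2
    [1.8, 3.2, 0.9, 4.2],   # tree 3
])
reg_ensemble = reg_tree_preds.mean(axis=0)

# Classification: average class probabilities across trees, then take
# the class with the highest mean probability.
clf_tree_probs = np.array([          # shape: (trees, samples, classes)
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.6, 0.4], [0.5, 0.5]],
    [[0.8, 0.2], [0.3, 0.7]],
])
mean_probs = clf_tree_probs.mean(axis=0)
predicted_class = mean_probs.argmax(axis=1)
```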
Gradient Boosted Machines
The main component of gradient-boosted machines is decision trees. Gradient boosting decision tree (GBDT) is a widely used machine learning algorithm, thanks to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification, click prediction, and learning to rank. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT has faced new challenges, especially in the tradeoff between accuracy and efficiency.
Conventional implementations of GBDT need to scan all the data instances, for every feature, to estimate the information gain of all the possible split points. Their computational complexity is therefore proportional to both the number of features and the number of instances, which makes these implementations very time-consuming when handling big data.
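To see where that cost comes from, here is a deliberately naive sketch of the exhaustive split search (plain Python, toy data): for every feature, every instance's value is tried as a candidate split point, so the number of candidates alone grows with features × instances.

```python
# Naive exhaustive split search, for illustration only.
def best_split(X, y):
    best = (None, None, float("inf"))   # (feature, threshold, impurity)
    n_features = len(X[0])
    for f in range(n_features):              # every feature...
        for row in X:                        # ...every instance is a candidate
            threshold = row[f]
            left = [yi for xi, yi in zip(X, y) if xi[f] <= threshold]
            right = [yi for xi, yi in zip(X, y) if xi[f] > threshold]
            if not left or not right:
                continue
            # squared-error impurity as a stand-in for information gain
            score = sum((yi - sum(left) / len(left)) ** 2 for yi in left) + \
                    sum((yi - sum(right) / len(right)) ** 2 for yi in right)
            if score < best[2]:
                best = (f, threshold, score)
    return best

X = [[1.0], [2.0], [3.0], [10.0]]
y = [1.0, 1.0, 1.0, 5.0]
feature, threshold, score = best_split(X, y)   # splits off the outlier at 10.0
```

The libraries in this article avoid exactly this pattern, each in its own way (pre-sorting, histograms, sampling).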
The algorithms presented in this article differ from one another in their implementation of the boosted trees algorithm and in their technical capabilities and limitations. XGBoost was the first to try improving GBM’s training time, followed by LightGBM and CatBoost, each with its own techniques.
You can also take a look at this interesting article about decision trees.
It is also important to know that boosting is a sequential technique that works on the principle of ensembling. To understand gradient boosting, it helps to start with the AdaBoost algorithm. AdaBoost begins by training a decision tree in which each observation is assigned an equal weight. After evaluating the first tree, it increases the weights of the observations that are difficult to classify and lowers the weights of those that are easy to classify. The second tree is then grown on this reweighted data, so the new model is Tree1 + Tree2. The algorithm then computes the classification error of this two-tree ensemble and grows a third tree to predict the revised residuals.
This process is repeated for a specified number of iterations. In this way, subsequent trees help to classify the observations that the previous trees handled poorly. The prediction of the final ensemble model is the weighted sum of the predictions made by the individual tree models.
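The reweighting loop described above can be sketched as follows. This is an AdaBoost-style toy implementation on synthetic data, not any library's actual code; decision stumps come from scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy, roughly separable data

weights = np.full(len(y), 1 / len(y))       # start with equal weights
stumps, alphas = [], []
for _ in range(5):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = weights[pred != y].sum()
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # this tree's vote weight
    # misclassified points get heavier, correctly classified ones lighter
    weights *= np.exp(np.where(pred == y, -alpha, alpha))
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all stumps
scores = sum(a * (2 * s.predict(X) - 1) for s, a in zip(stumps, alphas))
ensemble_pred = (scores > 0).astype(int)
accuracy = (ensemble_pred == y).mean()
```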
Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
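For squared-error loss, "fitting the gradient" reduces to fitting the residuals, which makes the idea easy to show in a few lines. A bare-bones sketch (scikit-learn trees, synthetic data; real libraries add regularization, shrinkage schedules, and much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # start from a constant model
trees = []
for _ in range(50):
    residual = y - prediction               # negative gradient of 0.5*(y - F)^2
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                   # each new tree fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - prediction) ** 2)        # should shrink as trees are added
```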
XGBoost
XGBoost stands for eXtreme Gradient Boosting, and it is the most commonly used of these models for prediction problems. It has been around for a long time, and a large community of support and research has grown up around it. Many academic pages and courses recommend this algorithm, and a large number of tutorials cover its use and potential.
Its parallel training is slow compared to the other algorithms. It also gets slower and consumes a lot of memory when handling very large datasets, although there are parameters to mitigate this, such as colsample_bytree, subsample, and n_estimators. Despite this, its accuracy holds up on both large and small datasets.
Regarding its coverage, the model can be used with many kinds of datasets, including datasets with null values, but it does not support categorical features. This can be worked around with target encoding or one-hot encoding, but these have to be applied manually because the model does not provide them.
Like the other models in this article, XGBoost can solve multiclass classification problems. To do so, you need to set two parameters: num_class and objective. In contrast with CatBoost and LightGBM, for multilabel classification you need to combine it with other tools, as the author does in this article.
CatBoost
CatBoost stands for Category Boosting, and it is the newest of the open-source gradient boosting libraries we are discussing here.
In CatBoost, the trees are symmetric (balanced), which means that the same splitting condition is used across all nodes at the same depth of the tree. At each level, a list of candidate feature-split pairs is evaluated, and the split that results in the smallest penalty (lowest loss) across those nodes is the one selected. The benefits of balanced trees include faster computation and controlled overfitting.
When we work with decision trees, numerical features are easy to handle because we can split them into “greater than something” and “smaller than something”, but categorical features are not so easy. The good thing about CatBoost is that it handles categorical data out of the box: unlike the other libraries, it accepts both numerical and categorical features, and it handles the categorical ones really well.
Something good that comes with the experience of working with CatBoost is that its default parameters are really good. It also uses a specific kind of tree that makes the resulting quality less sensitive to parameter changes. That is important and useful because you don’t have to spend time tuning parameters and can focus on analyzing the data and doing feature engineering.
CatBoost also comes with model analyzing tools that help you to understand why your model is giving you a particular prediction.
First, we need to understand that everything depends on the dataset. Regarding quality, without going into details, there are many public binary classification datasets on which CatBoost achieves higher quality than XGBoost and LightGBM. With respect to learning speed, on large dense datasets with many features CatBoost can be the fastest by far, while on small datasets it can be the slowest on CPU. On GPU, however, CatBoost is the best of the existing libraries. This is important to consider, because GPUs are extremely useful for gradient boosting: you can get speed-ups of up to 50 times, and up to 100 times on multi-classification problems.
In terms of prediction time, CatBoost is 30 to 60 times faster than XGBoost and LightGBM. (Some applications at Yandex even use CatBoost on GPU.)
Sadly, it is a new library, released in 2017, so the community is still small, there are not many posts about it, and the documentation is quite difficult to read. But it is a useful tool (in future articles we will cover a case where CatBoost was an important component of an ensemble for a Kaggle competition).
LightGBM
LightGBM, short for Light Gradient Boosting Machine, is also a decision-tree-based model, but in this case, unlike CatBoost, its trees are asymmetric: the tree grows leaf-wise (best leaf first), while XGBoost grows level-wise (one depth at a time), resulting in smaller and faster models. Unfortunately, like CatBoost, LightGBM has a small community, which makes it a bit harder to work with or to use advanced features compared to XGBoost.
This model uses histogram-based algorithms, which bucket continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage: calculating the gain for each split becomes cheaper, histogram subtraction gives a further speed-up, communication costs for distributed learning shrink, and continuous values are replaced by discrete bins.
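The core idea of the bucketing can be illustrated in numpy. This is a quantile-based sketch, not LightGBM's actual binning algorithm (which is greedier and handles sparse values specially); 255 is LightGBM's default `max_bin`:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)     # a continuous feature

max_bin = 255
# quantile-based bin edges, so each bin holds roughly the same count
edges = np.quantile(feature, np.linspace(0, 1, max_bin + 1))
bins = np.searchsorted(edges[1:-1], feature)   # map each value to a bin id

# Split finding now scans at most max_bin candidate thresholds per feature,
# instead of every distinct value in the raw column.
n_candidates = len(np.unique(bins))
```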
LightGBM supports multiple applications, including:
- regression, the objective function is L2 loss
- binary classification, the objective function is logloss
- cross-entropy, the objective function is logloss and supports training on non-binary labels
- LambdaRank, the objective function is LambdaRank with NDCG
Regarding categorical feature support, LightGBM offers good accuracy with integer-encoded categorical features. It applies Fisher’s method to find the optimal split on a categorical feature by partitioning its categories into two subsets: the idea is to sort the histogram of the categorical feature according to its accumulated values (sum_gradient / sum_hessian) and then find the best split on the sorted histogram. This often performs better than one-hot encoding.
This model also handles missing values by default, using NA (NaN) to represent them. To additionally treat zeros as missing, set the parameter zero_as_missing to true.
The table below is a summary of the differences between the characteristics of the three models: