`about_cboosting.Rmd`

Instead of getting huge predictive power, model-based (or component-wise) boosting aggregates statistical models to maintain interpretability. This is done by restricting the used base-learners to be linear in the parameters. This is important in terms of parameter estimation. For instance, having a base-learner \(b_j(x, \theta^{[m]})\) of a feature \(x\) with a corresponding parameter vector \(\theta^{[m]}\) and another base-learner \(b_j\) of the same type but with a different parameter vector \(\theta^{[m^\prime]}\), then it is possible, due to linearity, to combine those two learners to a new one, again, of the same type: \[
b_j(x, \theta^{[m]}) + b_j(x, \theta^{[m^\prime]}) = b_j(x, \theta^{[m]} + \theta^{[m^\prime]})
\] So, instead of boosting trees like `xgboost`

, model-based boosting uses a selection of important base-learners, each with estimated parameters.

The ordinary way of finding good base-learners is done by a greedy search. Therefore, in each iteration, all base-learners are fitted to the so-called pseudo residuals in the current iteration that act as a kind of error of the actual model. A new base-learner is then selected by choosing the base-learner with the smallest empirical risk \(\mathcal{R}_\text{emp}\).

The illustration above uses three base-learners \(b_1\), \(b_2\), and \(b_3\). You can think of each base-learner as a wrapper around a feature or category which represents the effect of that feature on the target variable. Therefore, if a base-learner is selected more frequently than another base-learner, it indicates that this base-learner (or feature) is more important. To get a sparser model in terms of selected features there is also a learning rate \(\beta\) to shrink the parameter in each iteration. The learning rate corresponds to the step size used in gradient descent. The three models within each iteration of the illustration are: \[ \begin{align} \text{Iteration 1:} \ &\hat{f}^{[1]}(x) = f_0 + \beta b_3(x_3, \theta^{[1]}) \\ \text{Iteration 2:} \ &\hat{f}^{[2]}(x) = f_0 + \beta b_3(x_3, \theta^{[1]}) + \beta b_3(x_3, \theta^{[2]}) \\ \text{Iteration 3:} \ &\hat{f}^{[3]}(x) = f_0 + \beta b_3(x_3, \theta^{[1]}) + \beta b_3(x_3, \theta^{[2]}) + \beta b_1(x_1, \theta^{[3]}) \end{align} \] Using the linearity of the base-learners it is possible to aggregate the \(b_3\) base-learners: \[ \hat{f}^{[3]}(x) = f_0 + \beta \left( b_3(x_3, \theta^{[1]} + \theta^{[2]}) + b_1(x_1, \theta^{[3]}) \right) \]

This is a very simple example but it displays the main strength of model-based boosting such as an inherent variable selection very well. Since the fitting is done iteratively on the single base-learners, model-based boosting is also a very efficient approach for data situations where \(p \gg n\), which is learning in high-dimensional feature spaces. For instance, genomic data where the number of features \(p\) could be several thousand but the number of observations are not more than a few hundred.