Before Starting

  • Read the use-case to get to know how to define a Compboost object using the R6 interface

Data: Titanic Passenger Survival Data Set

We use the titanic dataset with binary classification on Survived. First of all we store the train and test data into two data frames and remove all rows that contains missing values (NAs):

# Store train and test data:
df = na.omit(titanic::titanic_train)
df$Survived = factor(df$Survived, labels = c("no", "yes"))

For the later stopping we split the dataset into train and test:

set.seed(123)
idx_train = sample(seq_len(nrow(df)), size = nrow(df) * 0.8)
idx_test = setdiff(seq_len(nrow(df)), idx_train)

Early Stopping in Compboost

How does it work?

The early stopping of compboost is done by using the logger objects. The logger is executed after each iteration and stores class dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with use_as_stopper = TRUE. Declaring a logger as stopper, the logged data is used to stop the algorithm after a logger-specific criteria is reached. For example, using LoggerTime as stopper will break the algorithm algorithm after a pre-defined runtime is reached.

Example with runtime stopping

Now it is time to define a logger to track the runtime. As mentioned above, we set use_as_stopper = TRUE. Now it matters what is specified in max_time since this defines how long we like to train the model. Here we want to stop after 50000 microseconds:

As we can see, the fittings is stopped after 357 and not after 2000 iterations as specified in train. Taking a look at the logger data, we can see that the last entry exceeds the 50000 microseconds and therefore triggers the stopping criteria:

Loss-Based Early Stopping

In machine learning we often like to stop when the best model performance is reached. Especially in boosting, which may tend to overfit, we need either tuning or early stopping to determine what is a good number of iterations \(m\) to get a good model performance. A well-known procedure is to log the out of bag (oob) behavior of the model and stop after this starts to get worse. This is how the oob early stopping is implemented in compboost. The parameter we need to specify are

  • the loss \(L\) that is used for stopping: \[\mathcal{R}_{\text{emp}}^{[m]} = \frac{1}{n}\sum_{i=1}^n L\left(y^{(i)}, f^{[m]}(x^{(i)})\right)\]

  • the percentage of performance increase that should be undershot: \[\text{err}^{[m]} = \frac{\mathcal{R}_{\text{emp}}^{[m- 1]} - \mathcal{R}_{\text{emp}}^{[m]}}{\mathcal{R}_{\text{emp}}^{[m - 1]}}\]

Define the risk logger

Since we are interested in the oob behavior it is necessary to define the oob data and response in a manner that compboost understands it. Therefore, it is possible to use the $prepareResponse() and $prepareData() member functions to create suitable objects:

With these objects we can add the oob risk logger, declare it as stopper, and train the model:

Note: The use of eps_for_break = 0 is a hard constrain to continue the training just until the oob risk starts to increase.

Taking a look at the logger data tells us that we stop exactly after the first five differences are bigger than zero (the oob risk of these iterations are bigger than the previous ones):

library(ggplot2)

ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
  geom_line() +
  xlab("Iteration") +
  ylab("Empirical Risk")

Taking a look at 2000 iterations shows that we have stopped quite good:

cboost$train(2000, trace = 0)
#> 
#> You have already trained 398 iterations.
#> Train 1602 additional iterations.

ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
  geom_line() +
  xlab("Iteration") +
  ylab("Empirical Risk")

Note: It could happen that the model’s oob behavior increases locally for a few iterations and then starts to decrease again. To capture this we need the “patience” parameter which waits for, lets say, 5 iterations and breaks just if all 5 iterations fulfill the criteria. Setting this parameter to one can lead to very unstable results:

Further comments on risk logging

  • Since we can define as many as logger as we like it is possible to define multiple risk logger regarding different loss functions.
  • It is also possible to log performance measures with the risk logging mechanism. This is covered as advanced topic.

Some remarks

  • Early stopping can be done globally or locally:
    • locally: The algorithm stops after the first stopping criteria of any logger is reached
    • globally: The algorithm stops after all stopping criteria are reached
  • Some arguments are ignored when the logger is not set as stopper, e.g. max_time from the time logger
  • The logger functionality is summarized here