Before Starting

  • Read the use-case to learn how to define a Compboost object using the R6 interface

Data: Titanic Passenger Survival Data Set

We use the titanic dataset with binary classification on Survived. First of all, we store the data in a data frame, remove all rows that contain NAs, and encode the target as a factor:

# Load the data, drop rows containing NAs, and encode the target as a factor:
df = na.omit(titanic::titanic_train)
df$Survived = factor(df$Survived, labels = c("no", "yes"))

For the early stopping later on, we split the dataset into train and test indices:

set.seed(123)
idx_train = sample(seq_len(nrow(df)), size = nrow(df) * 0.8)
idx_test = setdiff(seq_len(nrow(df)), idx_train)

Early Stopping in Compboost

How does it work?

The early stopping of compboost is done using logger objects. A logger is executed after each iteration and stores class-dependent data, e.g. the runtime. Additionally, each logger can be declared as a stopper with use_as_stopper = TRUE. When a logger is declared as a stopper, the logged data is used to stop the algorithm once a logger-specific criterion is reached. For example, using LoggerTime as stopper, the algorithm can be stopped after a pre-defined runtime is reached.

Example with runtime stopping

Now it’s time to define a logger to log the runtime. As mentioned above, we set use_as_stopper = TRUE. The value passed to max_time now matters since it defines the stopping behavior. Here we want to stop after 50000 microseconds:
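A minimal sketch of how this might look. The base-learner setup follows the use-case article referenced above; the LoggerTime arguments (max_time, time_unit) reflect our reading of the interface, so treat the snippet as illustrative rather than the exact code that produced the output below:

# Model on the training data with spline base learners (as in the use-case article):
cboost = Compboost$new(data = df[idx_train, ], target = "Survived",
  loss = LossBinomial$new())
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)

# Time logger declared as stopper; training stops once 50000 microseconds are exceeded
# (argument names max_time and time_unit are assumptions about the LoggerTime interface):
cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time",
  max_time = 50000, time_unit = "microseconds")

cboost$train(2000, trace = 0)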

As we can see, the fitting stops after 377 and not after 2000 iterations as specified in train(). Taking a look at the logger data, we can see that the last entry exceeds 50000 microseconds and therefore hits the stopping criterion:
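Assuming the time logger was registered with logger_id = "time" as sketched above, the last entries of the logger data can be inspected like this:

tail(cboost$getLoggerData())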

Loss-Based Early Stopping

In machine learning we often want to stop when the best model performance is reached. Especially in boosting, which tends to overfit, we need either tuning or early stopping to determine a good number of iterations \(m\) that yields good model performance. A well-known procedure is to log the out-of-bag (oob) behavior of the model and stop as soon as it starts to get worse. This is how oob early stopping is implemented in compboost. The freedom we have is to specify

  • the loss \(L\) that is used for stopping: \[\mathcal{R}_{\text{emp}}^{[m]} = \frac{1}{n}\sum_{i=1}^n L\left(y^{(i)}, f^{[m]}(x^{(i)})\right)\]

  • the relative improvement of the risk that must be undershot to stop the algorithm (see the small numeric illustration after this list): \[\text{err}^{[m]} = \frac{\mathcal{R}_{\text{emp}}^{[m-1]} - \mathcal{R}_{\text{emp}}^{[m]}}{\mathcal{R}_{\text{emp}}^{[m-1]}}\]
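As a small numeric illustration of this relative improvement (made-up risk values, not compboost output):

# Made-up oob risk values of six consecutive iterations:
oob_risk = c(0.60, 0.55, 0.52, 0.515, 0.514, 0.5141)

# err^[m] = (R^[m-1] - R^[m]) / R^[m-1]; the last value is negative because the risk increased:
(head(oob_risk, -1) - tail(oob_risk, -1)) / head(oob_risk, -1)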

Define the risk logger

Since we are interested in the oob behavior, it is necessary to define the oob data and response in a format that compboost understands. For this, the $prepareData() and $prepareResponse() methods can be used to get the required objects:
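A sketch of how this could look, assuming a fresh model with the same base learners as before; the exact signatures of the prepare methods are our reading of the interface:

# New model for the oob example (no time stopper this time):
cboost = Compboost$new(data = df[idx_train, ], target = "Survived",
  loss = LossBinomial$new())
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline)

# Out-of-bag data and response in the representation compboost expects:
oob_data = cboost$prepareData(df[idx_test, ])
oob_response = cboost$prepareResponse(df[idx_test, "Survived"])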

With that data we can add the oob risk logger, declare it as stopper, and train the model:
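A sketch of the oob risk logger; the argument names (used_loss, eps_for_break, patience, oob_data, oob_response) are assumptions about the LoggerOobRisk interface, and the logger_id "oob" is the column name used in the plots below:

cboost$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 5,
  oob_data = oob_data, oob_response = oob_response)

cboost$train(2000, trace = 0)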

Note: Setting eps_for_break = 0 is a hard constraint that says the training should continue until the oob risk starts to increase.

Taking a look at the logger data tells us that training stopped after the first 5 consecutive differences were greater than zero (i.e. the oob risk of these iterations was larger than in the previous ones):
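For instance, the last logged entries and the differences of the oob risk can be inspected like this (the column oob corresponds to the logger_id chosen above):

logs = cboost$getLoggerData()
tail(logs)

# The last five differences of the oob risk are positive, i.e. the risk increased:
diff(tail(logs$oob, 6))

Plotting the whole oob risk trace shows where training stopped: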

library(ggplot2)

ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
  geom_line() +
  xlab("Iteration") +
  ylab("Empirical Risk")

Training for the full 2000 iterations shows that we stopped at quite a good point:

cboost$train(2000, trace = 0)
#> 
#> You have already trained 607 iterations.
#> Train 1393 additional iterations.

ggplot(data = cboost$getLoggerData(), aes(x = `_iterations`, y = oob)) +
  geom_line() +
  xlab("Iteration") +
  ylab("Empirical Risk")

Note: It could happen that the model’s oob risk increases locally for a few iterations and then starts to decrease again. To capture this, the patience parameter waits for, let’s say, 5 iterations and stops only if all 5 iterations fulfill the stopping criterion. Setting this parameter to 1 can lead to very unstable results:
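A sketch of such an unstable setup, reusing oob_data and oob_response from above (argument names again as assumptions):

# With patience = 1, a single iteration with increasing oob risk already stops training:
cboost_p1 = Compboost$new(data = df[idx_train, ], target = "Survived",
  loss = LossBinomial$new())
cboost_p1$addBaselearner("Age", "spline", BaselearnerPSpline)
cboost_p1$addBaselearner("Fare", "spline", BaselearnerPSpline)
cboost_p1$addLogger(logger = LoggerOobRisk, use_as_stopper = TRUE, logger_id = "oob",
  used_loss = LossBinomial$new(), eps_for_break = 0, patience = 1,
  oob_data = oob_data, oob_response = oob_response)
cboost_p1$train(2000, trace = 0)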

Further comments on risk logging

  • Since we can define as many loggers as we like, it is possible to define multiple risk loggers with different loss functions.
  • It is also possible to log performance measures with the risk logging mechanism. This is covered as an advanced topic.

Some remarks

  • Early stopping can be done globally or locally:
    • locally: The algorithm stops as soon as the stopping criterion of any single logger is reached
    • globally: The algorithm stops only after the stopping criteria of all loggers are reached
  • Some arguments are ignored when the logger is not used as a stopper, e.g. max_time of the time logger
  • The logger functionality is summarized here