Use-case • compboost

Data: Titanic Passenger Survival Data Set

We use the Titanic data set for binary classification of Survived. First, we store the training data in a data frame and remove all rows that contain NAs:

# Store the training data without rows that contain NAs:
df_train = na.omit(titanic::titanic_train)

str(df_train)
#> 'data.frame':    714 obs. of  12 variables:
#>  $ PassengerId: int  1 2 3 4 5 7 8 9 10 11 ...
#>  $ Survived   : int  0 1 1 1 0 0 0 1 1 1 ...
#>  $ Pclass     : int  3 1 3 1 3 1 3 3 2 3 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : chr  "male" "female" "female" "female" ...
#>  $ Age        : num  22 38 26 35 35 54 2 27 14 4 ...
#>  $ SibSp      : int  1 1 0 1 0 0 3 0 1 1 ...
#>  $ Parch      : int  0 0 0 0 0 0 1 2 0 1 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : chr  "S" "C" "S" "S" ...
#>  - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#>   ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
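
To see what na.omit() does on a small scale, the following sketch uses a made-up three-row data frame (toy is hypothetical and not part of the titanic data). Rows containing any NA are dropped, and the indices of the dropped rows are recorded in the na.action attribute, which is what the last two lines of the str() output above show.

```r
# Hypothetical toy data; only the second row contains a missing value.
toy = data.frame(age = c(22, NA, 26), fare = c(7.25, 71.28, 7.92))

toy_complete = na.omit(toy)

# Number of rows kept and the index of the row that was removed:
n_kept = nrow(toy_complete)
removed = as.integer(attr(toy_complete, "na.action"))
```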

In the next step we transform the response to a factor with more intuitive levels:

df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))
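
As a small illustration of how factor() assigns the labels, the sketch below uses a made-up 0/1 vector (survived is hypothetical). The sorted unique values are matched to the labels in order, so 0 becomes "no" and 1 becomes "yes".

```r
# Hypothetical 0/1 vector standing in for the Survived column.
survived = c(0L, 1L, 1L, 0L)

# The sorted unique values (0, 1) are matched to the labels in order.
survived_f = factor(survived, labels = c("no", "yes"))

level_names = levels(survived_f)
as_text = as.character(survived_f)
```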

Initializing Model

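Because compboost uses an R6 API, we first create a new Compboost object. The constructor receives the data, the name of the target as a character string, and optionally a loss. If a loss is passed, it has to be an initialized loss object. The call below passes no loss explicitly (the printed model further down reports LossBinomial) and holds out 30% of the rows via oob_fraction: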

cboost = Compboost$new(data = df_train, target = "Survived", oob_fraction = 0.3)

Using an initialized loss object makes it possible to use a loss that was initialized with a custom offset.
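
The effect of oob_fraction = 0.3 on the sample sizes can be sketched in base R. This is only an illustration under assumptions: how compboost draws its split internally is not shown here, and the sketch simply holds out floor(n * oob_fraction) randomly chosen rows. For the 714 complete rows this gives 214 out-of-bag and 500 in-bag observations, which agrees with the sizes that appear in the output further down.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

n = 714             # rows left after na.omit()
oob_fraction = 0.3

# Assumption: hold out floor(n * oob_fraction) randomly chosen rows.
n_oob = floor(n * oob_fraction)
idx_oob = sample(seq_len(n), size = n_oob)
idx_inbag = setdiff(seq_len(n), idx_oob)
```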

Adding Base-Learner

A base-learner is added by naming the feature as a character string. The second argument is an identifier for the factory; it is required because several base-learners can be defined on the same feature.

Numerical Features

For instance, we can define a spline and a linear base-learner of the same feature:

# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)

# Linear base-learner of age (degree = 1 with intercept is default):
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)

Additional arguments can be specified after naming the base-learner:

# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
  n_knots = 14, penalty = 10, differences = 2)
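
To give an idea of what these arguments mean, here is a minimal P-spline sketch with the splines package that ships with R. It is not compboost's implementation, and the data (x, y) as well as the equidistant knot placement are assumptions made for illustration. With 14 inner knots and degree 2, the B-spline basis has 14 + 2 + 1 = 17 columns, which agrees with the 17 coefficients reported for Fare_spline in the output further down.

```r
library(splines)

set.seed(1)
# Hypothetical data standing in for one numeric feature and a response.
x = runif(200, min = 0, max = 100)
y = sin(x / 15) + rnorm(200, sd = 0.2)

degree = 2
n_knots = 14
penalty = 10
differences = 2

# Assumption: 14 equidistant inner knots strictly inside the range of x.
inner = seq(min(x), max(x), length.out = n_knots + 2)[-c(1, n_knots + 2)]

# B-spline design matrix with n_knots + degree + 1 columns
# (unclass() drops the "bs" class and keeps the numeric matrix).
B = unclass(bs(x, knots = inner, degree = degree, intercept = TRUE))

# Second-order difference penalty on neighbouring coefficients.
D = diff(diag(ncol(B)), differences = differences)
P = crossprod(D)

# Penalized least squares estimate: (B'B + penalty * P)^-1 B'y
coefs = solve(crossprod(B) + penalty * P, crossprod(B, y))
```

A larger penalty pulls neighbouring coefficients closer together and therefore yields a smoother fit.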

For the base-learner documentation, see the functionality section on the project page.

Categorical Features

When adding categorical features we use a dummy coded representation with a ridge penalty:

cboost$addBaselearner("Sex", "categorical", BaselearnerCategoricalRidge)
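
What a dummy coded representation with a ridge penalty amounts to can be sketched in base R. The data (sex, u) and the penalty value lambda are hypothetical, and the sketch does not claim to reproduce compboost's internals. Because each row belongs to exactly one level, X'X is diagonal, so the ridge estimate for a level is the sum of the responses in that group divided by the group size plus lambda.

```r
# Hypothetical toy data: a two-level feature and a numeric pseudo-response.
sex = c("male", "female", "female", "male", "male")
u = c(-1, 1, 0.5, -0.5, -1.5)
lambda = 1  # hypothetical ridge penalty

# Dummy (one-hot) design matrix without intercept: one column per level.
X = model.matrix(~ 0 + sex)

# Ridge estimate: (X'X + lambda * I)^-1 X'u
beta = solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, u))

# Because X'X is diagonal, this equals the shrunken group sums:
beta_direct = as.numeric(tapply(u, sex, sum)) / (as.numeric(table(sex)) + lambda)
```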

Finally, we can check what factories are registered:

cboost$getBaselearnerNames()
#> [1] "Age_spline"      "Age_linear"      "Fare_spline"     "Sex_categorical"

Define Logger

Time logger

This logger records the elapsed time. The time unit can be microseconds, seconds, or minutes. When used as a stopper, the logger stops the training once max_time is reached. Here we only log the time and do not use the logger as a stopper:

cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
  max_time = 0, time_unit = "microseconds")
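
The difference between logging and stopping can be sketched with a small base R loop. The helper run_with_time_logger() is hypothetical and only mimics the idea: it records the elapsed microseconds in every iteration and, if use_as_stopper is TRUE, leaves the loop as soon as max_time is reached.

```r
# Hypothetical helper illustrating the logger/stopper distinction.
run_with_time_logger = function(n_iter, max_time, use_as_stopper) {
  start = Sys.time()
  elapsed = numeric(0)
  for (i in seq_len(n_iter)) {
    # ... one boosting iteration would happen here ...
    now = as.numeric(difftime(Sys.time(), start, units = "secs")) * 1e6
    elapsed = c(elapsed, now)  # log the elapsed microseconds
    if (use_as_stopper && now >= max_time) break
  }
  elapsed
}

# Logging only: all 50 iterations run, whatever max_time is.
log_only = run_with_time_logger(50, max_time = 0, use_as_stopper = FALSE)

# Used as a stopper with max_time = 0: the loop ends after one iteration.
stopped = run_with_time_logger(50, max_time = 0, use_as_stopper = TRUE)
```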

Train Model and Access Elements

cboost$train(2000, trace = 250)
#>    1/2000   risk = 0.68  oob_risk = 0.66   time = 0   
#>  250/2000   risk = 0.5  oob_risk = 0.5   time = 26400   
#>  500/2000   risk = 0.48  oob_risk = 0.48   time = 55487   
#>  750/2000   risk = 0.47  oob_risk = 0.48   time = 87644   
#> 1000/2000   risk = 0.47  oob_risk = 0.48   time = 125157   
#> 1250/2000   risk = 0.47  oob_risk = 0.48   time = 169462   
#> 1500/2000   risk = 0.47  oob_risk = 0.48   time = 209153   
#> 1750/2000   risk = 0.47  oob_risk = 0.48   time = 250716   
#> 2000/2000   risk = 0.46  oob_risk = 0.48   time = 294164   
#> 
#> 
#> Train 2000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.46
cboost
#> 
#> 
#> Component-Wise Gradient Boosting
#> 
#> Target variable: Survived
#> Number of base-learners: 4
#> Learning rate: 0.05
#> Iterations: 2000
#> 
#> Offset: 0.3392
#> 
#> LossBinomial: L(y,x) = log(1 + exp(-2yf(x))
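
The summary above reports a learning rate of 0.05 and 2000 iterations. The following is a minimal sketch of component-wise boosting in general, not of compboost's implementation: it assumes squared error loss and simple linear base-learners on made-up data (X, y and all variable names are hypothetical). In every iteration each base-learner is fitted to the negative gradient, and only the best-fitting one is updated, scaled by the learning rate.

```r
set.seed(1)
n = 200
# Hypothetical features; only x1 and x2 influence the response.
X = cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y = 2 * X[, "x1"] - 1 * X[, "x2"] + rnorm(n, sd = 0.1)

learning_rate = 0.05
n_iter = 500

offset = mean(y)          # constant initial model
pred = rep(offset, n)
coefs = setNames(numeric(ncol(X)), colnames(X))
selected = character(n_iter)
risk = numeric(n_iter + 1)
risk[1] = mean(0.5 * (y - pred)^2)

for (m in seq_len(n_iter)) {
  # Negative gradient of the squared error loss = current residuals.
  r = y - pred
  # Fit every base-learner (least squares slope through the origin) to r.
  slopes = colSums(X * r) / colSums(X^2)
  sse = colSums((r - sweep(X, 2, slopes, "*"))^2)
  # Select the best-fitting base-learner and update only that one.
  best = names(which.min(sse))
  coefs[best] = coefs[best] + learning_rate * slopes[best]
  pred = pred + learning_rate * slopes[best] * X[, best]
  selected[m] = best
  risk[m + 1] = mean(0.5 * (y - pred)^2)
}
```

In this sketch, risk has n_iter + 1 entries because the first entry belongs to the offset-only model, and table(selected) counts how often each base-learner was chosen.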

Objects of the Compboost class have member functions such as getCoef(), getInbagRisk() or predict() to access the results:

str(cboost$getCoef())
#> List of 4
#>  $ Age_spline     : num [1:24, 1] -4.168 -1.131 -1.397 1.081 0.462 ...
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#>  $ Fare_spline    : num [1:17, 1] 1.015 0.322 -0.527 -1.705 -1.41 ...
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#>  $ Sex_categorical: num [1:2, 1] 0.89 -1.39
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "male" "female"
#>   .. ..$ : NULL
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#>  $ offset         : num 0.339
str(cboost$getInbagRisk())
#>  num [1:2001] 0.679 0.676 0.672 0.669 0.666 ...
str(cboost$predict())
#>  num [1:500, 1] 2.104 -2.062 -0.467 -2.088 1.312 ...

To obtain a vector of the selected base-learners, use getSelectedBaselearner():

table(cboost$getSelectedBaselearner())
#> 
#>      Age_spline     Fare_spline Sex_categorical 
#>            1145             521             334
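
The counts are easier to compare as shares of the 2000 iterations. The sketch below copies the counts from the output above into a named vector (counts and shares are illustrative names). Note that Age_linear is registered but does not appear in the table, so it was never selected.

```r
# Selection counts copied from the output above.
counts = c(Age_spline = 1145, Fare_spline = 521, Sex_categorical = 334)

# Share of the 2000 iterations in which each base-learner was selected.
shares = counts / sum(counts)
```

This gives roughly 57% for Age_spline, 26% for Fare_spline and 17% for Sex_categorical.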

We can also access predictions directly from the response objects cboost$response and cboost$response_oob. Note that $response_oob was created automatically because oob_fraction was set in the constructor:

oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
table(true_label = oob_label, predicted = oob_pred)
#>           predicted
#> true_label  -1   1
#>         -1  55  27
#>         1   17 115
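
From this table the out-of-bag accuracy can be computed by hand. The sketch copies the four cells from the output above into a matrix (conf, n_oob and accuracy are illustrative names); the diagonal holds the correctly classified observations.

```r
# Confusion matrix copied from the output above
# (rows: true label, columns: predicted label; filled column by column).
conf = matrix(c(55, 17, 27, 115), nrow = 2,
  dimnames = list(true_label = c("-1", "1"), predicted = c("-1", "1")))

n_oob = sum(conf)
accuracy = sum(diag(conf)) / n_oob
```

The table covers 214 out-of-bag observations, of which 170 are classified correctly, an accuracy of about 0.79.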

Retrain the Model

To continue training or to set the whole model to another iteration, simply call train() again:

cboost$train(3000)
#> 
#> You have already trained 2000 iterations.
#> Train 1000 additional iterations.
#> 
#> 2025/3000   risk = 0.46  oob_risk = 0.48   time = 298880   
#> 2100/3000   risk = 0.46  oob_risk = 0.48   time = 312730   
#> 2175/3000   risk = 0.46  oob_risk = 0.48   time = 327079   
#> 2250/3000   risk = 0.46  oob_risk = 0.48   time = 342173   
#> 2325/3000   risk = 0.46  oob_risk = 0.48   time = 356363   
#> 2400/3000   risk = 0.46  oob_risk = 0.48   time = 371352   
#> 2475/3000   risk = 0.46  oob_risk = 0.48   time = 385944   
#> 2550/3000   risk = 0.46  oob_risk = 0.48   time = 400836   
#> 2625/3000   risk = 0.46  oob_risk = 0.48   time = 415903   
#> 2700/3000   risk = 0.46  oob_risk = 0.49   time = 431397   
#> 2775/3000   risk = 0.46  oob_risk = 0.49   time = 446305   
#> 2850/3000   risk = 0.46  oob_risk = 0.49   time = 462330   
#> 2925/3000   risk = 0.46  oob_risk = 0.49   time = 478116   
#> 3000/3000   risk = 0.46  oob_risk = 0.49   time = 494244

str(cboost$getCoef())
#> List of 4
#>  $ Age_spline     : num [1:24, 1] -5.693 -0.779 -1.667 1.253 0.391 ...
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#>  $ Fare_spline    : num [1:17, 1] 0.973 0.344 -0.504 -1.826 -1.444 ...
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#>  $ Sex_categorical: num [1:2, 1] 0.905 -1.409
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "male" "female"
#>   .. ..$ : NULL
#>   ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#>  $ offset         : num 0.339
str(cboost$getInbagRisk())
#>  num [1:3001] 0.679 0.676 0.672 0.669 0.666 ...
table(cboost$getSelectedBaselearner())
#> 
#>      Age_spline     Fare_spline Sex_categorical 
#>            1931             695             374
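
The traces above suggest that the additional iterations do not improve the out-of-bag risk. As a rough check, the sketch below copies a few logged values from both traces (rounded to two digits, so the result is only approximate) and looks up the first logged iteration at which the out-of-bag risk reaches its minimum. The names iters, oob_risk, best_iter and oob_increased are illustrative.

```r
# Logged values copied from the two training traces above (two-digit rounding).
iters = c(1, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2700, 3000)
oob_risk = c(0.66, 0.50, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48, 0.49, 0.49)

# First logged iteration at which the out-of-bag risk reaches its minimum.
best_iter = iters[which.min(oob_risk)]

# Is the out-of-bag risk at the end higher than that minimum?
oob_increased = oob_risk[length(oob_risk)] > min(oob_risk)
```

Since train() can also set the model to another iteration, a value such as best_iter could serve as a candidate for a smaller model, keeping in mind that the rounded log only gives a coarse picture.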

Next steps

Have a look at the visualization capabilities of the package.
See how other loss functions affect the model training.