Data: Titanic Passenger Survival Data Set
We use the titanic dataset for binary classification on Survived. First, we store the training data in a data frame and remove all rows that contain NAs:
# Store the training data and drop rows that contain NAs:
df_train = na.omit(titanic::titanic_train)
str(df_train)
#> 'data.frame': 714 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
#> $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
#> $ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
#> $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
#> $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
#> - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#> ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
In the next step we transform the response to a factor with more intuitive levels:
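The code for this step is not shown here. A minimal sketch in base R, assuming the labels "no" and "yes" (an illustrative choice, not taken from the original) and using a small stand-in vector instead of df_train$Survived:

```r
# Stand-in for df_train$Survived (first values copied from the str() output above)
survived = c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1)

# Map 0/1 to a factor; the labels "no"/"yes" are an assumption for illustration
survived_fct = factor(survived, levels = c(0, 1), labels = c("no", "yes"))

table(survived_fct)
```

Applied to the data frame, the same factor() call would be assigned back to df_train$Survived.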
Initializing Model
Because of the R6 API, we first create a new class object, which takes the data, the name of the target as a character string, and the loss. Note that the loss must be passed as an initialized object:
cboost = Compboost$new(data = df_train, target = "Survived", oob_fraction = 0.3)
Passing an initialized loss object makes it possible to use a loss with a custom offset. The call above passes no loss explicitly; the model summary printed after training reports LossBinomial.
Adding Base-Learner
New base-learners are also added by passing a character string that names the feature. The second argument is an identifier for the factory, which is needed because we can define multiple base-learners on the same source.
Numerical Features
For instance, we can define a spline and a linear base-learner of the same feature:
# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
# Linear base-learner of age (degree = 1 with intercept is default):
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
Additional arguments can be specified after naming the base-learner:
# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
  n_knots = 14, penalty = 10, differences = 2)
For the base-learner documentation, see the functionality section on the project page.
Categorical Features
When adding categorical features we use a dummy coded representation with a ridge penalty:
cboost$addBaselearner("Sex", "categorical", BaselearnerCategoricalRidge)
Finally, we can check what factories are registered:
cboost$getBaselearnerNames()
#> [1] "Age_spline" "Age_linear" "Fare_spline" "Sex_categorical"
Define Logger
Time logger
This logger logs the elapsed time. The time unit can be one of microseconds, seconds, or minutes. The logger stops the training if max_time is reached, but we do not use this logger as a stopper here:
cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
  max_time = 0, time_unit = "microseconds")
Train Model and Access Elements
cboost$train(2000, trace = 250)
#> 1/2000 risk = 0.68 oob_risk = 0.66 time = 0
#> 250/2000 risk = 0.5 oob_risk = 0.5 time = 26400
#> 500/2000 risk = 0.48 oob_risk = 0.48 time = 55487
#> 750/2000 risk = 0.47 oob_risk = 0.48 time = 87644
#> 1000/2000 risk = 0.47 oob_risk = 0.48 time = 125157
#> 1250/2000 risk = 0.47 oob_risk = 0.48 time = 169462
#> 1500/2000 risk = 0.47 oob_risk = 0.48 time = 209153
#> 1750/2000 risk = 0.47 oob_risk = 0.48 time = 250716
#> 2000/2000 risk = 0.46 oob_risk = 0.48 time = 294164
#>
#>
#> Train 2000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.46
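The trace reports time in the unit chosen for the time logger, here microseconds, which is consistent with the summary line reporting 0 seconds. A quick conversion of the last logged value, assuming the logged time is cumulative since the start of training:

```r
# Last logged time after 2000 iterations, in microseconds (from the trace above)
time_us = 294164
iterations = 2000

time_s = time_us / 1e6               # total time in seconds
us_per_iter = time_us / iterations   # average microseconds per iteration

round(time_s, 2)
round(us_per_iter)
```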
cboost
#>
#>
#> Component-Wise Gradient Boosting
#>
#> Target variable: Survived
#> Number of base-learners: 4
#> Learning rate: 0.05
#> Iterations: 2000
#>
#> Offset: 0.3392
#>
#> LossBinomial: L(y,x) = log(1 + exp(-2yf(x))
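As a reading aid for the printed loss, the following base R sketch evaluates log(1 + exp(-2 y f)) for labels coded as -1 and 1 and finds the constant score that minimizes the average loss on toy labels. The closed form 0.5 * log(n_pos / n_neg) follows from setting the derivative of the average loss to zero. Whether compboost computes its offset exactly this way is an assumption that this text does not confirm.

```r
# Binomial loss as printed in the model summary, for y in {-1, 1}
binomial_loss = function(y, f) log(1 + exp(-2 * y * f))

# Toy labels, made up for illustration: three positives, one negative
y = c(1, 1, 1, -1)

# Average loss of a constant score f
risk = function(f) mean(binomial_loss(y, f))

# Numerical minimizer versus the closed form 0.5 * log(n_pos / n_neg)
numeric_opt = optimize(risk, interval = c(-5, 5))$minimum
closed_form = 0.5 * log(sum(y == 1) / sum(y == -1))

c(numeric = numeric_opt, closed_form = closed_form)
```

Under that assumption, the printed offset of 0.3392 would correspond to a positive-class share of about 1 / (1 + exp(-2 * 0.3392)), roughly 0.66, in the training part of the data.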
Objects of the Compboost class have member functions such as getCoef(), getInbagRisk(), or predict() to access the results:
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -4.168 -1.131 -1.397 1.081 0.462 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 1.015 0.322 -0.527 -1.705 -1.41 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] 0.89 -1.39
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "male" "female"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:2001] 0.679 0.676 0.672 0.669 0.666 ...
str(cboost$predict())
#> num [1:500, 1] 2.104 -2.062 -0.467 -2.088 1.312 ...
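The coefficient vector lengths above can be related to the spline settings: a B-spline basis with k inner knots and degree d has k + d + 1 basis functions, which gives 14 + 2 + 1 = 17 for Fare_spline. The sketch below checks this count with splines::bs, which ships with R and only illustrates the arithmetic; compboost builds its own basis internally. That the 24 coefficients of Age_spline come from 20 inner knots with degree 3 is an assumption about the package defaults.

```r
library(splines)

# Number of B-spline basis functions for k inner knots and degree d
n_basis = function(n_knots, degree) n_knots + degree + 1

# Illustrative feature values and 14 equally spaced inner knots
x = seq(0, 1, length.out = 50)
inner_knots = seq(0, 1, length.out = 16)[2:15]

# intercept = TRUE keeps the full basis, so the column count matches k + d + 1
basis = bs(x, knots = inner_knots, degree = 2, intercept = TRUE)

ncol(basis)       # compare with n_basis(14, 2)
n_basis(14, 2)
n_basis(20, 3)    # assumed defaults for Age_spline
```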
To obtain a vector of the selected base-learners, use getSelectedBaselearner():
table(cboost$getSelectedBaselearner())
#>
#> Age_spline Fare_spline Sex_categorical
#> 1145 521 334
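Component-wise boosting selects one base-learner per iteration, so the counts add up to the number of iterations. Age_linear was registered but does not appear in the table, meaning it was never selected. The shares can be computed from the counts copied from the output above:

```r
# Selection counts after 2000 iterations (copied from the table above)
selected = c(Age_spline = 1145, Fare_spline = 521, Sex_categorical = 334)

# Total number of selections, one per boosting iteration
sum(selected)

# Share of iterations in which each base-learner was selected
shares = selected / sum(selected)
round(shares, 2)
```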
We can also access predictions directly from the response objects cboost$response and cboost$response_oob. Note that $response_oob was created automatically when defining an oob_fraction within the constructor:
oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
table(true_label = oob_label, predicted = oob_pred)
#> predicted
#> true_label -1 1
#> -1 55 27
#> 1 17 115
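To summarize the confusion matrix, the following base R sketch rebuilds it from the numbers printed above and computes the out-of-bag accuracy and the share of true "1" observations that were predicted as "1". The object names are chosen for illustration only.

```r
# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf_mat = matrix(c(55, 17, 27, 115), nrow = 2,
  dimnames = list(true_label = c("-1", "1"), predicted = c("-1", "1")))

# Share of correctly classified out-of-bag observations
accuracy = sum(diag(conf_mat)) / sum(conf_mat)

# Share of true "1" observations predicted as "1"
recall_pos = conf_mat["1", "1"] / sum(conf_mat["1", ])

round(c(accuracy = accuracy, recall_pos = recall_pos), 3)
```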
Retrain the Model
To continue training or to set the whole model to another iteration, simply call train() again:
cboost$train(3000)
#>
#> You have already trained 2000 iterations.
#> Train 1000 additional iterations.
#>
#> 2025/3000 risk = 0.46 oob_risk = 0.48 time = 298880
#> 2100/3000 risk = 0.46 oob_risk = 0.48 time = 312730
#> 2175/3000 risk = 0.46 oob_risk = 0.48 time = 327079
#> 2250/3000 risk = 0.46 oob_risk = 0.48 time = 342173
#> 2325/3000 risk = 0.46 oob_risk = 0.48 time = 356363
#> 2400/3000 risk = 0.46 oob_risk = 0.48 time = 371352
#> 2475/3000 risk = 0.46 oob_risk = 0.48 time = 385944
#> 2550/3000 risk = 0.46 oob_risk = 0.48 time = 400836
#> 2625/3000 risk = 0.46 oob_risk = 0.48 time = 415903
#> 2700/3000 risk = 0.46 oob_risk = 0.49 time = 431397
#> 2775/3000 risk = 0.46 oob_risk = 0.49 time = 446305
#> 2850/3000 risk = 0.46 oob_risk = 0.49 time = 462330
#> 2925/3000 risk = 0.46 oob_risk = 0.49 time = 478116
#> 3000/3000 risk = 0.46 oob_risk = 0.49 time = 494244
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -5.693 -0.779 -1.667 1.253 0.391 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 0.973 0.344 -0.504 -1.826 -1.444 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] 0.905 -1.409
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "male" "female"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:3001] 0.679 0.676 0.672 0.669 0.666 ...
table(cboost$getSelectedBaselearner())
#>
#> Age_spline Fare_spline Sex_categorical
#> 1931 695 374
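The traces suggest that the out-of-bag risk stops improving well before 3000 iterations and rises slightly from 0.48 to 0.49 near the end, while the training risk stays flat. A rough sketch for picking a stopping point, using a handful of rounded values transcribed from the traces above (the coarse rounding and the trace interval limit how precise this can be):

```r
# Iterations and rounded oob risk transcribed from the training traces above
iters    = c(1, 250, 500, 1000, 2000, 2625, 2700, 3000)
oob_risk = c(0.66, 0.50, 0.48, 0.48, 0.48, 0.48, 0.49, 0.49)

# First logged iteration that reaches the smallest logged oob risk
best_iter = iters[which.min(oob_risk)]
best_iter
```

Since calling train() again with a different number sets the model to that iteration, the model could then be moved back with cboost$train(best_iter). In practice the value should be determined from the full logged risk rather than from the rounded trace.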
Next steps
- Have a look at the visualization capabilities of the package.
- See how other loss functions affect the model training.