Use-case
Data: Titanic Passenger Survival Data Set
We use the titanic dataset with binary classification on Survived. First of all, we store the train and test data in two data frames and remove all rows that contain NAs:
# Store train and test data:
df_train = na.omit(titanic::titanic_train)
str(df_train)
#> 'data.frame': 714 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
#> $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
#> $ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
#> $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
#> $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
#> - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#> ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
In the next step we transform the response to a factor with more intuitive levels:
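This transformation is not shown in the code above; a minimal sketch, assuming we simply relabel the 0/1 coding of Survived (the labels "no" and "yes" are an assumption, not taken from the original code):
# Assumed relabeling of the 0/1 response:
df_train$Survived = factor(df_train$Survived, labels = c("no", "yes"))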
Initializing Model
Due to the R6 API, it is necessary to create a new class object which gets the data, the target as a character string, and the loss to be used. Note that it is important to pass an initialized loss object:
library(compboost)
cboost = Compboost$new(data = df_train, target = "Survived", oob_fraction = 0.3)
Passing an initialized loss object gives us the opportunity to use a loss initialized with a custom offset.
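As a sketch of such a call, assuming that the LossBinomial constructor accepts a custom offset value and that Compboost$new() takes a loss argument (both assumptions, not taken from the code above):
# Assumed API: binomial loss initialized with a custom offset
loss_custom = LossBinomial$new(0.2)
cboost_custom = Compboost$new(data = df_train, target = "Survived",
  loss = loss_custom, oob_fraction = 0.3)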
Adding Base-Learner
Adding new base-learners is also done by passing a character string that indicates the feature. As a second argument, it is important to provide an identifier for the factory, since we can define multiple base-learners on the same data source.
Numerical Features
For instance, we can define a spline base-learner and a linear base-learner on the same feature:
# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
# Linear base-learner of age (degree = 1 with intercept is default):
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
Additional arguments can be specified after naming the base-learner. For a complete list, see the documentation on the project page:
# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
n_knots = 14, penalty = 10, differences = 2)
Categorical Features
When adding categorical features, we use a dummy-coded representation with a ridge penalty:
cboost$addBaselearner("Sex", "categorical", BaselearnerCategoricalRidge)
Finally, we can check what factories are registered:
cboost$getBaselearnerNames()
#> [1] "Age_spline" "Age_linear" "Fare_spline" "Sex_categorical"
Define Logger
Time logger
This logger logs the elapsed time. The time unit can be one of microseconds, seconds, or minutes. The logger stops the training if max_time is reached, but we do not use it as a stopper here:
cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
max_time = 0, time_unit = "microseconds")
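For illustration only (this call is not used for the model trained below), the same logger could serve as a stopper by setting use_as_stopper = TRUE, here with an assumed budget of 30 seconds:
# Sketch: time logger used as a stopper (not added to the model above)
cboost$addLogger(logger = LoggerTime, use_as_stopper = TRUE, logger_id = "time_stop",
  max_time = 30e6, time_unit = "microseconds")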
Train Model and Access Elements
cboost$train(2000, trace = 250)
#> 1/2000 risk = 0.68 oob_risk = 0.66 time = 0
#> 250/2000 risk = 0.48 oob_risk = 0.53 time = 24501
#> 500/2000 risk = 0.46 oob_risk = 0.53 time = 48120
#> 750/2000 risk = 0.45 oob_risk = 0.53 time = 75757
#> 1000/2000 risk = 0.45 oob_risk = 0.53 time = 104200
#> 1250/2000 risk = 0.45 oob_risk = 0.53 time = 135609
#> 1500/2000 risk = 0.45 oob_risk = 0.53 time = 168278
#> 1750/2000 risk = 0.45 oob_risk = 0.53 time = 202516
#> 2000/2000 risk = 0.45 oob_risk = 0.53 time = 238473
#>
#>
#> Train 2000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.45
cboost
#>
#>
#> Component-Wise Gradient Boosting
#>
#> Target variable: Survived
#> Number of base-learners: 4
#> Learning rate: 0.05
#> Iterations: 2000
#>
#> Offset: 0.3392
#>
#> LossBinomial: L(y,x) = log(1 + exp(-2yf(x)))
Objects of the Compboost class have member functions such as getCoef(), getInbagRisk(), or predict() to access the results:
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -3.355 -1.713 -1.56 1.205 -0.456 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 1.667 0.086 -0.595 -1.201 -1.468 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] -1.58 0.95
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "female" "male"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:2001] 0.679 0.675 0.671 0.668 0.664 ...
str(cboost$predict())
#> num [1:500, 1] -1.78 -2.21 0.98 -0.75 -1.32 ...
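Note that predict() returns continuous scores rather than class labels (see the values above). A sketch of deriving class predictions, assuming the usual sign rule for the internal -1/1 coding (as seen in the out-of-bag table below):
# Sketch: turn scores into -1/1 class predictions via their sign
pred_class = ifelse(cboost$predict() > 0, 1, -1)
table(pred_class)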
To obtain a vector of selected base-learners, use getSelectedBaselearner():
table(cboost$getSelectedBaselearner())
#>
#> Age_spline Fare_spline Sex_categorical
#> 1245 376 379
We can also access predictions directly from the response objects cboost$response and cboost$response_oob. Note that $response_oob was created automatically because an oob_fraction was defined in the constructor:
oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
table(true_label = oob_label, predicted = oob_pred)
#> predicted
#> true_label -1 1
#> -1 58 24
#> 1 26 106
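As a small additional sketch (not part of the original output), the out-of-bag accuracy can be computed directly from these two objects:
# Sketch: share of correctly classified out-of-bag observations
mean(oob_label == oob_pred)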
Retrain the Model
To continue the training, or to set the whole model to another iteration, simply call train() again:
cboost$train(3000)
#>
#> You have already trained 2000 iterations.
#> Train 1000 additional iterations.
#>
#> 2025/3000 risk = 0.45 oob_risk = 0.53 time = 242503
#> 2100/3000 risk = 0.45 oob_risk = 0.54 time = 253553
#> 2175/3000 risk = 0.45 oob_risk = 0.54 time = 264804
#> 2250/3000 risk = 0.45 oob_risk = 0.54 time = 276485
#> 2325/3000 risk = 0.45 oob_risk = 0.54 time = 288327
#> 2400/3000 risk = 0.45 oob_risk = 0.54 time = 300412
#> 2475/3000 risk = 0.45 oob_risk = 0.54 time = 312436
#> 2550/3000 risk = 0.45 oob_risk = 0.54 time = 324750
#> 2625/3000 risk = 0.45 oob_risk = 0.54 time = 337132
#> 2700/3000 risk = 0.45 oob_risk = 0.54 time = 349831
#> 2775/3000 risk = 0.45 oob_risk = 0.54 time = 362701
#> 2850/3000 risk = 0.45 oob_risk = 0.54 time = 375667
#> 2925/3000 risk = 0.44 oob_risk = 0.54 time = 389482
#> 3000/3000 risk = 0.44 oob_risk = 0.54 time = 403167
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -3.885 -1.464 -1.851 1.579 -0.911 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 1.7785 0.0722 -0.5992 -1.2552 -1.5382 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] -1.59 0.96
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "female" "male"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:3001] 0.679 0.675 0.671 0.668 0.664 ...
table(cboost$getSelectedBaselearner())
#>
#> Age_spline Fare_spline Sex_categorical
#> 2098 490 412
Next steps
- Have a look at the visualization capabilities of the package.
- See how other loss functions affect the model training.