
Production mode
basic-production-model.Rmd
Storing the complete Compboost object requires saving a lot of data:
- Data matrices of the raw data.
- Transformed data matrices. Each base learner creates a design matrix with (potentially) multiple columns.
Hence, compboost allows storing the model without the data. In this
vignette, this is called production mode, since it is the more practical
setup when running the model in production.
Store model without data
To do so, just call:
dat = mlr3::tsk("sonar")$data()
cboost = boostSplines(dat, "Class", oob_fraction = 0.3)
#> 1/100 risk = 0.68 oob_risk = 0.68 time = 0
#> 2/100 risk = 0.67 oob_risk = 0.67 time = 1007
#> 4/100 risk = 0.66 oob_risk = 0.67 time = 2819
#> 6/100 risk = 0.66 oob_risk = 0.66 time = 4621
#> 8/100 risk = 0.65 oob_risk = 0.66 time = 6438
#> 10/100 risk = 0.65 oob_risk = 0.65 time = 8323
#> 12/100 risk = 0.64 oob_risk = 0.65 time = 10197
#> 14/100 risk = 0.63 oob_risk = 0.64 time = 12052
#> 16/100 risk = 0.63 oob_risk = 0.64 time = 13910
#> 18/100 risk = 0.62 oob_risk = 0.64 time = 15762
#> 20/100 risk = 0.62 oob_risk = 0.63 time = 17587
#> 22/100 risk = 0.61 oob_risk = 0.63 time = 19412
#> 24/100 risk = 0.61 oob_risk = 0.62 time = 21232
#> 26/100 risk = 0.6 oob_risk = 0.62 time = 23050
#> 28/100 risk = 0.6 oob_risk = 0.62 time = 24912
#> 30/100 risk = 0.59 oob_risk = 0.61 time = 26969
#> 32/100 risk = 0.59 oob_risk = 0.61 time = 28846
#> 34/100 risk = 0.59 oob_risk = 0.61 time = 30707
#> 36/100 risk = 0.58 oob_risk = 0.61 time = 32581
#> 38/100 risk = 0.58 oob_risk = 0.61 time = 34444
#> 40/100 risk = 0.57 oob_risk = 0.6 time = 36301
#> 42/100 risk = 0.57 oob_risk = 0.6 time = 38191
#> 44/100 risk = 0.57 oob_risk = 0.6 time = 40049
#> 46/100 risk = 0.56 oob_risk = 0.6 time = 42420
#> 48/100 risk = 0.56 oob_risk = 0.6 time = 44281
#> 50/100 risk = 0.56 oob_risk = 0.59 time = 46193
#> 52/100 risk = 0.55 oob_risk = 0.59 time = 48061
#> 54/100 risk = 0.55 oob_risk = 0.59 time = 49939
#> 56/100 risk = 0.55 oob_risk = 0.59 time = 51786
#> 58/100 risk = 0.54 oob_risk = 0.59 time = 53649
#> 60/100 risk = 0.54 oob_risk = 0.59 time = 55540
#> 62/100 risk = 0.54 oob_risk = 0.58 time = 57405
#> 64/100 risk = 0.54 oob_risk = 0.58 time = 59272
#> 66/100 risk = 0.53 oob_risk = 0.58 time = 61121
#> 68/100 risk = 0.53 oob_risk = 0.58 time = 62984
#> 70/100 risk = 0.53 oob_risk = 0.58 time = 64842
#> 72/100 risk = 0.52 oob_risk = 0.58 time = 66717
#> 74/100 risk = 0.52 oob_risk = 0.57 time = 68582
#> 76/100 risk = 0.52 oob_risk = 0.57 time = 70456
#> 78/100 risk = 0.52 oob_risk = 0.57 time = 72308
#> 80/100 risk = 0.51 oob_risk = 0.57 time = 74179
#> 82/100 risk = 0.51 oob_risk = 0.57 time = 76027
#> 84/100 risk = 0.51 oob_risk = 0.57 time = 77882
#> 86/100 risk = 0.51 oob_risk = 0.57 time = 79713
#> 88/100 risk = 0.5 oob_risk = 0.56 time = 81639
#> 90/100 risk = 0.5 oob_risk = 0.56 time = 83459
#> 92/100 risk = 0.5 oob_risk = 0.56 time = 85295
#> 94/100 risk = 0.5 oob_risk = 0.56 time = 87130
#> 96/100 risk = 0.49 oob_risk = 0.56 time = 88958
#> 98/100 risk = 0.49 oob_risk = 0.56 time = 90788
#> 100/100 risk = 0.49 oob_risk = 0.56 time = 92609
#>
#>
#> Train 100 iterations in 0 Seconds.
#> Final risk based on the train set: 0.49
file = "cboost.json"
cboost$saveToJson(file, rm_data = TRUE)
cboost_without_data = Compboost$new(file = file)
# The data field now just contains a dummy:
cboost_without_data$data
#> V1 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V2 V20 V21 V22 V23 V24 V25 V26 V27
#> 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> V28 V29 V3 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V4 V40 V41 V42 V43 V44 V45
#> 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> V46 V47 V48 V49 V5 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V6 V60 V7 V8 V9
#> 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Note: Functionality that requires the training data is not available
after storing and loading the object without data. For example, calling
cboost_without_data$predict() without new data now throws an error:
cboost_without_data$predict()
#> Error in eval(expr, envir, enclos): Production mode is on, this does not allow prediction on training data and hence also blocks the continuation of the training. This is most likely because the training data was removed to either store memory or due to privacy reasons.
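In a production script it can be useful to handle this error instead of letting it stop the run. The following is a minimal sketch that repeats the setup from above and wraps the call in tryCatch(). The helper predictTrainOrNull() is hypothetical and not part of compboost, and the temporary file path is chosen only for illustration.

```r
library(compboost)

# Same setup as above: train, store without data, and reload.
dat = mlr3::tsk("sonar")$data()
cboost = boostSplines(dat, "Class", oob_fraction = 0.3)
file = file.path(tempdir(), "cboost.json")
cboost$saveToJson(file, rm_data = TRUE)
cboost_without_data = Compboost$new(file = file)

# Hypothetical helper (not part of compboost): return the predictions on
# the training data if they are available, otherwise NULL with a message.
predictTrainOrNull = function(model) {
  tryCatch(
    model$predict(),
    error = function(e) {
      message("Training data not available: ", conditionMessage(e))
      NULL
    }
  )
}

pred = predictTrainOrNull(cboost_without_data)
is.null(pred)
```

Given the error shown above, pred should be NULL for a model loaded without data, while the same helper applied to cboost should return the usual prediction matrix.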
Functionality of a data free model
The most important functions are still usable:
Predict on new data
ndat = dat[1:10, ]
cboost_without_data$predict(ndat)
#> [,1]
#> [1,] -0.17232799
#> [2,] 0.18369305
#> [3,] -0.30361262
#> [4,] -0.02937112
#> [5,] 0.64950025
#> [6,] 0.38152391
#> [7,] 0.10316929
#> [8,] 0.98339434
#> [9,] 0.07043071
#> [10,] -0.53720517
Visualize partial feature effects.
library(patchwork)
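# Use the most important base learner.
# Assumption: `vip` is not defined in this section; it is taken here to be
# the feature importance table of the model:
vip = cboost_without_data$calculateFeatureImportance()
bln = vip$baselearner[1]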
plotBaselearner(cboost_without_data, bln) +
plotPEUni(cboost_without_data, strsplit(bln, "_")[[1]][1])
Get logger data
head(cboost_without_data$getLoggerData())
#> _iterations oob_risk time baselearner train_risk
#> 1 0 NA NA intercept 0.6795747
#> 2 1 0.6771422 0 V11_spline 0.6757390
#> 3 2 0.6742815 1007 V11_spline 0.6719943
#> 4 3 0.6714987 1931 V11_spline 0.6683382
#> 5 4 0.6687920 2819 V11_spline 0.6647688
#> 6 5 0.6661591 3719 V11_spline 0.6612840
Setting the model to a previous iteration.
table(cboost_without_data$getSelectedBaselearner())
#>
#> V10_spline V11_spline V12_spline V21_spline V36_spline V48_spline V49_spline
#> 13 20 24 21 11 3 8
cboost_without_data$predict(ndat)
#> [,1]
#> [1,] -0.17232799
#> [2,] 0.18369305
#> [3,] -0.30361262
#> [4,] -0.02937112
#> [5,] 0.64950025
#> [6,] 0.38152391
#> [7,] 0.10316929
#> [8,] 0.98339434
#> [9,] 0.07043071
#> [10,] -0.53720517
# State after 50 iterations:
cboost_without_data$train(50)
table(cboost_without_data$getSelectedBaselearner())
#>
#> V10_spline V11_spline V12_spline V21_spline V36_spline
#> 6 19 15 9 1
cboost_without_data$predict(ndat)
#> [,1]
#> [1,] -0.02334412
#> [2,] 0.28855992
#> [3,] -0.05161347
#> [4,] -0.06229294
#> [5,] 0.60796609
#> [6,] 0.51006267
#> [7,] 0.14361074
#> [8,] 0.79343112
#> [9,] -0.03531843
#> [10,] -0.29024662
Advantages
Loading
Loading a model stored without data is faster than loading the full model, although the difference is modest for small models like this one:
system.time(Compboost$new(file = file))
#> user system elapsed
#> 0.101 0.000 0.101
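# `file_full` is not defined above; assuming the full model was stored
# together with its data (the file name is hypothetical):
file_full = "cboost_full.json"
cboost$saveToJson(file_full)
system.time(Compboost$new(file = file_full))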
#> user system elapsed
#> 0.145 0.004 0.149
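A single system.time() call is noisy, so the comparison above is only indicative. The sketch below repeats the loading several times and reports the median elapsed time together with the file sizes. It assumes that saveToJson() keeps the data when rm_data is not set. The file names and the helper medianLoadTime() are hypothetical and only serve this illustration.

```r
library(compboost)

dat = mlr3::tsk("sonar")$data()
cboost = boostSplines(dat, "Class", oob_fraction = 0.3)

# Hypothetical file names in a temporary directory.
file_full = file.path(tempdir(), "cboost_full.json")
file_slim = file.path(tempdir(), "cboost_slim.json")
cboost$saveToJson(file_full)                  # assumption: data is kept by default
cboost$saveToJson(file_slim, rm_data = TRUE)  # data removed, as shown earlier

# Hypothetical helper: median elapsed time (in seconds) over `reps` loads.
medianLoadTime = function(path, reps = 10L) {
  times = vapply(seq_len(reps), function(i) {
    system.time(Compboost$new(file = path))[["elapsed"]]
  }, numeric(1))
  median(times)
}

res = data.frame(
  model = c("full", "without data"),
  size_bytes = file.size(c(file_full, file_slim)),
  median_load_sec = c(medianLoadTime(file_full), medianLoadTime(file_slim))
)
res
```

The numbers depend on the machine and the model size, so no output is shown here. For larger data sets the gap in both columns should become more visible.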