Component-wise boosting

Fit a component-wise boosting model (Buehlmann and Yu (2003)). This class wraps the S4 class system with Compboost_internal as the internal model representation exposed by Rcpp. The two convenience wrappers boostLinear() and boostSplines() also create objects of this class.

To visualize the internals, see plotBaselearnerTraces(), plotBaselearner(), plotFeatureImportance(), plotPEUni(), plotTensor(), and plotRisk(). To visualize the contribution for one new observation, see plotIndividualContribution().

References

Buehlmann, Peter, Yu, Bin (2003). “Boosting with the L2 loss: regression and classification.” Journal of the American Statistical Association, 98(462), 324-339. doi:10.1198/016214503000125.

Public fields

data: (data.frame())
The data used for training the model. Note: If oob_fraction is set, the input data is split into data and data_oob. Hence, data contains a subset of the input data to train the model.
data_oob: (data.frame())
An out-of-bag data set used for risk logging or early stopping. data_oob is split from the input data (see the data field).
oob_fraction: (numeric(1))
The fraction of nrow(input data) defining the number of observations in data_oob.
response: (ResponseRegr | ResponseBinaryClassif)
An S4 response object. See ?ResponseRegr or ?ResponseBinaryClassif for help. This object holds the current prediction, pseudo residuals, and functions to transform scores. Note: This response corresponds to the data field and holds the predictions for that data.frame.
response_oob: (ResponseRegr | ResponseBinaryClassif)
An S4 response object. See ?ResponseRegr or ?ResponseBinaryClassif for help. Same as response but for data_oob.
target: (character(1))
Name of the target variable in data.
id: (character(1))
Name of the data object defined in $new(data, ...).
optimizer: (OptimizerCoordinateDescent | OptimizerCoordinateDescentLineSearch | OptimizerAGBM | OptimizerCosineAnnealing)
An initialized S4 optimizer object (requires a call to Optimizer*$new(...)). See the respective help page for further information.
loss: (LossQuadratic | LossBinomial | LossHuber | LossAbsolute | LossQuantile)
An initialized S4 loss object (requires to call Loss*$new(...)). See the respective help page for further information.
learning_rate: (numeric(1))
The learning rate of the model. Note: Some optimizers vary the learning rate dynamically.
model: (Compboost_internal)
The internal Compboost object exported from Rcpp. See ?Compboost_internal for details.
bl_factory_list: (BlearnerFactoryList)
A container with all base learners. See ?BlearnerFactoryList for details.
positive: (character(1))
The positive class in the case of binary classification.
stop_all: (logical(1))
Indicator whether all stoppers must return TRUE to stop the algorithm early. Comparable to all() if stop_all = TRUE and any() if stop_all = FALSE.
early_stop: (logical(1))
Indicator whether early stopping is used or not.
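
The all()/any() analogy given for stop_all can be sketched in base R. This is only an illustration of the described semantics, not the package's internal code; the vector stopper_votes and the helper shouldStop() are hypothetical names introduced here.

```r
# Hypothetical stopper results for one iteration: TRUE means
# "this stopper votes to stop". The names are made up for illustration.
stopper_votes = c(oob_risk = TRUE, time = FALSE)

# Combine the votes the way the stop_all description suggests.
shouldStop = function(votes, stop_all) {
  if (stop_all) all(votes) else any(votes)
}

stop_if_all = shouldStop(stopper_votes, stop_all = TRUE)   # all() semantics
stop_if_any = shouldStop(stopper_votes, stop_all = FALSE)  # any() semantics
```

With stop_all = TRUE a single stopper that has not yet triggered keeps the training running, while stop_all = FALSE stops as soon as one stopper triggers.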

Active bindings

offset

(numeric())
Offset of the estimated model.

baselearner_list

(list())
Named list with names $getBaselearnerNames(). Each element contains:

"feature" (character(1)): The name of the feature from data.
"factory" (Baselearner*): The raw base learner as factory object. See ?Baselearner* for details.

boost_intercept

(logical(1))
Logical value indicating whether an intercept base learner was added with $addIntercept() or not.

logs

(data.frame)
Basic information such as risk, selected base learner, etc. about each iteration. If data_oob is set, further information about the validation/OOB risk is also logged. The same applies to time logging etc. Note: The field logs is set internally and updated after each call to $getLoggerData(). Hence, it caches the logged data set instead of recalculating it, as is done by $getLoggerData().

idx_oob

(integer())
An index vector used to split data into data = data[idx_train, ] and data_oob = data[idx_oob, ]. Note: oob_fraction is ignored if this argument is set.

idx_train

(integer())
An index vector used to split data into data = data[idx_train, ] and data_oob = data[idx_oob, ]. Note: oob_fraction is ignored if this argument is set.
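
The split into data and data_oob described above can be sketched in base R. How compboost derives idx_oob from oob_fraction (sampling scheme and rounding) is not stated here, so the sampling and the floor() below are assumptions; input_data is a hypothetical name for the data passed to $new().

```r
# Illustrative split of a data.frame into training and out-of-bag parts.
input_data = mtcars
oob_fraction = 0.3

set.seed(1)
n = nrow(input_data)
# Assumption: the number of OOB rows is rounded down; the package may round differently.
idx_oob = sort(sample(seq_len(n), size = floor(oob_fraction * n)))
idx_train = setdiff(seq_len(n), idx_oob)

# Same indexing as given in the field descriptions.
data = input_data[idx_train, ]
data_oob = input_data[idx_oob, ]
```

Passing an explicit idx_oob instead of oob_fraction makes the split reproducible independently of any internal sampling.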

Methods

Method `new()`

Creates a new instance of this R6 class.

Usage

Compboost$new(
  data = NULL,
  target = NULL,
  optimizer = NULL,
  loss = NULL,
  learning_rate = 0.05,
  positive = NULL,
  oob_fraction = NULL,
  early_stop = FALSE,
  idx_oob = NULL,
  stop_args = list(eps_for_break = 0, patience = 10L),
  file = NULL
)

Arguments

data: (data.frame)
The data set to build the object. Note: This data set is completely used for training if is.null(idx_oob). Otherwise, the data set is split into data = data[idx_train, ] and data_oob = data[idx_oob, ].
target: (character(1))
Character indicating the name of the target variable.
optimizer: (OptimizerCoordinateDescent | OptimizerCoordinateDescentLineSearch | OptimizerAGBM | OptimizerCosineAnnealing)
An initialized S4 optimizer object (requires a call to Optimizer*$new(...)). See the respective help page for further information.
loss: (LossQuadratic | LossBinomial | LossHuber | LossAbsolute | LossQuantile)
An initialized S4 loss object (requires to call Loss*$new(...)). See the respective help page for further information.
learning_rate: (numeric(1))
Learning rate of the model (default is 0.05).
positive: (character(1))
The name of the positive class (in the case of binary classification).
oob_fraction: (numeric(1))
The fraction of nrow(input data) defining the number of observations in data_oob. This argument is ignored if idx_oob is set.
early_stop: (logical(1))
Indicator whether early stopping should be used or not.
idx_oob: (integer())
An index vector used to split data into data = data[idx_train, ] and data_oob = data[idx_oob, ]. Note: oob_fraction is ignored if this argument is set.
stop_args: (list(integer(1), integer(1)))
List containing the two elements patience and eps_for_break, which are used for early stopping.
file: (character(1))
File from which a model should be loaded. If NULL, data and target must be defined.
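
The roles of patience and eps_for_break in stop_args can be illustrated with a small base R sketch. The exact stopping rule is not spelled out here, so the rule below (stop once the relative risk improvement stays at or below eps_for_break for patience consecutive iterations) is an assumption, and firstStopIteration() is a hypothetical helper, not part of the package.

```r
# Returns the iteration at which training would stop, or NA if it never stops.
# Assumption: "improvement" is the relative decrease of the risk between
# consecutive iterations, and the counter resets after any sufficient improvement.
firstStopIteration = function(risk, eps_for_break = 0, patience = 10L) {
  n_bad = 0L
  for (m in seq_along(risk)[-1]) {
    rel_improvement = (risk[m - 1] - risk[m]) / risk[m - 1]
    n_bad = if (rel_improvement <= eps_for_break) n_bad + 1L else 0L
    if (n_bad >= patience) return(m)
  }
  NA_integer_
}

# Hypothetical OOB risk trace: improves for three steps, then gets worse.
oob_risk = c(10, 8, 7, 6.5, 6.6, 6.7, 6.8, 6.9)
stop_at = firstStopIteration(oob_risk, eps_for_break = 0, patience = 3L)
```

A larger patience tolerates longer plateaus in the risk, while a larger eps_for_break demands a bigger improvement per iteration before the counter is reset.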

Method `addLogger()`

Add a logger to the model.

Usage

Compboost$addLogger(logger, use_as_stopper = FALSE, logger_id, ...)

Arguments

logger: (LoggerIteration | LoggerTime | LoggerInbagRisk | LoggerOobRisk)
The uninitialized logger.
use_as_stopper: (logical(1))
Indicator whether the logger is also used as a stopper, i.e. considered for early stopping.
logger_id: (character(1))
The id of the logger. This allows defining two loggers of the same type (e.g. risk logging) but with different arguments.
...: Additional arguments passed to logger$new(logger_id, use_as_stopper, ...).

Method `getCurrentIteration()`

Get the number of the current iteration.

Usage

Compboost$getCurrentIteration()

Returns

integer(1) value.

Method `addIntercept()`

This function adds a base learner that adjusts the intercept (if selected). Adding an intercept base learner may be necessary, e.g., when adding linear effects without intercept.

Usage

Compboost$addIntercept(id = "intercept", data_source = InMemoryData)

Arguments

id: (character(1))
The id of the base learner (default is "intercept").
data_source: (InMemoryData)
Uninitialized data object used to store the meta data. Note: At the moment, only in-memory storage is supported, see ?InMemoryData for details.

Method `addBaselearner()`

Add a base learner of one feature to the model that is considered in each iteration. $addBaselearner() only allows including univariate features. See $addTensor() for bivariate effect modelling and $addComponents() for an effect decomposition.

Usage

Compboost$addBaselearner(
  feature,
  id,
  bl_factory,
  data_source = InMemoryData,
  ...
)

Arguments

feature: (character(1))
Name of the feature, must be a column in data.
id: (character(1))
The name of the base learner.
bl_factory: (BaselearnerPolynomial | BaselearnerPSpline | BaselearnerCategoricalBinary | BaselearnerCategoricalRidge)
Uninitialized base learner class. See the respective help page for details.
data_source: (InMemoryData)
Uninitialized data object used to store the meta data. Note: At the moment, only in-memory storage is supported, see ?InMemoryData for details.
...: Further arguments passed to the $new(...) constructor of bl_factory.

Method `rmBaselearner()`

Remove a base learner from the model.

Usage

Compboost$rmBaselearner(blname)

Arguments

blname: (character(1))
Name of the base learner that should be removed. Must be an element of $getBaselearnerNames().

Method `addTensor()`

Add a row-wise tensor product of features. Note: The base learners are pre-defined by the type of the feature. Numerical features use a BaselearnerPSpline, while categorical features are included using a BaselearnerCategoricalRidge base learner. Including an arbitrary tensor product requires using the S4 API with BaselearnerTensor on two base learners of any type.

Usage

Compboost$addTensor(
  feature1,
  feature2,
  df = NULL,
  df1 = NULL,
  df2 = NULL,
  isotrop = FALSE,
  ...
)

Arguments

feature1: (character(1))
Name of the first feature. Must be an element of names(data).
feature2: (character(1))
Name of the second feature. Must be an element of names(data).
df: (numeric(1))
The degrees of freedom used for both base learners (this parameter overwrites df1 and df2).
df1: (numeric(1))
The degrees of freedom used for the first base learner.
df2: (numeric(1))
The degrees of freedom used for the second base learner.
isotrop: (logical(1))
Indicator how the two penalties should be combined. If isotrop == TRUE, the total degrees of freedom are uniformly distributed over the dimensions, while isotrop == FALSE allows defining how strongly each of the two dimensions is penalized.
...: Additional arguments passed to the $new() constructor of the BaselearnerPSpline class.
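
The term "row-wise tensor product" can be made concrete with a small base R sketch: row i of the result is the Kronecker product of row i of the two design matrices. The helper rowWiseTensor() and the matrices A and B are hypothetical and only illustrate the operation; they are not the package's implementation.

```r
# Row-wise tensor product: row i of the result is kronecker(A[i, ], B[i, ]).
rowWiseTensor = function(A, B) {
  stopifnot(nrow(A) == nrow(B))
  idx_a = rep(seq_len(ncol(A)), each = ncol(B))
  idx_b = rep(seq_len(ncol(B)), times = ncol(A))
  A[, idx_a, drop = FALSE] * B[, idx_b, drop = FALSE]
}

# Two small hypothetical design matrices with the same number of rows.
A = matrix(1:6, nrow = 3)                            # 3 x 2
B = matrix(c(1, 0, 2, 0, 1, 3, 1, 1, 1), nrow = 3)   # 3 x 3
design_tensor = rowWiseTensor(A, B)                  # 3 x 6
```

The number of columns multiplies (here 2 * 3 = 6), which is why the degrees of freedom of the two marginal base learners matter for the size of the combined effect.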

Method `addComponents()`

Add an effect with individual components. A linear term is added as well as a non-linear term without the linear effect. This ensures that the linear component is selected prior to the non-linear effect. The non-linear effect is only included if a deviation from a linear effect is required.

Note: Internally, a BaselearnerPolynomial with degree one and a BaselearnerCentered is added. Centering a base learner makes the design matrix dense and hence memory is filled very fast. Considering binning may be an option to reduce the memory consumption.

Usage

Compboost$addComponents(feature, ...)

Arguments

feature: (character(1))
Name of the feature, must be a column in data.
...: Additional arguments passed to the $new() constructor of the BaselearnerPSpline class.
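
Why centering makes the design matrix dense can be seen in a base R sketch. It assumes that centering amounts to removing the linear part from a spline basis by projection; the package may use a different but related construction, and all object names below are hypothetical.

```r
library(splines)  # ships with base R

set.seed(1)
x = sort(runif(200))

# Linear design (intercept + slope) and a cubic B-spline basis for the same feature.
X_linear = cbind(1, x)
B = matrix(bs(x, df = 10), nrow = length(x))  # plain matrix without bs attributes

# Assumption: "centering" is illustrated here as projecting out the linear part,
# so that the remaining basis is orthogonal to X_linear.
B_centered = B - X_linear %*% solve(crossprod(X_linear), crossprod(X_linear, B))

sparsity_raw = mean(B == 0)                 # share of exact zeros before centering
sparsity_centered = mean(B_centered == 0)   # share of exact zeros afterwards
```

The raw B-spline basis has only a few non-zero entries per row, whereas the projected basis has essentially none that are exactly zero, so it can no longer be stored sparsely.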

Method `train()`

Start fitting a model.

Usage

Compboost$train(iteration = 100, trace = -1)

Arguments

iteration: (integer(1))
The maximal number of iterations. The algorithm can be stopped earlier if early stopping is active.
trace: (integer(1))
The number of iterations after which the status of the fitting is printed to the screen. The default trace = -1 internally uses trace = round(iteration / 40). To fit the model silently, use trace = 0.
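
The default rule for trace is simple arithmetic and can be reproduced in base R. The helper resolveTrace() is a hypothetical name used only for this illustration.

```r
# Resolve the effective print interval from the documented default rule.
resolveTrace = function(iteration, trace = -1) {
  if (trace == -1) round(iteration / 40) else trace
}

trace_default = resolveTrace(1000)     # round(1000 / 40)
trace_silent = resolveTrace(1000, 0)   # 0 means no status output
```

With the default, roughly 40 status lines are printed regardless of how many iterations are requested.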

Method `prepareData()`

Internally, each base learner is built on an InMemoryData object. Some methods (e.g. adding a LoggerOobRisk) require passing the data as list(InMemoryData | CategoricalDataRaw) with data objects as elements. This function converts the given data.frame into that format.

Usage

Compboost$prepareData(newdata)

Arguments

newdata: (data.frame)
New data set of the same structure as data.

Returns

list(InMemoryData | CategoricalDataRaw) with data containers as elements. Numeric features are wrapped by InMemoryData while categorical features are included with CategoricalDataRaw.

Method `prepareResponse()`

Same as for $prepareData() but for the response. Internally, vectorToResponse() is used to generate a ResponseRegr or ResponseBinaryClassif object.

Usage

Compboost$prepareResponse(response)

Arguments

response: (vector())
A vector of type numeric or categorical that is transformed to a response object.

Returns

ResponseRegr | ResponseBinaryClassif object.

Method `predict()`

Calculate predictions.

Usage

Compboost$predict(newdata = NULL, as_response = FALSE)

Arguments

newdata: (data.frame)
New data set of the same structure as data.
as_response: (logical(1))
In the case of binary classification, as_response = TRUE returns predictions as response, i.e. classes.

Returns

Vector of predictions.

Method `predictIndividual()`

While $predict() returns the sum of all base learner predictions, this function returns a list with the predictions for each base learner.

Usage

Compboost$predictIndividual(newdata)

Arguments

newdata: (data.frame)
New data set of the same structure as data.

Returns

Named list() with the included base learner names as names and the base learner predictions as elements.
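
The relation between the two prediction methods can be illustrated with made-up numbers. The list below only mimics the shape of the return value described above; its names and values are hypothetical, and how the offset enters the final prediction is not covered by this sketch.

```r
# Hypothetical per-base-learner contributions for three observations, in the
# shape $predictIndividual() is described to return (a named list).
contributions = list(
  hp_spline = c(-1.2, 0.4, 0.9),
  wt_spline = c(2.0, -0.5, -1.1)
)

# Summing the list element-wise gives the combined base learner effect.
# The offset is not part of this sketch.
combined = Reduce(`+`, contributions)
```

Keeping the contributions separate is what allows inspecting how much each base learner adds to the prediction of a single observation.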

Method `transformData()`

Get the design matrices of all (or a subset of) base learners for a new data.frame.

Usage

Compboost$transformData(newdata, blnames = NULL)

Arguments

newdata: (data.frame)
New data set of the same structure as data.
blnames: (character())
Names of the base learners for which the design matrices are returned. If is.null(blnames), compboost tries to guess all base learners that were constructed based on the feature names of newdata.

Returns

list(matrix | Matrix::Matrix) with matrices as elements.

Method `getInbagRisk()`

Return the training risk of each iteration.

Usage

Compboost$getInbagRisk()

Returns

numeric() vector of risk values or NULL if $train() was not called previously.

Method `getSelectedBaselearner()`

Get a vector with the name of the selected base learner of each iteration.

Usage

Compboost$getSelectedBaselearner()

Returns

character() vector of base learner names.

Method `print()`

Printer of the object.

Usage

Compboost$print()

Returns

Invisibly returns the object.

Method `getCoef()`

Get the estimated coefficients.

Usage

Compboost$getCoef()

Returns

list(pars, offset) with estimated coefficients/parameters and intercept/offset.

Method `getEstimatedCoef()`

DEPRECATED: Use $getCoef() instead. Get the estimated coefficients.

Usage

Compboost$getEstimatedCoef()

Returns

list(pars, offset) with estimated coefficients/parameters and intercept/offset.

Method `getBaselearnerNames()`

Get the names of the registered base learners.

Usage

Compboost$getBaselearnerNames()

Returns

character() of base learner names.

Method `getLoggerData()`

Get the logged information.

Usage

Compboost$getLoggerData()

Returns

data.frame of logging information.

Method `calculateFeatureImportance()`

Calculate feature importance based on the training risk. Note that early stopping should be used to get adequate importance measures.

Usage

Compboost$calculateFeatureImportance(
  num_feats = NULL,
  aggregate_bl_by_feat = FALSE
)

Arguments

num_feats: (integer(1))
The number of considered features. The num_feats most important feature names and their respective values are returned. If num_feats = NULL, all features are considered.
aggregate_bl_by_feat: (logical(1))
Indicator whether the importance is aggregated at the feature level. For example, adding components includes two different base learners for the same feature. If aggregate_bl_by_feat == TRUE, the importance of these two base learners is aggregated instead of considering them individually.

Returns

Named numeric() vector of length num_feats (if at least num_feats were selected) with importance values as elements.
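
A risk-based importance measure of this kind can be sketched in base R. The sketch assumes that importance is the summed reduction of the training risk over all iterations in which a base learner was selected; this is an assumption about the measure, and the log values below are invented.

```r
# Hypothetical training log: risk after each iteration and the selected base learner.
# The first risk value belongs to the initial (offset-only) model.
train_risk = c(16.0, 14.5, 13.5, 13.0, 12.0, 11.8)
selected = c("wt_spline", "wt_spline", "hp_spline", "wt_spline", "hp_spline")

# Assumption: importance is the total risk reduction attributed to a base learner.
risk_reduction = -diff(train_risk)
importance = sort(
  vapply(split(risk_reduction, selected), sum, numeric(1)),
  decreasing = TRUE
)

# Keep only the num_feats most important entries (all if num_feats is NULL).
num_feats = 1L
top = if (is.null(num_feats)) importance else head(importance, num_feats)
```

Because the measure accumulates over iterations, training far past the point of overfitting keeps adding small reductions, which is one reason early stopping is recommended above.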

Method `saveToJson()`

Save a Compboost object to a JSON file. Because of the underlying C++ objects, it is not possible to use R's native load and save methods.

Usage

Compboost$saveToJson(file, rm_data = FALSE)

Arguments

file: (character(1))
Name/path to the file.
rm_data: (logical(1))
Remove all data from the model. This applies to the training data, response, as well as the test data and response used for the test risk logging. Note: If data is removed, no continuation of the training is possible after reloading. Also, everything related to predictions based on the training data throws an error.
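
A save and reload round trip might look as follows. This is a sketch that only uses the methods and arguments documented on this page ($saveToJson() and the file argument of $new()); it is guarded so that it only runs when the compboost package is installed, and the object names are arbitrary.

```r
# Only runs if the compboost package is installed.
if (requireNamespace("compboost", quietly = TRUE)) {
  library(compboost)

  cboost = Compboost$new(mtcars, "mpg", loss = LossQuadratic$new())
  cboost$addBaselearner("hp", "spline", BaselearnerPSpline)
  cboost$train(100, 0)

  # Write the model to a temporary JSON file and restore it via the file argument.
  json_file = tempfile(fileext = ".json")
  cboost$saveToJson(json_file)
  cboost_reloaded = Compboost$new(file = json_file)

  # Predict on explicitly passed data with the reloaded model.
  pred_reloaded = cboost_reloaded$predict(mtcars)
  unlink(json_file)
}
```

Passing newdata explicitly keeps the example independent of whether the training data was kept (rm_data = FALSE) or removed when saving.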

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Compboost$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

cboost = Compboost$new(mtcars, "mpg", loss = LossQuadratic$new(), oob_fraction = 0.3)
cboost$addBaselearner("hp", "spline", BaselearnerPSpline, degree = 3,
  n_knots = 10, df = 3, differences = 2)
cboost$addBaselearner("wt", "spline", BaselearnerPSpline)
cboost$train(1000, 0)
#> Train 1000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.43
#> 

table(cboost$getSelectedBaselearner())
#> 
#> hp_spline wt_spline 
#>       366       634 
head(cboost$logs)
#>   _iterations oob_risk baselearner train_risk
#> 1           0       NA   intercept   16.31514
#> 2           1 19.31421   wt_spline   14.93635
#> 3           2 18.03059   wt_spline   13.69065
#> 4           3 16.87434   wt_spline   12.56514
#> 5           4 15.83312   wt_spline   11.54815
#> 6           5 14.89575   wt_spline   10.62915
names(cboost$baselearner_list)
#> [1] "hp_spline" "wt_spline"

# Access information about a base learner in the list:
cboost$baselearner_list$hp_spline$factory$getDF()
#>      [,1]
#> [1,]    3
cboost$baselearner_list$hp_spline$factory$getPenalty()
#>          [,1]
#> [1,] 67.47968
plotBaselearner(cboost, "hp_spline")
