adaBoostClassifier

Syntax

adaBoostClassifier(ds, yColName, xColNames, numClasses, [maxFeatures=0], [numTrees=10], [numBins=32], [maxDepth=10], [minImpurityDecrease=0.0], [learningRate=0.1], [algorithm="SAMME.R"], [randomSeed])

Arguments

ds is the data source(s) used for training. It can be generated with the function sqlDS.

yColName is a string indicating the name of the category column in the data sources.

xColNames is a string scalar/vector indicating the names of the feature columns in the data sources.

numClasses is a positive integer indicating the number of categories in the category column. The values in the category column must be integers in [0, numClasses).

maxFeatures is an integer or a floating-point number indicating the number of features to consider when looking for the best split. The default value is 0.

If maxFeatures is a positive integer, then maxFeatures features are considered at each split.
If maxFeatures is 0, then sqrt(the number of feature columns) features are considered at each split.
If maxFeatures is a floating-point number between 0 and 1, then int(maxFeatures * the number of feature columns) features are considered at each split.
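
As a rough illustration of the three cases above, the following sketch assumes a small made-up table t with four feature columns (x0 to x3). The table, its values and the variable names are hypothetical. The feature counts in the comments are derived from the rules above, not read back from the trained models.

$ n = 20
$ t = table(take(0 1, n) as cls, norm(0, 1, n) as x0, norm(0, 1, n) as x1, norm(0, 1, n) as x2, norm(0, 1, n) as x3)
$ xCols = `x0`x1`x2`x3
$ // maxFeatures omitted (default 0): sqrt(4) = 2 features considered at each split
$ m0 = adaBoostClassifier(sqlDS(<select * from t>), `cls, xCols, 2)
$ // positive integer: 3 features considered at each split
$ m1 = adaBoostClassifier(sqlDS(<select * from t>), `cls, xCols, 2, 3)
$ // fraction between 0 and 1: int(0.5 * 4) = 2 features considered at each split
$ m2 = adaBoostClassifier(sqlDS(<select * from t>), `cls, xCols, 2, 0.5)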

numTrees is a positive integer indicating the number of trees. The default value is 10.

numBins is a positive integer indicating the number of bins used when discretizing continuous features. The default value is 32. Increasing numBins allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication time.

maxDepth is a positive integer indicating the maximum depth of a tree. The default value is 10.

minImpurityDecrease is a floating-point number. A node will be split if the split induces a decrease of the Gini impurity greater than or equal to this value. The default value is 0.

learningRate is a positive floating-point number indicating the contribution of a classifier to the next classifier. The default value is 0.1.

algorithm is a string indicating the algorithm used. It can take the value of either “SAMME.R” or “SAMME”. The default value is “SAMME.R”.

randomSeed is the seed used by the random number generator.

Details

Fit an AdaBoost classification model. The result is a dictionary with the following keys: numClasses, minImpurityDecrease, maxDepth, numBins, numTrees, maxFeatures, model, modelName, xColNames, learningRate and algorithm. model is a tuple holding the trained trees; modelName is “AdaBoost Classifier”.

The fitted model can be used as an input for function predict.

Examples

Fit an AdaBoost classification model with simulated data:

$ t = table(100:0, `cls`x0`x1, [INT,DOUBLE,DOUBLE])
$ n=5
$ cls = take(0, n)
$ x0 = norm(0, 10, n)
$ x1 = norm(0, 10, n)
$ insert into t values (cls, x0, x1)
$ cls = take(1, n)
$ x0 = norm(1, 10, n)
$ x1 = norm(1, 10, n)
$ insert into t values (cls, x0, x1)
$ model = adaBoostClassifier(sqlDS(<select * from t>), `cls, `x0`x1, 2);
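
Because the result is a dictionary, its entries can be inspected directly. The lines below are a minimal sketch that continues from the model fitted above and assumes the keys are exactly as listed under Details:

$ // list all keys of the returned dictionary
$ model.keys()
$ // look up individual entries by key
$ model[`modelName]
$ model[`algorithm]
$ model[`learningRate]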

Use the fitted model for prediction:

$ t1 = table(-0.5 0 1 2 as x0, -2 0 1 3 as x1)
$ predict(model, t1);
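
As a quick sanity check, the same call can be run on the training table and compared with the known labels. This sketch continues from the table t and the model above and assumes that predict returns one predicted class per input row. With only 10 simulated rows the resulting ratio says little about real accuracy.

$ // predict on the training features only
$ pred = predict(model, select x0, x1 from t)
$ // fraction of rows where the predicted class equals the label
$ sum(pred == t.cls) \ size(pred)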

Save the fitted model to disk:

$ saveModel(model, "C:/DolphinDB/data/classifierModel.bin");
$ loadModel("C:/DolphinDB/data/classifierModel.bin");

Related functions: adaBoostRegressor, randomForestClassifier, randomForestRegressor