randomForestRegressor

Syntax

randomForestRegressor(ds, yColName, xColNames, [maxFeatures=0], [numTrees=10], [numBins=32], [maxDepth=32], [minImpurityDecrease=0.0], [numJobs=-1], [randomSeed])

Arguments

ds is the data sources used for training. It can be generated with the function sqlDS.

yColName is a string indicating the dependent variable column.

xColNames is a string scalar/vector indicating the names of the feature columns.

maxFeatures is an integer or a floating-point number indicating the number of features to consider when looking for the best split. The default value is 0.

  • If maxFeatures is a positive integer, maxFeatures features are considered at each split.

  • If maxFeatures is 0, sqrt(the number of feature columns) features are considered at each split.

  • If maxFeatures is a floating-point number between 0 and 1, int(maxFeatures * the number of feature columns) features are considered at each split.
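
As a rough illustration of the three modes, the sketch below passes each kind of maxFeatures value to randomForestRegressor. The simulated table t and the names ds, m1, m2 and m3 are assumptions made for this illustration. With only two feature columns the choice has little practical effect, and how sqrt(2) is rounded is not specified in this section.

```dolphindb
// simulated data (illustrative only)
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)
ds = sqlDS(<select * from t>)

// positive integer: consider exactly 1 feature at each split
m1 = randomForestRegressor(ds, `y, `x1`x2, 1)
// 0 (the default): consider sqrt(number of feature columns) features
m2 = randomForestRegressor(ds, `y, `x1`x2, 0)
// fraction between 0 and 1: int(0.5 * 2) = 1 feature at each split
m3 = randomForestRegressor(ds, `y, `x1`x2, 0.5)
```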

numTrees is a positive integer indicating the number of trees in the random forest. The default value is 10.

numBins is a positive integer indicating the number of bins used when discretizing continuous features. The default value is 32. Increasing numBins allows the algorithm to consider more split candidates and make finer-grained split decisions, at the cost of more computation and communication time.

maxDepth is a positive integer indicating the maximum depth of a tree. The default value is 32.

minImpurityDecrease is the threshold for splitting a node: a node is split if the split induces a decrease of impurity greater than or equal to this value. The default value is 0.

numJobs is an integer controlling parallelism. If set to a positive number, it is the maximum number of concurrently running jobs. If set to -1, all CPU threads are used. If set to another negative integer, (the number of CPU threads + numJobs + 1) threads are used. For example, with 8 CPU threads, numJobs=-2 uses 7 threads. The default value is -1.

randomSeed is the seed used by the random number generator.
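
A minimal sketch of a call that sets every optional argument positionally, in the order given under Syntax. The specific values (20 trees, depth 10, 2 jobs, seed 42) and the simulated table t are arbitrary assumptions for illustration, not recommended settings.

```dolphindb
// simulated data (illustrative only)
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

// argument order after xColNames:
// maxFeatures=0, numTrees=20, numBins=32, maxDepth=10,
// minImpurityDecrease=0.0, numJobs=2, randomSeed=42
model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0, 20, 32, 10, 0.0, 2, 42)
```

Fixing randomSeed is the usual way to make repeated training runs comparable. Whether results are identical across different numJobs settings is not stated in this section.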

Details

Fit a random forest regression model. The result is a dictionary with the following keys: minImpurityDecrease, maxDepth, numBins, numTrees, maxFeatures, model, modelName and xColNames. model is a tuple containing the trained trees; modelName is "Random Forest Regressor".

The fitted model can be used as an input to the function predict.
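
To illustrate the returned dictionary, the sketch below trains a small model, reads a few of its entries, and passes it to predict. The simulated table t and the variable names are assumptions for illustration.

```dolphindb
// simulated data (illustrative only)
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2)

keys(model)          // names of the entries in the result dictionary
model[`modelName]    // name of the fitted model
model[`xColNames]    // feature columns used for training

// the dictionary as a whole is what predict takes as its first argument
yhat = predict(model, t)
```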

Examples

Fit a random forest regression model with simulated data:

$ x1 = rand(100.0, 100)
$ x2 = rand(100.0, 100)
$ b0 = 6
$ b1 = 1
$ b2 = -2
$ err = norm(0, 10, 100)
$ y = b0 + b1 * x1 + b2 * x2 + err
$ t = table(x1, x2, y)
$ model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2)
$ yhat=predict(model, t);

$ plot(y, yhat, ,SCATTER);
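
Besides the scatter plot, a single-number summary of fit can be useful. The sketch below computes the root mean squared error of the predictions. The name rmse is introduced here for illustration, and because the error is measured on the same rows used for training, it will tend to understate the error on new data.

```dolphindb
// simulated data, as in the example above
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2)
yhat = predict(model, t)

// root mean squared error between observed and predicted values
rmse = sqrt(avg(pow(y - yhat, 2)))
```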

Save the trained model to disk:

$ saveModel(model, "C:/DolphinDB/Data/regressionModel.txt");

Load a saved model:

$ model=loadModel("C:/DolphinDB/Data/regressionModel.txt");