Are TensorFlow Decision Forests RandomForestModel attributes num_candidate_attributes and min_examples synonymous with sklearn's RandomForestRegressor's max_features and min_samples_split?

I am comparing a tfdf RandomForestModel, specifically for a regression task, to the performance of an sklearn RandomForestRegressor model. The published hyperparameters for the sklearn model are:

  • max_features=6
  • n_estimators=50
  • max_depth=None
  • min_samples_split=2

I am not getting similar performance for the two models.

My constructor and fit calls are the following:

sk_rf_model = RandomForestRegressor(max_features=6, n_estimators=50, max_depth=None, min_samples_split=2), y_npy, sample_weight=train_data['sample_weight'])
RMSE: 0.01954
MAE: 0.0059

tfdf_rf_model = tfdf.keras.RandomForestModel(num_trees=50, verbose=2, num_candidate_attributes=6, min_examples=2, max_depth=None, task=tfdf.keras.Task.REGRESSION, num_threads=1), y=y_npy, sample_weight=train_data['sample_weight'].to_numpy())
RMSE: 0.02304
MAE: 0.0088

I set num_threads to 1 to compare single-threaded behavior on both sides, but that does not alleviate the difference.

The previously published model I am comparing uses the noted RF hyperparameters.

Sklearn RandomForestRegressor Documentation Hyperparameters:
max_features {“sqrt”, “log2”, None}, int or float, default=1.0
The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.
  • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
  • If “sqrt”, then max_features=sqrt(n_features).
  • If “log2”, then max_features=log2(n_features).
  • If None or 1.0, then max_features=n_features.

min_samples_split int or float, default=2
The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.
  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

TensorFlow Decision Forests Documentation Hyperparameters:

num_candidate_attributes Number of unique valid attributes tested for each node. An attribute is valid if it has at least a valid split. If num_candidate_attributes=0, the value is set to the classical default value for Random Forest: sqrt(number of input attributes) in case of classification and number_of_input_attributes / 3 in case of regression. If num_candidate_attributes=-1, all the attributes are tested. Default: 0.

min_examples Minimum number of examples in a node. Default: 5.
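For reference, the documented default for num_candidate_attributes=0 can be sketched in plain Python. The exact rounding TF-DF applies is an assumption here; the documentation only states sqrt(...) for classification and .../3 for regression:

```python
import math

def tfdf_default_num_candidate_attributes(n_features: int, task: str) -> int:
    """Documented TF-DF default when num_candidate_attributes=0:
    sqrt(#features) for classification, #features / 3 for regression.
    Truncating to int and flooring at 1 are assumptions; the docs do
    not spell out the rounding rule."""
    if task == "classification":
        return max(1, int(math.sqrt(n_features)))
    return max(1, n_features // 3)

# e.g. a regression forest on 18 input features would consider
# 6 candidate attributes per split:
print(tfdf_default_num_candidate_attributes(18, "regression"))  # 6
```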


Hi, TFDF author here,

Apologies for the late reply - our GitHub issues are monitored more closely for questions.

The performance of Random Forests varies greatly with the random seed that is used. This probably explains most of the variance you see. For example, on the Abalone regression dataset, a 50-tree Scikit-Learn RandomForestRegressor achieves an RMSE between 2.21469 and 2.2800 when varying random_state between 0 and 99. Note that picking the same random seed for TF-DF and Scikit-Learn will NOT make the results any closer.

Furthermore, there are some differences in the default hyperparameters of TF-DF and Scikit-Learn that your setup does not account for.

  • max_depth=None: In Scikit-Learn, this grows trees without limit (unless some other stopping criterion is met). In TF-DF, it grows trees only up to depth 16. For unlimited growth in TF-DF, set max_depth=-1 [1]
  • Split criterion (Gini vs. Entropy) and Voting mechanism (Weighted vs. Winner-takes-all) are also different, but this does not affect regression models.
  • Possibly other implementation differences (rounding errors etc.)

[1] I realize that this is not properly documented; we will fix this documentation issue in TF-DF.
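Putting these points together, a TF-DF constructor that mirrors the Scikit-Learn setup would look like the sketch below. This is a configuration sketch, not a verified equivalence: min_examples constrains the number of examples in a node, which is close to but not identical to sklearn's min_samples_split.

```python
import tensorflow_decision_forests as tfdf

# max_depth=-1 gives unlimited tree growth, matching Scikit-Learn's
# max_depth=None (TF-DF's default of 16 silently caps depth otherwise).
tfdf_rf_model = tfdf.keras.RandomForestModel(
    num_trees=50,
    num_candidate_attributes=6,  # ~ sklearn max_features=6
    min_examples=2,              # ~ sklearn min_samples_split=2
    max_depth=-1,                # unlimited, like max_depth=None
    task=tfdf.keras.Task.REGRESSION,
)
```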