Sklearn random forests results seem to be different to tfdf's

keile · June 24, 2021, 11:00am

I used both sklearn’s random forest and tfdf on the same dataset. The results were very different between the two. Below is my configuration for the sklearn one.

RandomForestClassifier(n_estimators=1000, max_depth=16, oob_score=True, min_samples_leaf=1, random_state=42, n_jobs=-1)

I tried to use the same configuration with tfdf, but no luck. Please correct me if my configuration is wrong.

Mathieu · June 24, 2021, 3:32pm

Hi,

While both sklearn and TF-DF implement the classical Random Forest algorithm, there are a few differences between the implementations. For this reason, the results (both the model structure and the model quality) are not expected to be exactly the same, but they should still be very close.

Following are some parameter values that should make sklearn’s RandomForestClassifier as close as possible to TF-DF’s Random Forest.

PS: Random Forest and Gradient Boosted Trees are different algorithms.

n_estimators = 300
max_depth = 16
criterion = "entropy"
min_samples_split = 5

In addition, if the problem is a regression, make sure to have:

max_features = 1./3
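As a rough illustration, the values above could be passed to sklearn as follows. This is only a sketch: the synthetic dataset and `random_state` are assumptions added so the snippet runs on its own, and the regressor leaves `criterion` at its default because "entropy" only applies to classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the four values listed above.
clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=16,
    criterion="entropy",
    min_samples_split=5,
    random_state=42,  # assumption: fixed seed for repeatable runs
)

# Regression: same tree settings plus max_features = 1/3.
reg = RandomForestRegressor(
    n_estimators=300,
    max_depth=16,
    min_samples_split=5,
    max_features=1.0 / 3,
    random_state=42,  # assumption: fixed seed for repeatable runs
)

# Assumption: small synthetic 3-class dataset, only to show the call works.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)
clf.fit(X, y)
print(clf.score(X, y))
```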

If your dataset contains categorical or categorical-set features, there are no equivalent parameters for sklearn, as it does not support those types of features.

If the differences are large, it would be very interesting for us to look at them.

keile · June 25, 2021, 1:40am

Hello Mathieu,
Thanks for your answer!

The results are hugely different I’d say. I have 3 classes that I’d like to classify, please see the results below!

The result from Sklearn

              precision    recall  f1-score   support
          -1       0.67      0.04      0.07       338
           0       0.71      0.99      0.83      1002
           1       0.00      0.00      0.00        86

The result from TFDF

              precision    recall  f1-score   support
          -1       1.00      0.94      0.97       338
           0       0.90      1.00      0.95      1002
           1       0.00      0.00      0.00        86

I set everything just like the given code snippet. It’s intriguing, isn’t it?
The datasets I used for the two models were basically the same: all categorical (text) data was removed, and the targets (ground truth) were mapped to the positive integer indices [0, 1, 2]. Basically, the ingredients for sklearn and TFDF are the same.

Notice that the dataset is very imbalanced, but TFDF did a very impressive job. This is very cool, but I don’t want to be fooled by the metrics. I just want to make sure the models work correctly. ^^
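One quick way to avoid being fooled on an imbalanced dataset is to look at macro-averaged scores next to overall accuracy. The sketch below only reuses the numbers from the two reports above; the `summarize` helper is hypothetical, and the accuracy is an approximation because it is rebuilt from rounded recall values.

```python
# (precision, recall, f1, support) per class, copied from the reports above.
sklearn_report = {
    -1: (0.67, 0.04, 0.07, 338),
    0: (0.71, 0.99, 0.83, 1002),
    1: (0.00, 0.00, 0.00, 86),
}
tfdf_report = {
    -1: (1.00, 0.94, 0.97, 338),
    0: (0.90, 1.00, 0.95, 1002),
    1: (0.00, 0.00, 0.00, 86),
}


def summarize(report):
    """Return (macro F1, approximate accuracy) for a per-class report."""
    total = sum(support for _, _, _, support in report.values())
    # Macro F1 weights every class equally, regardless of its size.
    macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)
    # recall * support approximates the correctly classified rows per class.
    correct = sum(recall * support for _, recall, _, support in report.values())
    return macro_f1, correct / total


for name, report in [("sklearn", sklearn_report), ("tfdf", tfdf_report)]:
    macro_f1, accuracy = summarize(report)
    print(f"{name}: macro F1 = {macro_f1:.2f}, approx. accuracy = {accuracy:.2f}")
```

In both reports the minority class 1 has a recall of 0.00, so neither model ever predicts it correctly; overall accuracy alone hides that.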

keile · June 29, 2021, 7:50am

Hello guys,

Just to clarify, the performance of sklearn’s and TensorFlow’s random forests is the same. It was actually my fault in processing the data: I had removed the most important feature from the training data. In my case, the sklearn side works a little better. Have a nice day!
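A mistake like this can be caught early by comparing the feature columns fed to each model. The helper and column names below are hypothetical, purely for illustration.

```python
def diff_feature_columns(cols_a, cols_b):
    """Return (only_in_a, only_in_b) as sorted lists of column names."""
    a, b = set(cols_a), set(cols_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical column names; replace with the real training columns.
sklearn_columns = ["f1", "f2", "f3"]
tfdf_columns = ["f1", "f2", "f3", "f4"]

only_sklearn, only_tfdf = diff_feature_columns(sklearn_columns, tfdf_columns)
print("only in sklearn input:", only_sklearn)
print("only in tfdf input:", only_tfdf)
```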

Jan · June 30, 2021, 10:04am

Hi Keile, I’m happy to hear that. If the results were significantly different (in either direction), we would be concerned and curious to hear more details.

cheers!