TensorFlow Decision Forests 0.1.3 open sourced

We are happy to open source TensorFlow Decision Forests (TF-DF) for TensorFlow 2.5.0 and Keras.

See our short presentation from Google I/O 2021.

You can find details and tutorials about the library at tensorflow.org/decision_forests.

Don’t hesitate to ask us questions (tfdf tag)! :slight_smile:

23 Likes

Great to hear this!
Is this Decision Forests based on traditional machine learning like random forests or XGBoost, or can we call it deep learning?

3 Likes

Good question. Tree-based models fall under traditional machine learning (in contrast to deep learning, which uses neural networks). The biggest difference is that when you’re working with trees, you usually have a smaller number of features that you can intuitively understand and reason about (for example, a patient’s blood pressure, or the number of times a baseball team has won a game). In deep learning, you typically have a very large number of features (for example, every pixel value in an image), and these individual features are not very meaningful (e.g., you couldn’t tell if an image was a cat or a dog just by knowing the value of some pixel).

Edit: see also the excellent point from Mathieu about how these techniques can be combined.

5 Likes

You can say this is “traditional/classical” ML :). For example, the core Random Forest implementation closely follows the Breiman 2001 paper, while the Gradient Boosted Trees implementation follows the Friedman 1999 paper. And then we spent some time implementing several new techniques.

Because of the composability of the TensorFlow/Keras APIs, TF-DF models can be composed into larger models. For example, stacking a decision forest on top of a pre-trained neural network embedding works really well (see the sketch below). Ensembling a decision forest and a neural network is also a strong modeling solution.

So in a sense, this is a classical+deep learning library :slight_smile: ?
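
As a rough sketch of that kind of stacking, assuming a text dataset whose examples have a single "sentence" feature and an arbitrary TF-Hub text embedding (both the feature name and the hub handle below are placeholders, not part of TF-DF):

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_decision_forests as tfdf

# Pre-trained text embedding used as a preprocessing stage (placeholder handle).
text_input = tf.keras.Input(shape=(), name="sentence", dtype=tf.string)
embedded = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2")(text_input)
preprocessing = tf.keras.Model(inputs=text_input, outputs=embedded)

# Train a Gradient Boosted Trees model on top of the frozen embedding.
model = tfdf.keras.GradientBoostedTreesModel(preprocessing=preprocessing)
# model.fit(train_ds)  # train_ds: a tf.data.Dataset yielding ({"sentence": ...}, label)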

6 Likes

Woww… top ML guys answered my question, thank you very much @jbgordon and @Mathieu :heart_eyes: I often learn about machine learning from your videos…

Thanks @jbgordon, agreed, neural networks and tree-based models are different. Even though I prefer using neural networks over tree-based models, even for classification models with simple features. :blush:

And yeah @Mathieu, it will be great to combine tree-based models with neural networks… :heart_eyes:

Can’t wait to compare “Decision Forests” with XGBoost and the other boosting stuff :smiley:

4 Likes

Wow! All in one, I love it! Thank you guys for your tremendous work!

Quick question here please: how does this new TensorFlow Decision Forests differ from the tree-based algorithms we already have in the tf.estimator module?

Also, does this new TF-DF library mean there is no more need for the ones from scikit-learn or even XGBoost? :wink:

And last but not least, should we tag it tf-df or tfdf?

Thx.

1 Like

Hi Kader,

Thanks for the enthusiasm and the great questions :slight_smile:

how does this new TensorFlow Decision Forests differ from the tree-based algorithms we already have in the tf.estimator module?

There are two main differences: API and algorithms.

The API:

TF-DF uses the Keras API while tf.estimator.BoostedTrees uses the TF1 estimator API. We think TF-DF is simpler to use (no need to create feature columns, no input_fn, etc.) and to compose (e.g. stacking models with tf.keras.Sequential, or using a TF-Hub embedding for pre-processing).
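
For instance, a minimal sketch of the Keras-style workflow, assuming a pandas DataFrame train_df with a target column named "label" (both are placeholders for illustration):

import tensorflow_decision_forests as tfdf

# Convert the DataFrame directly into a tf.data.Dataset: no feature columns, no input_fn.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")

model = tfdf.keras.RandomForestModel()
model.fit(train_ds)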

The algorithms:

TF-DF is a collection of algorithms all implemented in C++. By default, it runs the classical/exact Random Forest and Gradient Boosted Machine algorithms, which are similar to the ones in scikit-learn or R's randomForest. With hyper-parameters, you can enable more recent techniques, similar to the ones used in XGBoost, LightGBM, and even some newer ones (e.g. sparse oblique trees work very well :slight_smile: ).
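
As a hedged sketch of enabling one of those newer techniques through hyper-parameters (the split_axis value below is the one documented for sparse oblique splits; check the hyper-parameter list of your TF-DF version):

import tensorflow_decision_forests as tfdf

# Use sparse oblique splits instead of the default axis-aligned splits.
model = tfdf.keras.GradientBoostedTreesModel(split_axis="SPARSE_OBLIQUE")
# model.fit(train_ds)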

tf.estimator.BoostedTreesEstimator is implemented in TensorFlow and can be seen as an approximate Gradient Boosted Trees algorithm with a mini-batch training procedure described in this paper. We didn’t implement this algorithm in TF-DF, because in all our experiments/projects one of the other algorithms performed better.

TF-DF and tf.estimator.BoostedTreesEstimator don’t share any code.

Also, does this new TF-DF library mean there is no more need for the ones from scikit-learn or even XGBoost?

Short answer: no! :slight_smile:

There are many great decision forest libraries out there (XGBoost, CatBoost, LightGBM, scikit-learn, R gbm, R randomForest, R ranger, etc.), each one with a different set of algorithms and framework integration. It is awesome to have such diversity.

In general, the right library is the one that can be used easily (e.g. depending on the infra constraints and modeling complexity) and gives good results (which might vary slightly between implementations, and depend on the problem).

TF-DF focuses on Python and C++, and integrates well into the TensorFlow toolbox, which we believe can be compelling in many use cases.

And last but not least, should we tag it tf-df or tfdf ?

tf-df is the official shortcut. But https://tensorflow-prod.ospodiscourse.com/ does not support tags with “-”, so let’s go with tfdf.

4 Likes

“Short answer: no!” I love it.

Personally, I believe simplicity and composition are the game changers (time to market…) when it comes to choosing the right libraries, particularly if we want to avoid another AI winter.
Good results are important too, for sure, but that is more business dependent.

Long live tfdf, then!

#tfdf

1 Like

It’s an amazing library that will change TensorFlow from a deep-learning-only library into one that can work across all kinds of machine learning models, not just neural networks.
I wonder if there is a way to publish this library on important sites like Kaggle, and whether I can contribute.

3 Likes

Hi Guys,
thanks for bringing decision trees to TF!

Btw, are there any plans to extend the list of algorithms? I would especially appreciate having Quinlan’s C4.5, which is available in Weka.

Also, are there any methods (planned) you would suggest for interpretability of the available ensemble decision trees (e.g. random forest)? That would be very helpful since one of the important reasons for using decision trees is their interpretability.
Cheers

1 Like

Hello Martin

Btw, are there any plans to extend the list of algorithms? I would especially appreciate having Quinlan’s C4.5, which is available in Weka.

Yes and no.

We are planning (and currently working) on new algorithms for TF-DF. The choice of the next algorithm to implement is generally guided by application needs or literature results. Btw, we are open source; all contributions are welcome ;).

Regarding C4.5, we don’t have any immediate plans to implement it. However, we have a CART implementation. The main difference between the two algorithms is in the pruning stage (and some of the features), but both output decision trees that can be used more-or-less interchangeably.

If you need specifically C4.5, please create a feature request in Issues · tensorflow/decision-forests · GitHub.

For interpretability we have a few open-sourced algorithms (e.g. plotting of trees, structural and evaluation-based feature importances) and a few other algorithms should be open sourced soon (e.g. Breiman similarity, feature distribution in tree plotting). If you have specific algorithm needs, please feel free to create a feature request and/or to propose an implementation – we are always curious to hear.

Note on interpretability: interpretability is unfortunately an ambiguous term, and it covers various aspects, just to mention some: dataset data analysis, model analysis, debugging features, debugging a specific inference, trust on results, fairness (and its various definitions), tools to improve the model, etc. It’s a very wide subject. We do like using a plotted decision tree (training a CART model with TF-DF) as a dataset analysis technique: it will quickly tell you how the data is largely distributed. See the beginner tutorial.
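
As a small sketch of that dataset-analysis trick (assuming a train_ds built with pd_dataframe_to_tf_dataset, and running in a Colab/notebook environment):

import tensorflow_decision_forests as tfdf

# Train a single CART tree and plot it to get a quick picture of the data.
cart = tfdf.keras.CartModel()
cart.fit(train_ds)

# Renders an interactive plot of the tree (here limited to depth 3) in a notebook.
tfdf.model_plotter.plot_model_in_colab(cart, tree_idx=0, max_depth=3)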

Cheers,

2 Likes

THANK YOU SO MUCH FOR THE WONDERFUL DETAILS AND GUIDANCE. :blush:

2 Likes

Thanks a lot Mathieu for your answer.

Regarding interpretability: I was more wondering if there is a way to get a global indication of feature importance for the whole ensemble. If I am not mistaken, with the provided methods you can only inspect specific trees inside the ensemble?

Btw, missing values are allowed in the attributes but not in the class label, it seems, right?

thanks.

1 Like

Martin_Marzi / feature importances

TF-DF feature importances (also known as variable importances) express the “importance” of each individual feature for the entire model. I think this is what you are looking for.

See the Model structure and feature importance section of the beginner tutorial for usage examples.

Note that some feature importances are specific to some models and hyper-parameter values. For example, the “Out-of-bag permutation mean decrease in accuracy” is only computed for a classification Random Forest model trained with compute_oob_variable_importances=True.

The Permutation Feature Importance chapter in Interpretable Machine Learning by Molnar is a great resource to learn about feature importances.
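
For a quick sketch of querying them programmatically (assuming a train_ds as in the tutorials; the inspector API below is the one shown in the beginner tutorial):

import tensorflow_decision_forests as tfdf

# Request the OOB permutation importance at training time (Random Forest only).
model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
model.fit(train_ds)

# The inspector exposes model-wide (not per-tree) feature importances.
inspector = model.make_inspector()
for name, values in inspector.variable_importances().items():
    print(name, values)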

Missing values

You are right. Unlike features, labels cannot contain missing values.

Nipun_Kumar

Thanks :slight_smile:

2 Likes

Great thanks a lot @Mathieu for the explanation!

2 Likes

Hi,
I was searching the documentation but could not find an option for cross-validation; is there one? It would be useful for small datasets.

thanks!

1 Like

This is one of the features we have not open sourced yet :slight_smile: . I created an issue to keep track of it.

In the meantime, the sklearn toolbox is quite useful. Here is a verbose example of 10-fold cross-validation of TF-DF using sklearn.

from sklearn.model_selection import KFold
import numpy as np
import tensorflow_decision_forests as tfdf

accuracies_per_fold = []  # Test accuracy on the individual folds.

# Run a 10-fold cross-validation. "all_df" is the full dataset as a pandas DataFrame.
for fold_idx, (train_indices, test_indices) in enumerate(KFold(n_splits=10, shuffle=True).split(all_df)):

  print(f"Running fold {fold_idx+1}")

  # Extract the training and testing examples.
  sub_train_df = all_df.iloc[train_indices]
  sub_test_df = all_df.iloc[test_indices]

  # Convert the examples into tensorflow datasets.
  sub_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_train_df, label="income")
  sub_test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_test_df, label="income")

  # Train the model.
  model = tfdf.keras.GradientBoostedTreesModel()
  model.fit(sub_train_ds, verbose=False)

  # Evaluate the model.
  model.compile(metrics=["accuracy"])
  evaluation = model.evaluate(sub_test_ds, return_dict=True, verbose=False)
  print(f"Evaluation {evaluation}")

  accuracies_per_fold.append(evaluation["accuracy"])

print(f"Cross-validated accuracy: {np.mean(accuracies_per_fold)}")

Output:

Running fold 1
Evaluation {'loss': 0.0, 'accuracy': 0.8780701756477356}
Running fold 2
Evaluation {'loss': 0.0, 'accuracy': 0.8833333253860474}
Running fold 3
Evaluation {'loss': 0.0, 'accuracy': 0.8841597437858582}
Running fold 4
Evaluation {'loss': 0.0, 'accuracy': 0.8692408800125122}
Running fold 5
Evaluation {'loss': 0.0, 'accuracy': 0.8679245114326477}
Running fold 6
Evaluation {'loss': 0.0, 'accuracy': 0.8639754056930542}
Running fold 7
Evaluation {'loss': 0.0, 'accuracy': 0.8745063543319702}
Running fold 8
Evaluation {'loss': 0.0, 'accuracy': 0.8679245114326477}
Running fold 9
Evaluation {'loss': 0.0, 'accuracy': 0.8609039187431335}
Running fold 10
Evaluation {'loss': 0.0, 'accuracy': 0.8613426685333252}
Cross-validated accuracy: 0.8711381494998932
5 Likes

Thanks @Mathieu for the code example. I am familiar with the option in sklearn; it would just be nice to have it integrated in tfdf, so thanks for keeping it in mind!

2 Likes

Any plans to integrate this into the TF C API so we can wrap it in TensorFlow-Java?

2 Likes

Hi Adam,

Can you share some details about your specific requirements for TF C API (if any)?

Background

The training and inference ops are implemented in C++. They are directly callable with the C and C++ TF API.

Alternatively, Yggdrasil Decision Forests offers a C++ API to the library.

2 Likes