TensorFlow Decision Forests 0.1.3 open sourced

Wow… top ML folks answering my question, thank you very much @jbgordon and @Mathieu :heart_eyes: I often learn about machine learning from your videos…

Thanks @jbgordon, agreed, neural networks and tree-based models are different. Even so, I prefer using neural networks over tree-based models, even for simple classification features. :blush:

And yeah @Mathieu, it will be great to combine tree-based models with neural networks… :heart_eyes:

Can’t wait to compare “Decision Forests” with XGBoost and other boosting libraries :smiley:

4 Likes

Wow! All in one, I love it! Thank you guys for your tremendous work!

Quick question here please: how does this new TensorFlow Decision Forests library differ from the tree-based algorithms we already have in the tf.estimator module?

Also, does this new TF-DF library mean there is no more need for the ones from scikit-learn or even XGBoost? :wink:

And last but not least, should we tag it tf-df or tfdf?

Thx.

1 Like

Hi Kader,

Thanks for the enthusiasm and the great questions :slight_smile:

how does this new TensorFlow Decision Forests library differ from the tree-based algorithms we already have in the tf.estimator module?

There are two main differences: API and algorithms.

The API:

TF-DF uses the Keras API while tf.estimator.BoostedTrees uses the TF1 Estimator API. We think TF-DF is simpler to use (no need to create feature columns, no input_fn, etc.) and to compose (e.g. stacking models with tf.keras.Sequential, or using a TF Hub embedding for pre-processing).
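For illustration, here is a minimal sketch of the Keras-style workflow. The file name "train.csv" and the "label" column are placeholders for your own data:

import pandas as pd
import tensorflow_decision_forests as tfdf

# Hypothetical training data: a pandas DataFrame with a "label" column.
train_df = pd.read_csv("train.csv")

# Convert the DataFrame into a TensorFlow dataset (no feature columns, no input_fn).
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")

# Train a Random Forest with the regular Keras API.
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Inspect the trained model like any other Keras model.
model.summary()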

The algorithms:

TF-DF is a collection of algorithms, all implemented in C++. By default, it runs the classical/exact Random Forest and Gradient Boosted Machine algorithms, similar to the ones in scikit-learn or R’s Random Forest. With hyper-parameters, you can enable more recent techniques, similar to the ones used in XGBoost, LightGBM, and even some newer ones (e.g. sparse oblique trees work very well :slight_smile: ).
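As a sketch of what enabling one of those techniques looks like in practice: the hyper-parameter names below (split_axis, sparse_oblique_normalization) are the ones used for sparse oblique splits, but please double-check them against the TF-DF hyper-parameter documentation; train_ds is assumed to be a dataset built as in the previous snippet.

import tensorflow_decision_forests as tfdf

# Default constructor: classical/exact Gradient Boosted Trees.
default_model = tfdf.keras.GradientBoostedTreesModel()

# Enable sparse oblique splits through hyper-parameters.
oblique_model = tfdf.keras.GradientBoostedTreesModel(
    split_axis="SPARSE_OBLIQUE",
    sparse_oblique_normalization="MIN_MAX",
)
oblique_model.fit(train_ds)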

tf.estimator.BoostedTreesEstimator is implemented in TensorFlow and can be seen as an approximate Gradient Boosted Trees algorithm with a mini-batch training procedure described in this paper. We didn’t implement this algorithm in TF-DF because, in all our experiments/projects, one of the other algorithms performed better.

TF-DF and tf.estimator.BoostedTreesEstimator don’t share any code.

Also, does this new TF-DF library mean there is no more need for the ones from scikit-learn or even XGBoost?

Short answer: no! :slight_smile:

There are many great decision forest libraries out there (XGBoost, CatBoost, LightGBM, scikit-learn, R gbm, R randomForest, R ranger, etc.), each one with a different set of algorithms and framework integrations. It is awesome to have such diversity.

In general, the right library is the one that can be used easily (e.g. depending on the infrastructure constraints and modeling complexity) and gives good results (which might vary slightly between implementations and depend on the problem).

TF-DF focuses on Python and C++, and integrates well into the TensorFlow toolbox, which we believe can be compelling in many use cases.

And last but not least, should we tag it tf-df or tfdf?

tf-df is the official shortcut. But https://tensorflow-prod.ospodiscourse.com/ does not support tags with “-”, so let’s go with tfdf.

4 Likes

“Short answer: no.” I love it.

Personally, I believe simplicity and composition are the game changers (time to market…) when it comes to choosing the right libraries, particularly if we want to avoid another AI winter.
Good results are of course also important, but that is more business dependent.

Long live tfdf then!

#tfdf

1 Like

It’s an amazing library that will turn TensorFlow from a library for deep learning only into a library that can work with all kinds of machine learning models, not just neural networks.
I wonder if there is a way to make this library available on important platforms like Kaggle, and whether I can contribute.

3 Likes

Hi Guys,
thanks for bringing decision trees to TF!

Btw, are there any plans to extend the list of algorithms? I would especially appreciate having Quinlan’s C4.5, which is available in Weka.

Also, are there any methods (planned) you would suggest for interpretability of the available ensemble decision trees (e.g. random forest)? That would be very helpful since one of the important reasons for using decision trees is their interpretability.
Cheers

1 Like

Hello Martin

Btw, are there any plans to extend the list of algorithms? I would especially appreciate having Quinlan’s C4.5, which is available in Weka.

Yes and no.

We are planning (and currently working on) new algorithms for TF-DF. The choice of the next algorithm to implement is generally guided by application needs or literature results. Btw, the project is open source, and all contributions are welcome ;).

Regarding C4.5, we don’t have any immediate plans to implement it. However, we have a CART implementation. The main difference between the two algorithms is in the pruning stage (and some of the features), but both output decision trees that can be used more or less interchangeably.

If you need specifically C4.5, please create a feature request in Issues · tensorflow/decision-forests · GitHub.

For interpretability, we have a few open-sourced algorithms (e.g. plotting of trees, structural and evaluation-based feature importances), and a few other algorithms should be open-sourced soon (e.g. Breiman similarity, feature distribution in tree plotting). If you have specific algorithm needs, please feel free to create a feature request and/or propose an implementation; we are always curious to hear.

Note on interpretability: interpretability is unfortunately an ambiguous term, and it covers various aspects, just to mention some: dataset analysis, model analysis, debugging features, debugging a specific inference, trust in results, fairness (and its various definitions), tools to improve the model, etc. It’s a very wide subject. We do like using a plotted decision tree (training a CART model with TF-DF) as a dataset analysis technique: it quickly tells you how the data is broadly distributed. See the beginner tutorial.
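For example, a minimal sketch of that workflow (train_ds is a placeholder for a dataset built with tfdf.keras.pd_dataframe_to_tf_dataset):

import tensorflow_decision_forests as tfdf

# Train a single decision tree (CART) on the dataset you want to analyze.
cart = tfdf.keras.CartModel()
cart.fit(train_ds)

# Render the tree inline in a Colab/Jupyter notebook.
tfdf.model_plotter.plot_model_in_colab(cart, tree_idx=0, max_depth=3)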

Cheers,

2 Likes

Thank you so much for the wonderful details and guidance. :blush:

2 Likes

Thanks a lot Mathieu for your answer.

Regarding interpretability: I was more wondering if there is a way to get a global indication of the feature importance for the whole ensemble result. If I am not mistaken, with the provided methods you can only inspect specific trees inside the ensemble?

Btw, missing values are allowed in the attributes but not in the class label, it seems, right?

thanks.

1 Like

Martin_Marzi / feature importances

TF-DF feature importances (also known as variable importances) express the “importance” of each individual feature for the entire model. I think this is what you are looking for.

See the Model structure and feature importance section of the beginner tutorial for usage examples.

Note that some feature importances are specific to certain models and hyper-parameter values. For example, the “Out-of-bag permutation mean decrease in accuracy” is only computed for a classification Random Forest model trained with compute_oob_variable_importances=True.

The Permutation Feature Importance chapter in Interpretable Machine Learning by Molnar is a great resource to learn about feature importances.
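For example, a minimal sketch of querying model-wide feature importances through the inspector API (train_ds is a placeholder dataset; the exact importance keys returned depend on the model and hyper-parameters):

import tensorflow_decision_forests as tfdf

# Random Forest with out-of-bag permutation variable importances enabled.
model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
model.fit(train_ds)

# Feature importances for the whole model (not individual trees).
inspector = model.make_inspector()
for name, importances in inspector.variable_importances().items():
  print(name, importances)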

Missing values

You are right. Unlike features, labels cannot contain missing values.

Nipun_Kumar

Thanks :slight_smile:

2 Likes

Great thanks a lot @Mathieu for the explanation!

2 Likes

Hi,
I was searching the documentation but could not find an option for cross-validation. Is there one? It would be useful for small datasets.

thanks!

1 Like

This is one of the features we have not open-sourced yet :slight_smile: . I created an issue to keep track of it.

In the meantime, the sklearn toolbox is quite useful. Here is a verbose example of 10-fold cross-validation of TF-DF using sklearn.

from sklearn.model_selection import KFold
import numpy as np
import tensorflow_decision_forests as tfdf

# `all_df` is a pandas DataFrame with all the examples and an "income" label column.
accuracies_per_fold = []  # Test accuracy on the individual folds.

# Run a 10-fold cross-validation.
for fold_idx, (train_indices, test_indices) in enumerate(KFold(n_splits=10, shuffle=True).split(all_df)):

  print(f"Running fold {fold_idx+1}")

  # Extract the training and testing examples.
  sub_train_df = all_df.iloc[train_indices]
  sub_test_df = all_df.iloc[test_indices]

  # Convert the examples into tensorflow datasets.
  sub_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_train_df, label="income")
  sub_test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(sub_test_df, label="income")

  # Train the model.
  model = tfdf.keras.GradientBoostedTreesModel()
  model.fit(sub_train_ds, verbose=False)

  # Evaluate the model.
  model.compile(metrics=["accuracy"])
  evaluation = model.evaluate(sub_test_ds, return_dict=True, verbose=False)
  print(f"Evaluation {evaluation}")

  accuracies_per_fold.append(evaluation["accuracy"])

print(f"Cross-validated accuracy: {np.mean(accuracies_per_fold)}")

Output:

Running fold 1
Evaluation {'loss': 0.0, 'accuracy': 0.8780701756477356}
Running fold 2
Evaluation {'loss': 0.0, 'accuracy': 0.8833333253860474}
Running fold 3
Evaluation {'loss': 0.0, 'accuracy': 0.8841597437858582}
Running fold 4
Evaluation {'loss': 0.0, 'accuracy': 0.8692408800125122}
Running fold 5
Evaluation {'loss': 0.0, 'accuracy': 0.8679245114326477}
Running fold 6
Evaluation {'loss': 0.0, 'accuracy': 0.8639754056930542}
Running fold 7
Evaluation {'loss': 0.0, 'accuracy': 0.8745063543319702}
Running fold 8
Evaluation {'loss': 0.0, 'accuracy': 0.8679245114326477}
Running fold 9
Evaluation {'loss': 0.0, 'accuracy': 0.8609039187431335}
Running fold 10
Evaluation {'loss': 0.0, 'accuracy': 0.8613426685333252}
Cross-validated accuracy: 0.8711381494998932
5 Likes

Thanks @Mathieu for the code example. I am familiar with the option in sklearn; it would just be nice to have it integrated into tfdf, so thanks for keeping it in mind!

2 Likes

Any plans to integrate this into the TF C API so we can wrap it in TensorFlow-Java?

2 Likes

Hi Adam,

Can you share some details about your specific requirements for TF C API (if any)?

Background

The training and inference ops are implemented in C++. They are directly callable with the C and C++ TF API.

Alternatively, Yggdrasil Decision Forests offers a C++ API to the library.

2 Likes

Yeah, sorry, after looking more closely through the ops I think we’ll probably be able to use that from Java with the usual amount of tricks. I’m not so sure how well it’ll work in eager mode, but the C API doesn’t really do that very well at the moment anyway. I know we’ve hit issues with resource variables in the past, and I’m not sure what kind of resource the tree model is, so that might be another issue.

I’ll have a look at it in some more detail in the future.

2 Likes

Please keep us posted on how it goes: the same probably applies to support in other languages, and we would add any such information to the docs.

1 Like

Thanks for making this available. I’m new to TensorFlow and ML, I’m somewhat overwhelmed, and I’m curious why this is separate from the main project and whether the plan is to merge it at some point. As I understand it, if I want to perform inference in C++ using a model saved from tfdf, I need to use Yggdrasil DF’s C++ API, is that correct? The model would not load via the normal TensorFlow C API (wrapped via the third-party cppflow). I am a bit confused since I saw the C++ code in the tfdf repo. Ideally I would like to be able to use tfdf and NN models in the same C++ binary, and I’m wondering if I can do that with cppflow and without getting Bazel involved.

Hi Oliver,

Thanks for the interest.

The code is separated to isolate TensorFlow and non-TensorFlow code. From the point of view of TF-DF, Yggdrasil DF is one of the third-party libraries. A TF-DF SavedModel is a Yggdrasil DF model plus some extra data.

If your language is C / C++, a TF-DF model can be served as:

  1. A classical SavedModel using the TensorFlow C++ or C APIs.
  2. A Yggdrasil DF model using the Yggdrasil DF C++ API.

The two options are not fully equivalent: solution 1 supports any complex SavedModel, including ones containing TensorFlow preprocessing or multiple sub-models. However, this solution can be slow, as TensorFlow adds non-negligible overhead. Solution 2 is designed for fast inference and ease of use. Compiling Yggdrasil is also significantly simpler and faster than compiling TensorFlow. However, this solution only supports “pure” Yggdrasil models (i.e. no TensorFlow preprocessing).
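As a minimal sketch of the export step from Python (train_ds and the output path are placeholders):

import tensorflow_decision_forests as tfdf

# Train a model, then export it.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Writes a regular SavedModel directory, usable with the TensorFlow C/C++ APIs
# (solution 1). Since a TF-DF SavedModel is a Yggdrasil DF model plus extra data,
# the embedded Yggdrasil model files can also be served with the Yggdrasil DF
# C++ API (solution 2).
model.save("/tmp/my_tfdf_model")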

If your SavedModel contains both TF-DF and NN components, you should use solution 1.

1 Like