I see! I got it working in Colab, which is fine for now. I am not sure how to add a validation set to it though (so that I have train, validation and test). Any ideas?
Thanks a bunch!
You can use the self-validation or validation in fit for metrics. See decision-forests/beginner_colab.ipynb at main · tensorflow/decision-forests · GitHub
The error was caused by the absence of the TF-DF pip package for py3.6. This is now solved. Thanks for the alert :).
Others might see the same error if they try to install TF-DF via pip on Windows or MacOS – we’re working on releasing those soon, and will update our Known Issues docs when we do!
Thanks Bhack for the answer. Following are some more details:
TL;DR: A validation set is not required for training (see rationale), and if you use one, you shouldn’t pass it to fit() but rather to evaluate().
Splitting your data into train/validation/test is generally good practice for ML. Most people do this so they can tune their training algorithm on held-out data and get better results without skewing the final test evaluation.
Decision forests generally deal with relatively small datasets, and TF-DF always internally holds out some parts of the training set to do something similar (stop training early if it looks like it will overfit). Because the datasets are small, it can be helpful to just train on all the examples from train + validation (concatenate them in the call to fit()).
You can use the model self-evaluation (e.g. out-of-bag for random forest) to get the held-out evaluation that is done during training.
If you want to evaluate your model on the validation split for another reason (e.g. hyperparameter tuning), you should call model.evaluate(validation_ds) manually. TF-DF always trains for exactly one epoch, so the evaluation you might expect from fit() while using a different TF Keras model won’t be what you get here.
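To make the two options above concrete, here is a minimal sketch (not an official recipe). The helper names three_way_split and train_and_evaluate, the split ratios, and the toy DataFrame are assumptions made for illustration; only pd_dataframe_to_tf_dataset, RandomForestModel, compile, fit and evaluate come from TF-DF / Keras. The TF-DF import is done lazily inside the function so the splitting part runs even without the package installed.

```python
import numpy as np
import pandas as pd


def three_way_split(df, valid_ratio=0.15, test_ratio=0.15, seed=0):
    """Randomly split a DataFrame into train, validation and test parts."""
    rng = np.random.default_rng(seed)
    draws = rng.random(len(df))
    test_mask = draws < test_ratio
    valid_mask = (draws >= test_ratio) & (draws < test_ratio + valid_ratio)
    train_mask = ~(test_mask | valid_mask)
    return df[train_mask], df[valid_mask], df[test_mask]


def train_and_evaluate(train_df, valid_df, test_df, label, merge_validation=True):
    """Hypothetical helper; requires tensorflow_decision_forests."""
    import tensorflow_decision_forests as tfdf  # lazy import on purpose

    to_ds = tfdf.keras.pd_dataframe_to_tf_dataset
    # Option 1: train on train + validation together (small datasets).
    fit_df = pd.concat([train_df, valid_df]) if merge_validation else train_df

    model = tfdf.keras.RandomForestModel()
    model.compile(metrics=["accuracy"])
    model.fit(to_ds(fit_df, label=label))

    results = {"test": model.evaluate(to_ds(test_df, label=label), return_dict=True)}
    if not merge_validation:
        # Option 2: keep validation aside and evaluate on it manually.
        results["validation"] = model.evaluate(
            to_ds(valid_df, label=label), return_dict=True
        )
    return results


# Toy data, only to show the split itself.
df = pd.DataFrame({"x": range(100), "label": [i % 2 for i in range(100)]})
train_df, valid_df, test_df = three_way_split(df)
```

With TF-DF installed, calling train_and_evaluate(train_df, valid_df, test_df, label="label") would then train and return the evaluation dictionaries.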
Hope this helps!
Thank you Arvind for the extra details. So what will happen when you pass the validation_data arg to fit() in this case?
I think users are in the habit of using the validation_data arg in fit, so it would be nice to have a disclaimer about the unexpected effects of this arg in the example/notebook or docs.
For now, nothing, unfortunately, which differs from the usual Keras model, where one would get a history of evaluations in the return value.
For now we briefly documented the difference (see the fit method returns), but we are already working on fixing this; it should be coming in the next few days (I now notice we should also have documented it on the validation_data argument, as well as on the various models).
The simple workaround for now is to call model.evaluate() on your validation data. Note that decision forests only train for one epoch, so one will only get one evaluation on the validation dataset anyway.
So how will it interact with the validation_steps arg?
So I’m assuming you are asking about the model.fit() method, right?
If yes, it works as usual (see keras.Model.fit API doc).
But the evaluation will return empty (for now) for TF-DF … until we fix this.
I was able to fit my RandomForest model; however, when I try to convert it into TFLite format it throws an error.
The error is: InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array.
Hi Krishnava, thanks for bringing that up.
Unfortunately TFLite does not yet implement TF-DF models. We definitely would like to implement that if we see more need. Please, if you don’t mind, create an “issue” in our GitHub repository for that, so we can track others who may be interested in a TFLite version.
In the short term, for very fast/cheap inference with pure decision forest models, consider doing inference using the TF-DF C++ library called Yggdrasil. There is an example that you can use to get started; it will read the TF-DF saved model that you trained in TensorFlow directly.
The Decision Forest models served in this fashion are often incredibly low-latency / low-cost. You can measure the serving speed without writing code using the benchmark inference tool.
G’luck!
Just as an update, as of release 0.1.4 passing validation_data (or other forms of validation input) to Model.fit() should lead to an evaluation at the end of the epoch, which is returned in the History object returned by Model.fit.
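For anyone who wants to read that evaluation back, here is a rough sketch. The helper names validation_metrics and fit_with_validation are made up for this example, and it assumes the usual Keras convention that validation entries in History.history carry a "val_" prefix; I have not checked that TF-DF names them exactly this way.

```python
def validation_metrics(history_dict):
    """Return the last recorded value of every validation metric.

    Assumes the usual Keras convention that validation entries in
    History.history are prefixed with "val_".
    """
    return {
        name: values[-1]
        for name, values in history_dict.items()
        if name.startswith("val_") and values
    }


def fit_with_validation(train_ds, valid_ds):
    """Sketch only: needs tensorflow_decision_forests >= 0.1.4."""
    import tensorflow_decision_forests as tfdf  # lazy import, optional dependency

    model = tfdf.keras.RandomForestModel()
    model.compile(metrics=["accuracy"])
    # Decision forests train for a single epoch, so at most one
    # validation evaluation is expected in the returned History.
    history = model.fit(train_ds, validation_data=valid_ds)
    return validation_metrics(history.history)
```

On older releases the dictionary would presumably come back empty, in which case calling model.evaluate() on the validation dataset remains the workaround.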
Okay, I’ll try again with better formatting this time:
Hey, it’s me again! Your input really helped and I just quickly wanted to run this by you to check that what I’m doing is as intended: I have one .csv that I split into training and testing, and another .csv that I want to use purely as a testing set to compare with the first. I started out as per your beginner tutorial with this:
model_1 = tfdf.keras.RandomForestModel(
    compute_oob_variable_importances=True,
    max_depth=20,
    num_trees=250,
)
model_1.compile(
    metrics=["accuracy", tf.keras.metrics.Recall(), tf.keras.metrics.Precision(), tf.keras.metrics.FalseNegatives(), tf.keras.metrics.FalsePositives()])
with sys_pipes():
    model_1.fit(x=train_ds)
and for the second pure test set did this:
test_ds_pd2 = dataset_df1
train_ds2 = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds2 = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd2, label=label)
model_2 = tfdf.keras.RandomForestModel(
    compute_oob_variable_importances=True,
    max_depth=20,
    num_trees=250,
)
model_2.compile(
    metrics=["accuracy", tf.keras.metrics.Recall(), tf.keras.metrics.Precision(), tf.keras.metrics.FalseNegatives(), tf.keras.metrics.FalsePositives()])
with sys_pipes():
    model_2.fit(x=train_ds)
Does that make sense? I apologize for this very basic question, I just want to make sure I got it right.
It makes sense. There are situations where having multiple test datasets, each with a different distribution, is useful :).
In your current formulation, two independent models are trained (model_1 and model_2), but neither of them is evaluated on a test dataset. Here is something closer to what you describe:
# We assume the setup from the beginner colab (https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)
train_ds_pd, test_ds_pd = split_dataset(...)
# In addition, here is the second dataset you mentioned: The "purely testing" set.
# Note: "test_ds_pd" is also a pure testing set.
pure_test_ds_pd = ...
# Train the model on the train split.
model = tfdf.keras.RandomForestModel()
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label))
# Add some metrics for the model evaluation.
model.compile(metrics=["accuracy", tf.keras.metrics.Recall(), tf.keras.metrics.Precision(), tf.keras.metrics.FalseNegatives(), tf.keras.metrics.FalsePositives()])
# Evaluate the model on the test split of the first dataset.
evaluation_on_test = model.evaluate(tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label))
# Evaluate the model on the second dataset i.e. "the pure test" one.
evaluation_on_pure_test = model.evaluate(tfdf.keras.pd_dataframe_to_tf_dataset(pure_test_ds_pd, label=label))
Right, for some reason I trained 2 models with the same parameters to then evaluate each set on one, instead of evaluating both on the same model… Thanks a bunch for your fast reply!
Sorry for the many basic questions, but my dataset contains some numerical values but also a lot of booleans (shown as 0 and 1), which are used as numeric features by default, is that an issue? If yes, how do I fix it? And if not, does it have any other implications (e.g. for loss)?
As you correctly noted, TF-DF detects boolean features as numerical ones.
There is no impact (good or bad) on the quality or inference speed of the model.
However, it will slightly impact the training speed of the model.
Yggdrasil Decision Forests (the core library behind TF-DF) supports boolean features natively, so they should be made available in TF-DF soon.
Ah thanks, hopefully the last one: how do I know which of my classes is the positive and which the negative class in binary classification? And can I change or specify this somehow (other than switching the labels in the dataset)?
This is a good question :).
Keras does not support string labels for classification. Therefore, labels should be provided as a positive integer.
The function pd_dataframe_to_tf_dataset runs an automatic string->integer conversion if the labels are stored as strings. If the label is already an integer, no mapping is applied. The string->integer mapping follows the lexicographic order (see code). This cannot be changed at the moment.
To obtain a model with a specific mapping, the simplest solution is to apply the desired mapping to the dataset manually before training, e.g. dataframe["label"] = dataframe["label"].map(my_mapping).
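As a small illustration of the manual mapping, here is a sketch on a made-up DataFrame. The "yes"/"no" labels and the automatic_mapping variable are assumptions for the example; automatic_mapping only mimics the lexicographic ordering described above and is not the library's own code.

```python
import pandas as pd

# Toy dataset with string labels (made up for this example).
dataframe = pd.DataFrame({"label": ["yes", "no", "no", "yes"]})

# What a lexicographic string->integer mapping would give:
# "no" -> 0 and "yes" -> 1.
classes = sorted(dataframe["label"].unique())
automatic_mapping = {name: index for index, name in enumerate(classes)}

# Desired mapping, chosen by hand: here "no" becomes class 1 instead.
my_mapping = {"yes": 0, "no": 1}
dataframe["label"] = dataframe["label"].map(my_mapping)
```

Since the label column now holds integers, no further automatic mapping should be applied when the DataFrame is converted with pd_dataframe_to_tf_dataset, so class 1 is whichever class you mapped to 1.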
Thanks for the code walkthrough at this link: Decision forests in TensorFlow | Session - YouTube. When accessing the hyperparameters for RF, is it possible to get access to the proximity matrix, which is generated from the similarity counts at the leaves? If there is an API to extract this matrix, that would be ideal. Thanks,
Vasanth