Decision Forest - Random Forests RAM issues

Dear TFDF developer(s),
I am happy to be trying out the TFDF package, but I am having some RAM issues. I am not using Colab. I first tested it on the Penguins data, as suggested in the TensorFlow Blog post “Introducing TensorFlow Decision Forests”, with no issues. I also did a little research and found that the defaults are num_trees=300 and max_depth=16, I believe from help(tfdf.keras.RandomForestModel).
Then I moved up to what for me is a “middle of the road” size-wise sparse data matrix of 50k rows and 22k columns, using the same loading method as in the blog and deleting data frames as I went. I am using a 32GB RAM Ubuntu 18.04 instance and I track RAM using top. I stuck with the default settings and watched the available RAM evaporate until the process was finally “Killed”. That was not unexpected, as the available RAM was heading to 0 and presumably this is a safeguard.
I then built the same model using Sklearn, 300 trees & max_depth = None, and RAM usage maxed out at about 5.4 GB. I then tried setting TFDF sorting_strategy=IN_NODE and again watched the RAM increase to about 50% until I got a NameError: ‘IN_NODE’ is not defined. Regardless, the memory usage was already at ~16GB.
A six-fold increase in RAM seems out-of-bounds so I am wondering if there’s a memory issue that hasn’t been exposed yet or if perhaps I am doing something wrong. I am using the default settings and following the loading process from the blog, so I think I am doing things correctly. Please advise.
thanks,
Dan711

I obviously misstated when I said “I then built the same model using Sklearn, 300 trees & max_depth = None”. I should have said “I then built a similar model”. My mistake, but I think the content remains valid.
thanks again,
Dan711

Hi Dan,

Thanks for the setup description :) and the bug report. 22k features is not something we have tried in TF-DF before, but it should be interesting, and I’ll keep you posted.

The raw dataset takes 50k x 22k x 4 bytes/value = 4.4GB of memory (assuming you have numerical features, and with the IN_NODE strategy). So this aspect should be fine. Off the top of my head, the likely culprit is the TensorFlow graph: TF-DF will instantiate one TF Operation for each of the features. Since TF Ops have a non-negligible cost, this could be the issue. Training with Yggdrasil DF directly is an easy way to check.
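
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it counts the raw feature values only, not the TensorFlow graph or any other overhead. The 1-byte case is included purely for comparison.

```python
def raw_dataset_bytes(num_rows: int, num_features: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one value per (row, feature) cell."""
    return num_rows * num_features * bytes_per_value


rows, features = 50_000, 22_000

as_4_bytes = raw_dataset_bytes(rows, features, 4)  # 4 bytes per value, as assumed above
as_1_byte = raw_dataset_bytes(rows, features, 1)   # for comparison only

print(f"4 bytes/value: {as_4_bytes / 1e9:.1f} GB")  # 4.4 GB
print(f"1 byte/value:  {as_1_byte / 1e9:.1f} GB")   # 1.1 GB
```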

Hi Mathieu,
The way I am storing the ints should only require 1 byte per feature, uint8, but I will double check. I couldn’t get the TFDF sorting_strategy=IN_NODE to work, I was watching the RAM increase to about 50% until I got a NameError: ‘IN_NODE’ is not defined. Can you please give me the correct way to call it?

I was also wondering about the TF graph, which I don’t completely understand, but then wouldn’t many layer DNNs blow up similarly? Is there any way to reduce the size of this graph, other than reducing the feature set?

thanks,
Dan

Hi Dan,

“The way I am storing the ints should only require 1 byte per feature, uint8,”

During training, the features are stored in memory. This internal storage representation depends on the feature semantics, not on how the features are fed. For example, numerical features are stored as floats (4 bytes), booleans are stored as bytes (1 byte), and pre-discretized numerical features are stored as int16 (2 bytes).

“NameError: ‘IN_NODE’ is not defined”

This looks like a python argument error :).

Make sure to call:

model = tfdf.keras.RandomForestModel(sorting_strategy="IN_NODE")

instead of

model = tfdf.keras.RandomForestModel(sorting_strategy=IN_NODE)
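
The difference between the two calls is plain Python and does not depend on TF-DF. Here is a minimal sketch using a made-up stand-in function (`make_model` is hypothetical, not part of any library), so it runs without TensorFlow installed:

```python
def make_model(sorting_strategy: str) -> dict:
    """Hypothetical stand-in for a model constructor; it only records the option."""
    return {"sorting_strategy": sorting_strategy}


# A quoted value is a string literal, so this works.
model = make_model(sorting_strategy="IN_NODE")
print(model)

# An unquoted IN_NODE is a variable lookup. No such variable exists, so Python
# raises NameError while evaluating the argument, before the function is called.
try:
    make_model(sorting_strategy=IN_NODE)
except NameError as err:
    error_message = str(err)
    print("NameError:", error_message)
```

Because the name is only looked up when that line executes, anything earlier in the script (such as loading the data) will already have run by the time the error appears.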

“I was also wondering about the TF graph, which I don’t completely understand, but then wouldn’t many layer DNNs blow up similarly?”

This is a limitation of the current TF-DF code (which calls Yggdrasil DF underneath). I am creating a feature request for it.

Thanks for all your help Mathieu! When you say pre-discretized, I think of binning, is that what you mean? I am wondering how I can leverage this information about how features are stored in memory to reduce my footprint. Now that you created the feature request I will sit tight for a bit, I will definitely explore further if that RAM limitation can be overcome.

“When you say pre-discretized, I think of binning, is that what you mean?”

Exactly. Instead of requiring a floating point value (4 bytes) and possibly an index (also 4 bytes) for each numerical value, a pre-discretized numerical value requires 2 bytes.

However, as with the boolean features (1 byte per value), the pre-discretized numericals are not yet available in the TF-DF wrapper (only in Yggdrasil). I.e., this would be a feature request :).
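
To illustrate the idea of pre-discretization (binning) and the 4-byte versus 2-byte storage, here is a small standard-library sketch. It is not how Yggdrasil discretizes features internally. The function names and the choice of quantile-style cut points are assumptions made for the example.

```python
from array import array
from bisect import bisect_right


def quantile_boundaries(values, num_bins):
    """Return num_bins - 1 cut points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // num_bins] for i in range(1, num_bins)]


def discretize(values, boundaries):
    """Map each value to the index of its bin, stored as 2-byte signed integers."""
    return array("h", (bisect_right(boundaries, v) for v in values))


# "f" is a 4-byte float, "h" is a 2-byte signed integer.
raw = array("f", [0.1, 2.5, 3.7, 0.4, 9.9, 5.0, 7.2, 1.1])
boundaries = quantile_boundaries(raw, num_bins=4)
binned = discretize(raw, boundaries)

print(list(binned))
print(raw.itemsize, "bytes/value ->", binned.itemsize, "bytes/value")
```

The bin indices lose the exact feature values but keep their ordering, which is what a tree split on a threshold needs.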

Thanks for the response Mathieu, and sorry for the slow reply, I was on a little vacation. Originally I was hoping to test the speed of TF-DF, hoping it would be fast. Has anyone done any sort of “bakeoff”, i.e. timing tests? This was a goal I had when I started experimenting.

This time Mathieu is the one on vacation :)

We did very informal/quick comparative benchmarks (*), not worth publishing – it’s always very problem dependent anyway. I’m curious to hear what you get.

Are you going to benchmark training time or inference time?

If you care a lot about inference latency/speed, consider using Yggdrasil directly (the C++ API; it’s pretty simple and it can read models trained in TF), since TF, for all its flexibility, adds some significant overhead.

Btw, to make it easy (or just as an example), see our benchmark_inference command line tool.

(*) Our impression is that training time is good (not awesome), and inference time is well optimized (at least with some of the inference engines).
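
For a rough training-time comparison, a small harness like the one below may be enough. It is a sketch under assumptions: `measure` and `toy_training_job` are made-up names, the workload is a placeholder for the real training call, and the `resource` module is only available on Unix-like systems.

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed seconds, peak RSS of this process so far)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark of the whole process:
    # kilobytes on Linux, bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss


def toy_training_job(n):
    """Placeholder workload; replace with the real training call."""
    return sum(i * i for i in range(n))


result, seconds, peak_kb = measure(toy_training_job, 1_000_000)
print(f"elapsed: {seconds:.3f} s, peak RSS: {peak_kb / 1024:.0f} MB (assuming Linux units)")
```

Since the peak RSS covers everything the process has allocated so far, including data loading, it is fairer to run each library in a fresh process.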

Hi Jan,
I was going to benchmark training time. For most of my use cases inference time is less of a problem, but still important. I was hoping to just leverage the TF-DF wrapper and attain comparable or better training times than sklearn and go on from there. However, the memory issues are a bit of a showstopper for me, as I deal with high dimensional data and don’t necessarily want to perform feature reduction (hence the desire for RF). Anyway, everything is a work in progress. I will look at the inference benchmark tool as you suggested.
thanks,
Dan

hi Dan,

I’m sorry about the memory issue in your case. It is very likely due to the high cost of TF Ops: we create one per feature, and that breaks when you have tens of thousands of features. It is on our list of things to fix (it should be fixable), but it will have to wait for Mathieu to be back.

In the short term, there is the Yggdrasil command line trainer, which won’t suffer from this, but with the disadvantage that you will not be in TF. Another TODO for us is to create a converter from Yggdrasil models to TF SavedModel, so they can easily be imported back into TF. Internally we have that, but it needs some updates before OSSing. Hopefully we can get this done this quarter.

Apologies for not being able to fix this faster …

Note: DNNs usually do everything in fewer (but “fatter”) TF Ops that do giant matrix multiplications (so they take O(num_layers) TF Ops). If a DNN had tens of thousands of layers, it would likely suffer from the same issue.

Hi Jan,
Thanks for your response. RE: “the disadvantage that you will not be in TF”, I am not well versed in all the advantages TF has to offer, only some that come with DNNs and Keras, and then a very little of how it might leverage Colab for XAI purposes. The latter is not necessarily great for me, as I am not sure about using Colab and data confidentiality. Regardless, your comment intrigues me, as it leads me to believe there’s a lot I might be missing. Is there a good blog series or tutorial you might recommend so I can fully leverage TF?
thanks,
Dan

hi Dan, about the TF advantages, mostly the simple/obvious things:

  • Well integrated with colab/notebook.
  • Easy to train various different types of models all in one platform (and using the same evaluation), then you can just choose the best.
  • Easy to leverage pre-trained models for text/image/object detection/etc on the input (see tensorflow.org/hub) – one can use the output of these pre-trained models as the input of TF-DF models.

When it comes to productionization, it also works with some (but not yet all) tools of TFX: see tensorflow.org/tfx – things like data validation.

If one is dedicated, one can always stitch these things without using TF-DF, it just makes it easier.

Btw, about data confidentiality: colab.research.google.com also works with your own kernel, running locally or remotely, so you can be sure to preserve the confidentiality of your data. I often do that when developing with colab, since I can reload python files I’m editing locally. The same works with the Jupyter notebook front-end; I just like colab’s interface better.

cheers,
Jan

Thanks Jan, I was unaware that colab could be run locally, I will have to look into that. It could open up some possibilities for me. Question: how do I track the potential modifications that have been discussed? I see there are 439 items tagged as help requests and I am sure you all are busy, so I would just like to step back and revisit when the issue may have been addressed. Thanks!

To track the issue, it would be best to create an issue on the TF-DF GitHub – we’ve been using GitHub to track those.

On the 439 help requests (and growing), those are for the whole of TensorFlow (and its many sub-projects) … we (as in the TF-DF team) definitely wouldn’t be able to handle that many questions/issues :)

Thanks Jan, I will do that now that I see you have released version 2.6. Congrats!