How to use generators with tfdf.keras.pd_dataframe_to_tf_dataset

Hi all,
Novice question here. I have a very large dataset that I want to feed into a tfdf model. How can I couple tfdf.keras.pd_dataframe_to_tf_dataset() to a generator that feeds the data to it? Of course, if there are better ways to feed a generator into a tfdf dataset than using this method, I’d be interested to know.

Many thanks in advance,
Doug

Hi Doug,

tfdf.keras.pd_dataframe_to_tf_dataset is just a convenience method for when you already have a dataframe. If you are starting off with a generator, you can create a tf.data.Dataset directly from it with tf.data.Dataset.from_generator (see the tf.data.Dataset page in the TensorFlow Core v2.8.0 docs), which might be a better fit. TF-DF datasets are not special, and any dataset object created for another Keras model should work.
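To make that concrete, here is a minimal sketch of wrapping a generator with tf.data.Dataset.from_generator. The generator row_batches, the feature names "age" and "income", and the batch sizes are all made up for illustration, and it assumes your generator can yield (features, label) batches as NumPy arrays. Treat it as a starting point rather than the one right way to do it.

```python
import numpy as np
import tensorflow as tf


def row_batches(num_batches=3, batch_size=4):
    """Hypothetical generator: yields (features dict, labels) one batch at a time."""
    rng = np.random.default_rng(0)
    for _ in range(num_batches):
        features = {
            "age": rng.integers(18, 90, size=batch_size).astype(np.int64),
            "income": rng.random(batch_size).astype(np.float32),
        }
        labels = rng.integers(0, 2, size=batch_size).astype(np.int64)
        yield features, labels


# The signature must mirror the structure, dtypes and shapes the generator yields.
# shape=(None,) leaves the batch dimension flexible.
dataset = tf.data.Dataset.from_generator(
    row_batches,
    output_signature=(
        {
            "age": tf.TensorSpec(shape=(None,), dtype=tf.int64),
            "income": tf.TensorSpec(shape=(None,), dtype=tf.float32),
        },
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
    ),
)

# With TF-DF installed, the dataset can then be passed to a model, e.g.:
#   import tensorflow_decision_forests as tfdf
#   model = tfdf.keras.RandomForestModel()
#   model.fit(dataset)
```

The generator here yields whole batches, so no extra .batch() call is needed. If yours yields single rows instead, drop the leading None from the shapes and call dataset.batch(...) before fitting.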

There are the following caveats:

  1. Your dataset still needs to fit in memory at training time. If it does not, you can do something like dataset = dataset.take(10000000) to cap it at the first 10 million rows (the exact number will depend on the number of features and your memory capacity). Note that take counts dataset elements, so apply it before batching if you want to count rows.
  2. pd_dataframe_to_tf_dataset performs a few data sanitization steps, such as making sure feature names don’t contain spaces or other forbidden characters. You might want to consult the source code and copy that logic as appropriate.
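For the second caveat, a rough sketch of renaming features inside the generator pipeline could look like the following. The FORBIDDEN_CHARACTERS set and the helper names are assumptions for illustration only, not copied from TF-DF, so check the library source for the actual list before relying on it.

```python
# Assumption: illustrative set of characters to replace; the real list lives in the TF-DF source.
FORBIDDEN_CHARACTERS = " \t?%,"


def sanitize_feature_name(name, replacement="_"):
    """Replace every forbidden character in a feature name."""
    for ch in FORBIDDEN_CHARACTERS:
        name = name.replace(ch, replacement)
    return name


def sanitize_features(features):
    """Return a copy of a feature dict with sanitized keys."""
    sanitized = {}
    for name, value in features.items():
        new_name = sanitize_feature_name(name)
        if new_name in sanitized:
            # Two different names collapsed to the same sanitized name.
            raise ValueError(f"Duplicate feature name after sanitizing: {new_name!r}")
        sanitized[new_name] = value
    return sanitized


def sanitized_rows(rows):
    """Wrap an iterable of (features, label) pairs and fix the feature names lazily."""
    for features, label in rows:
        yield sanitize_features(features), label
```

You would wrap your own generator with sanitized_rows(...) before handing it to tf.data, and use the sanitized names in the output signature as well.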

Hope that helps!
Arvind


Hi @Arvind, many thanks for the helpful comment!