Recommended way to save/load data to/from disk as a tf.data.Dataset

It is unclear to me how TensorFlow recommends reading/writing data from/to disk and consuming it as a tf.data.Dataset.

Background: the data is too large to hold in memory, so how do I efficiently load it at training time to train the model?

It is unclear whether the recommended approach is to serialize examples in the TFRecord format or to manually create dataset shards and save them to disk. The TFRecord workflow seems quite convoluted, whereas manually creating many small datasets and saving each one to disk with dataset.save() seems much more straightforward.
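(For reference, the TFRecord route boils down to roughly the sketch below; the feature names and shapes are made up and it only serializes dummy data.)

```python
import numpy as np
import tensorflow as tf

# Writing: serialize each example into a TFRecord file.
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    for _ in range(1000):
        x = np.random.rand(32).astype("float32")   # dummy features
        y = np.random.randint(0, 2)                # dummy label
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(float_list=tf.train.FloatList(value=x)),
            "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y])),
        }))
        writer.write(example.SerializeToString())

# Reading: parse the records back into tensors.
feature_spec = {
    "x": tf.io.FixedLenFeature([32], tf.float32),
    "y": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    parsed = tf.io.parse_single_example(record, feature_spec)
    return parsed["x"], parsed["y"]

ds = tf.data.TFRecordDataset("data.tfrecord").map(parse)
```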

Currently, I am using the second approach (manual sharding) to deal with large datasets. That is, I create shards by saving many smaller datasets to disk, and at training time I use tf.data.Dataset.load() to load each one and concatenate them. The final dataset does not appear to be loaded into memory all at once; rather, the shards are read asynchronously during training.
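Roughly, the saving side of what I do looks like the sketch below, with dummy data standing in for the real chunks (on TF < 2.10 the same call is tf.data.experimental.save):

```python
import numpy as np
import tensorflow as tf

NUM_SHARDS = 10
EXAMPLES_PER_SHARD = 1000   # tiny numbers, just for illustration

for i in range(NUM_SHARDS):
    # In practice each chunk comes from one "small" source
    # (a dataframe, a .npy file, ...) loaded one at a time.
    x_i = np.random.rand(EXAMPLES_PER_SHARD, 32).astype("float32")
    y_i = np.random.randint(0, 2, size=EXAMPLES_PER_SHARD)
    ds_i = tf.data.Dataset.from_tensor_slices((x_i, y_i))
    ds_i.save(f"shards/shard_{i:05d}")   # each shard gets its own path on disk
```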

What other ways do people deal with datasets that cannot fit in memory?

Very interesting question, I have the same one.

But first I need to create a tf.data.Dataset, because I have thousands of relatively “small” dataframes stored as CSV files - more than 100 MB each… As I understand it, the recommendation is to merge them into one Dataset object to work with TF and Keras and to develop a NN for classification.
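From what I have read so far, one option might be to stream the CSV files directly with tf.data instead of merging them by hand; something like the sketch below, where the file pattern and label column are made up:

```python
import tensorflow as tf

# Stream many CSV files straight from disk, batching as it reads,
# without loading them all into memory first.
train_ds = tf.data.experimental.make_csv_dataset(
    file_pattern="data/*.csv",   # hypothetical location of the CSV files
    batch_size=256,
    label_name="target",         # hypothetical label column
    num_epochs=1,
    shuffle=True,
)
```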

Right… that is a similar pattern to what I am doing. The steps I take are as follows:

  1. Create a tf.data.Dataset for each of the “small” dataframes and save it to disk using tf.data.Dataset.save(path_i).
  2. Load each dataset using tf.data.Dataset.load(path_i) and concatenate them together. TF does not actually load each dataset into memory when calling .load(); rather, it reads the data dynamically as it is needed. So now you have a single dataset that points to the various datasets on disk.
  3. Loop through this single dataset during training; be sure to use prefetch() to help speed up the file I/O.

Is this something you have tried as well? I assume you cannot create a single dataset object on disk because it is too memory intensive. This seems to be the only workaround besides trying TFRecord.
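A rough sketch of steps 2 and 3, assuming the per-shard directories from step 1 were saved under shards/:

```python
import glob
import tensorflow as tf

shard_paths = sorted(glob.glob("shards/shard_*"))

# Step 2: load each shard lazily and chain them into one logical dataset.
full_ds = tf.data.Dataset.load(shard_paths[0])
for path in shard_paths[1:]:
    full_ds = full_ds.concatenate(tf.data.Dataset.load(path))

# Step 3: shuffle/batch/prefetch so file I/O overlaps with training.
train_ds = full_ds.shuffle(10_000).batch(256).prefetch(tf.data.AUTOTUNE)

# model.fit(train_ds, epochs=...)  # consumed like any other dataset
```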

I am a total newbie, I haven’t tried it yet, sorry. Here is the actual task I am trying to solve using TensorFlow: Neural Network design using Tensorflow and Keras - for signal processing and noised peaks detection

I am trying to plan the data-processing flow, because I have large amounts of measured data spread over many files. Different files contain different data - I mean different numbers of rows and different ranges (max and min values) for the feature variables.

As I understand it, preparing the dataset properly is one of the most important parts.
So I am trying to understand the options and patterns I have with TensorFlow.

If you have found some examples, please send me a link or some advice.

You also wrote:

Load each dataset using tf.data.Dataset.load(path_i) and concatenate them together. TF does not actually load each dataset into memory when calling .load(); rather, it reads the data dynamically as it is needed. So now you have a single dataset that points to the various datasets on disk.

So if I have 1000 “small” pandas dataframes, I can convert them to tf.data.Dataset objects, save each one separately, and then concatenate them one by one into a single large tf.data.Dataset object?
Is that the most common strategy?
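Something like this for each dataframe (the column names here are made up)?

```python
import pandas as pd
import tensorflow as tf

# One "small" dataframe (stand-in data; the real ones come from the CSV files).
df = pd.DataFrame({"x1": [0.1, 0.2, 0.3],
                   "x2": [1.0, 2.0, 3.0],
                   "label": [0, 1, 0]})

features = df[["x1", "x2"]].to_numpy(dtype="float32")
labels = df["label"].to_numpy()

ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds.save("saved/df_00000")   # later: tf.data.Dataset.load("saved/df_00000")
```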

In this case, is the resulting large tf.data.Dataset one file or a set of files? How is a tf.data.Dataset object stored on disk when it has a large number of rows?

I mean, if I concatenate all the files into one object on my workstation, how can I load that object on a server - just download a file? Is it a single file with a special format, or a set of files?

I’m wondering why you are breaking the dataset up into multiple parts, saving them, loading each of them, and then merging. Why not simply save one big dataset? When you call save(), does it load all the data into memory, while load() streams the data at each iteration?

Loading everything into RAM just in order to save it is already too much for many use cases, not to mention that it is not good practice. I think people are looking for more robust ways of handling dataset caching.

Indeed. The issue is that the entire dataset is > 1 TB, so it is impractical, and on most machines impossible, to load it all into RAM. Instead, a common procedure is to "shard", i.e. split, the dataset into smaller, more manageable files and then iterate over these "mini" datasets.

In practice, TensorFlow has some nice features that let you point at the files and have them read asynchronously, so you avoid waiting for whole files to be loaded into memory. Think of it as a buffer.
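One common form of this is interleaving reads across shard files; for TFRecord shards it looks roughly like the sketch below (the file names are made up):

```python
import tensorflow as tf

# Point at the shard files; tf.data opens and reads several of them on the
# fly instead of loading everything up front.
files = tf.data.Dataset.list_files("data/shard-*.tfrecord")
ds = (files
      .interleave(tf.data.TFRecordDataset,
                  cycle_length=8,
                  num_parallel_calls=tf.data.AUTOTUNE)
      .prefetch(tf.data.AUTOTUNE))   # keeps a buffer of records ready
```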

Unfortunately, there are some practical issues with this (Memory Leak with Tensorflow Experimental Save · Issue #56177 · tensorflow/tensorflow · GitHub), but there are workarounds. It would be nice if TensorFlow had better documentation for this use of datasets, as it is quite useful and practical.

Right, the loading side is well implemented. But people are well aware of that and want something more. The real problem is the lack of an efficient, one-line cache op implemented the state-of-the-art way: streaming into shards in constant memory. Having to write custom code somewhat undermines the usefulness of the dataset API.
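The kind of custom code I mean looks roughly like this sketch, where serialize_fn is a placeholder for turning one element into a serialized tf.train.Example:

```python
import tensorflow as tf

def cache_to_shards(ds, out_dir, serialize_fn, examples_per_shard=10_000):
    """Stream a dataset to disk in constant memory, rotating TFRecord shards."""
    writer = None
    shard_idx = 0
    count = 0
    for element in ds:   # one element in memory at a time
        if writer is None or count == examples_per_shard:
            if writer is not None:
                writer.close()
            writer = tf.io.TFRecordWriter(
                f"{out_dir}/shard-{shard_idx:05d}.tfrecord")
            shard_idx += 1
            count = 0
        writer.write(serialize_fn(element))  # placeholder serialization
        count += 1
    if writer is not None:
        writer.close()
```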
