It is unclear to me what TensorFlow's recommended way is to read/write data from/to disk and consume it as a tf.data.Dataset.
Background: the data is too large to store in memory, so how can I efficiently load it at run time to train the model?
It is unclear whether the recommended way is to use the TFRecord format to serialize examples, or instead to manually create dataset shards and save them to disk. The TFRecord process seems extremely convoluted, whereas manually creating many small datasets and then saving each to disk with dataset.save() seems much more straightforward.
Currently, I am doing the latter to deal with large datasets. That is, I create shards by saving many smaller datasets to disk, and then at train time I use tf.data.Dataset.load() to load each one and concatenate them. It appears the final dataset is not loaded into memory all at once, but rather asynchronously reads the dataset shards during training.
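The save side of the pattern I describe might look like this - a minimal sketch, assuming TF >= 2.10 (where Dataset.save()/Dataset.load() are stable API; older versions use tf.data.experimental.save/load). The shards/ paths, array shapes, and shard count are made up for illustration:

```python
import numpy as np
import tensorflow as tf

NUM_SHARDS = 4
for i in range(NUM_SHARDS):
    # Stand-in for one chunk of a dataset too large to hold in memory at once.
    features = np.random.rand(100, 8).astype("float32")
    labels = np.random.randint(0, 2, size=(100,)).astype("int32")
    shard = tf.data.Dataset.from_tensor_slices((features, labels))
    # Each save() writes a directory of files (shard data + metadata) on disk.
    shard.save(f"shards/shard_{i}")
```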
What other ways do people deal with datasets that cannot fit in memory?
Very interesting question, I have the same one.
But first I need to create a tf.data.Dataset, because I have thousands of relatively “small” dataframes stored in CSV - more than 100 MB each… As I understand it, it's recommended to merge them into one Dataset object to work with TF and Keras and to develop a NN for classification.
right… that is a similar pattern to what i am doing. the steps i take are as follows:
- create a tf.data.Dataset for each of the “small” dataframes and save it to disk using tf.data.Dataset.save(path_i)
- load each dataset using tf.data.Dataset.load(path_i) and concatenate them together. TF does not actually load each dataset into memory when using .load(), but rather dynamically loads it when it is needed. so now you have a single dataset that points to the various datasets on disk
- loop through this single dataset during training, and be sure to use prefetch() to help speed up the file i/o
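the steps above can be sketched like this (assuming TF >= 2.10; the demo_shards/ paths are hypothetical, and the tiny range datasets stand in for the real dataframes):

```python
import tensorflow as tf

# Write two tiny datasets to disk as stand-ins for the "small" dataframes.
for i in range(2):
    tf.data.Dataset.range(i * 10, (i + 1) * 10).save(f"demo_shards/shard_{i}")

# load() is lazy: elements are only read from disk when the dataset is iterated.
datasets = [tf.data.Dataset.load(f"demo_shards/shard_{i}") for i in range(2)]

# Chain the shards into a single dataset that points at the files on disk.
full = datasets[0]
for ds in datasets[1:]:
    full = full.concatenate(ds)

# prefetch() overlaps file I/O with training; AUTOTUNE picks the buffer size.
train_ds = full.batch(4).prefetch(tf.data.AUTOTUNE)
```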
is this something that you have tried as well? i assume you cannot create a single dataset object on disk because it is too memory intensive. this seems to be the only workaround besides trying TFRecord
I am a total newbie, I haven’t tried yet, sorry. Here is my actual task which I am trying to solve using tensorflow: Neural Network design using Tensorflow and Keras - for signal processing and noised peaks detection
I am trying to plan the data processing flow, because I have large amounts of measured data spread across many files. Different files contain different data - I mean different numbers of ‘rows’ and varying ranges (max and min values) for the feature variables.
As I understand it, preparing the dataset properly is one of the most important parts.
So I am trying to understand the options and patterns I have with Tensorflow.
If you have found some examples - please provide me with a link or advice.
You wrote also:
load each dataset using tf.data.Dataset.load(path_i) and concatenate them together. TF does not actually load each dataset into memory when using .load(), but rather dynamically loads it when it is needed. so now you have a single dataset that points to the various datasets on disk
So if I have 1000 “small” pandas dataframes, I can convert each to a tf.data.Dataset object, save them to separate files, and then concatenate them one by one into a single large tf.data.Dataset object?
Is that the most common strategy?
In this case, is the resulting large tf.data.Dataset one file or a set of files? How is the tf.data.Dataset object stored when there is a large number of rows?
I mean, if I concatenate all the files into one object on my workstation, how can I load that object on a server - just download a file? Is it a single file in a special format or a set of files?