Parallel data extraction with tf.data.Dataset.from_generator

I have a huge dataset (1 TB) of thousands of small HDF5 files, each consisting of two 3D NumPy arrays (float64 only). They are currently fetched by a generator that is passed to tf.data.Dataset.from_generator. Since I can't cache my data, the data fetching process is quite slow. Now I want to use all my CPUs and fetch from the dataset in parallel. Here is my code:

import h5py
import tensorflow as tf

def generator(files):
    for file in files:
        with h5py.File(file, 'r') as hf:
            epsilon = hf['epsilon'][()]
            field = hf['field'][()]
        yield epsilon, field

# s[0] and s[1] hold the shapes of the two arrays
dataset = tf.data.Dataset.from_generator(
    generator, args=[files],
    output_signature=(
        tf.TensorSpec(shape=s[0], dtype=tf.float64),
        tf.TensorSpec(shape=s[1], dtype=tf.float64)))

Is there a best practice or standard solution to this problem?

Hi @munsteraner, you can consider using tf.data.Dataset.prefetch: it allows later elements to be prepared while the current element is being processed. Thank you.
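
For illustration, a minimal sketch of how prefetch could be chained onto the dataset from the question (it assumes tf.data.AUTOTUNE is available, i.e. TF 2.4 or later, and lets tf.data choose the buffer size):

import tensorflow as tf

# dataset is built with from_generator as in the question above.
# prefetch prepares upcoming elements in the background while the
# current element is being consumed.
dataset = dataset.prefetch(tf.data.AUTOTUNE)

for epsilon, field in dataset:
    pass  # training step runs while the next element is fetched

Note that prefetch overlaps producing and consuming elements; it does not by itself parallelize the file reads inside a single generator.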