I have a huge dataset (1 TB) consisting of thousands of small HDF5 files, each containing two 3D NumPy arrays (float64 only). They are currently fetched by a generator that is passed to tf.data.Dataset.from_generator. Since I can't cache my data, the data-fetching process is quite slow. Now I want to use all my CPU cores to fetch from the dataset in parallel. Here is my code:
import h5py
import tensorflow as tf

def generator(files):
    for file in files:
        with h5py.File(file, 'r') as hf:
            epsilon = hf['epsilon'][()]
            field = hf['field'][()]
            yield epsilon, field

# s[0] and s[1] hold the shapes of the epsilon and field arrays
dataset = tf.data.Dataset.from_generator(
    generator, args=[files],
    output_signature=(tf.TensorSpec(shape=s[0], dtype=tf.float64),
                      tf.TensorSpec(shape=s[1], dtype=tf.float64)))
Is there a best practice or recommended solution for this problem?