Possible to interleave reading files when using `tf.data`?

I have the following data reading utility:

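import tensorflow as tf

def read_files(image_path, text_path):
    # Load and decode the image, then resize it to IMG_SZ (defined elsewhere).
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, IMG_SZ, antialias=True)

    # Read the text file as a scalar tf.string tensor. Inside Dataset.map the
    # value is symbolic, so tf.compat.as_str_any() would stringify the tensor
    # object rather than its contents; decode to str after iterating instead.
    text = tf.io.read_file(text_path)
    return image, text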

This is how I am constructing the dataset:

dataset = tf.data.Dataset.zip((image_ds, text_ds))
dataset = dataset.map(read_files, num_parallel_calls=AUTO).cache()

Is there a way to interleave the read_files() function?

Do you mean: tf.data.Dataset.zip((image_ds.map(read_img), text_ds.map(read_text)))?
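For reference, here is a minimal sketch of that map-before-zip variant. The bodies of read_img and read_text are assumptions (the thread only names them), IMG_SZ is an assumed value, and the temporary files exist only so the snippet runs on its own.

```python
import os
import tempfile

import tensorflow as tf

IMG_SZ = (32, 32)  # assumed target size; the thread does not give IMG_SZ


def read_img(image_path):
    # Hypothetical helper: the image half of read_files().
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    return tf.image.resize(image, IMG_SZ, antialias=True)


def read_text(text_path):
    # Hypothetical helper: the text half; returns a scalar tf.string tensor.
    return tf.io.read_file(text_path)


# Write two tiny JPEG/text pairs so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
image_paths, text_paths = [], []
for i in range(2):
    img_path = os.path.join(tmp_dir, f"{i}.jpg")
    txt_path = os.path.join(tmp_dir, f"{i}.txt")
    pixels = tf.zeros([8, 8, 3], dtype=tf.uint8)
    tf.io.write_file(img_path, tf.io.encode_jpeg(pixels))
    tf.io.write_file(txt_path, f"caption {i}")
    image_paths.append(img_path)
    text_paths.append(txt_path)

image_ds = tf.data.Dataset.from_tensor_slices(image_paths)
text_ds = tf.data.Dataset.from_tensor_slices(text_paths)

# Map each path dataset separately, then zip the results into pairs.
AUTO = tf.data.AUTOTUNE
dataset = tf.data.Dataset.zip((
    image_ds.map(read_img, num_parallel_calls=AUTO),
    text_ds.map(read_text, num_parallel_calls=AUTO),
))

examples = [
    (tuple(image.shape.as_list()), text.numpy()) for image, text in dataset
]
```

Each element is still an (image, text) pair, so downstream code should not need to change.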

No, I meant using the actual interleave() method. I guess it’s probably not doable in this case, since the map_func passed to interleave() is supposed to return a Dataset?
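To illustrate that constraint, here is a small sketch of how interleave() is typically used: map_func turns each input element into its own Dataset, and interleave() cycles through those datasets. The file names and contents are made up for the example.

```python
import os
import tempfile

import tensorflow as tf

# Create two small text files (hypothetical contents) to read from.
tmp_dir = tempfile.mkdtemp()
paths = []
for name in ("a", "b"):
    path = os.path.join(tmp_dir, f"{name}.txt")
    with open(path, "w") as f:
        f.write(f"{name}1\n{name}2\n")
    paths.append(path)

path_ds = tf.data.Dataset.from_tensor_slices(paths)

# map_func returns a Dataset (one TextLineDataset per file); interleave()
# then takes block_length elements from each of cycle_length datasets in turn.
lines_ds = path_ds.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=2,
    block_length=1,
)

lines = [line.numpy().decode() for line in lines_ds]
```

A function like read_files(), which returns tensors rather than a Dataset, does not fit this contract as written.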

Why? What problem are you trying to solve?

Did you know that you can pass datasets through datasets?

datasets = [a, b, c]
meta = tf.data.Dataset.from_tensor_slices(datasets)
merged = meta.interleave(lambda x: x)

If you did that with your image_ds and text_ds, you’d have a dataset that alternates between image and text paths… but why? A dataset needs to have a single spec, so you can’t load the files in each dataset and then interleave the results.
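As a rough sketch of that alternation, the snippet below interleaves two lists of path strings. Note that it stacks the paths into one string tensor and builds a dataset per row, which is a variation on the snippet above rather than the same call; the path names are hypothetical and nothing is read from disk.

```python
import tensorflow as tf

# Hypothetical paths; both lists hold strings, so they share one spec.
image_paths = ["img_0.jpg", "img_1.jpg"]
text_paths = ["txt_0.txt", "txt_1.txt"]

# One row per source: row 0 holds image paths, row 1 holds text paths.
meta = tf.data.Dataset.from_tensor_slices([image_paths, text_paths])

# Turn each row into its own dataset and take one element from each in turn.
merged = meta.interleave(
    lambda row: tf.data.Dataset.from_tensor_slices(row),
    cycle_length=2,
    block_length=1,
)

order = [path.numpy().decode() for path in merged]
```

This only works because every element is a scalar string. Decoded images and raw text would have different dtypes and shapes, which is the single-spec problem described above.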

I can think of a few solutions, but they seem worse than the map-before-zip approach above:

  1. Batch the alternating paths into pairs and map over them: .batch(2).map(read_files)
  2. Have both read_img and read_text return (image, text) pairs, where the first axis of text has length 0 for read_img and the first axis of image has length 0 for read_text.
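A minimal sketch of the first option, assuming the paths already alternate strictly as image, text, image, text. IMG_SZ and the temporary files are assumptions made so the snippet runs on its own, and the lambda unpacks each batch of two paths before calling read_files.

```python
import os
import tempfile

import tensorflow as tf

IMG_SZ = (32, 32)  # assumed target size; the thread does not give IMG_SZ


def read_files(image_path, text_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, IMG_SZ, antialias=True)
    text = tf.io.read_file(text_path)  # scalar tf.string tensor
    return image, text


# Build a strictly alternating list: image path, text path, image path, ...
tmp_dir = tempfile.mkdtemp()
alternating = []
for i in range(2):
    img_path = os.path.join(tmp_dir, f"{i}.jpg")
    txt_path = os.path.join(tmp_dir, f"{i}.txt")
    pixels = tf.zeros([8, 8, 3], dtype=tf.uint8)
    tf.io.write_file(img_path, tf.io.encode_jpeg(pixels))
    tf.io.write_file(txt_path, f"caption {i}")
    alternating += [img_path, txt_path]

paths_ds = tf.data.Dataset.from_tensor_slices(alternating)

# Each batch is a length-2 string tensor: [image_path, text_path].
dataset = paths_ds.batch(2, drop_remainder=True).map(
    lambda pair: read_files(pair[0], pair[1])
)

results = [
    (tuple(image.shape.as_list()), text.numpy()) for image, text in dataset
]
```

The pairing depends entirely on element order, so any shuffle would have to happen after the batch(2) step, which is one reason this seems more fragile than zipping.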

I wanted to know if it’s possible to interleave the reading of files. I am probably conceptually mistaken. I hope that’s not blasphemy.

So in your opinion, tf.data.Dataset.zip((image_ds.map(read_img), text_ds.map(read_text))) would be more efficient than what I am already doing?

> interleave the reading of files

You mean run the two reads in parallel?

I think even in your initial implementation, TensorFlow is supposed to notice that the image and text branches are independent and execute them in parallel.

I see. If that’s the case, then all’s well I guess.