How to speed up an input pipeline beyond vectorization and num_parallel_calls?


I have 300M samples of 192 one-hot features each, packed as 24 8-bit integers per sample (24 × 8 = 192) and residing in memory. I have defined a function to unpack them:

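import tensorflow as tf

def unpackbits_tf(features, labels):
    # Bit masks for one byte, most significant bit first.
    mask = tf.constant([128, 64, 32, 16, 8, 4, 2, 1], dtype=features.dtype)
    expanded_features = tf.expand_dims(features, -1)
    unpacked = tf.cast(tf.bitwise.bitwise_and(expanded_features, mask) > 0, tf.uint8)
    # Flatten the per-byte bits into one row of shape[1] * 8 values per sample.
    return tf.reshape(unpacked, [-1, features.shape[1] * 8]), labels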
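
To make the intended bit order explicit, here is a minimal pure-Python reference of the same unpacking for a single sample. It is only a sketch for checking results on small inputs, not part of the pipeline; the helper name `unpack_row` is hypothetical.

```python
# Pure-Python reference for the bit unpacking (illustration only).
# `unpack_row` is a hypothetical helper, not a TensorFlow function.
MASK = [128, 64, 32, 16, 8, 4, 2, 1]  # same mask as unpackbits_tf, MSB first


def unpack_row(packed_row):
    """Expand each byte (0..255) into 8 bits, most significant bit first."""
    return [1 if byte & m else 0 for byte in packed_row for m in MASK]


# Two packed bytes become 16 bits.
example = unpack_row([0b10000000, 0b00000101])
print(example)
```

A packed row of 24 bytes yields 192 values, which matches the feature width described above.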

And I have defined my input pipeline as:

with tf.device("CPU"):
    training_dataset =, training_labels)).batch(BATCH_SIZE).map(unpackbits_tf, num_parallel_calls =
    validation_dataset =, validation_labels)).batch(BATCH_SIZE).map(unpackbits_tf, num_parallel_calls =

The TensorFlow profiler (not surprisingly) flagged the input pipeline as the bottleneck, but vectorizing (placing the map call after the batch call) and setting num_parallel_calls have no effect at all. With a small BATCH_SIZE an epoch takes about 2 hours. As BATCH_SIZE increases, the epoch time settles at about 5 minutes, and that figure does not change whether or not I use vectorization and/or num_parallel_calls.

cache has no effect (the packed data already resides in memory), and if I use prefetch the kernel crashes, I guess because the unpacked data does not fit in memory.

Why do vectorization and num_parallel_calls have no effect? Is there anything else I can try to speed up the input pipeline?


Hi GW,

a couple of things caught my eye:

  • tf.data pipelines normally run on the CPU, so I don’t think there’s any need for explicit device placement. That might be a reason for the prefetch crash.
  • Operating on batches instead of samples should give a significant speedup (at least in theory), but remember that your map function must then operate on batches as well. There’s more to it than just placing the .batch call before the .map one: unpackbits_tf has to work correctly on tensors containing a full batch.
  • Caching the dataset should also help for every epoch after the first if you place the cache after the map call, since you then cache the results of the unpacking during the first epoch and reuse them directly in later epochs.
  • Other than that, an alternative might be to perform the unpacking offline before training.

Good luck,

Hi Stephen,

Thank you for your suggestions!

Regarding pipeline placement on the CPU: see “Getting memory error when training a larger dataset on the GPU”.
Regarding operating on batches instead of samples: unpackbits_tf also works on tensors containing a full batch.
Regarding caching the dataset: caching seems to try to hold the entire unpacked dataset, which does not fit in memory.
Regarding unpacking offline before training: I started with the unpacked dataset as a CSV file, but training from CSV files was VERY slow, which is why I switched to a packed dataset that can be loaded completely into memory. It does, of course, have to be unpacked for training. Meanwhile I also tried converting the CSV file to TFRecord or Parquet files and using those instead, but in both cases you need a ‘map’ function to use the data during training, and that slows things down again.
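
As a rough back-of-the-envelope check on the caching point above, the sizes involved can be estimated as follows. This is only a sketch under stated assumptions: it takes "300M" to mean exactly 300,000,000 samples, ignores the labels, and treats the float32 case as a hypothetical in case the features are cast before they reach the model.

```python
# Rough memory estimate for the packed and unpacked feature data.
# Assumptions (not from the thread): exactly 300,000,000 samples, labels ignored.
N_SAMPLES = 300_000_000
PACKED_BYTES_PER_SAMPLE = 24                      # 24 packed bytes per sample
BITS_PER_SAMPLE = PACKED_BYTES_PER_SAMPLE * 8     # 192 one-hot features


def gigabytes(n_bytes):
    """Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


packed_gb = gigabytes(N_SAMPLES * PACKED_BYTES_PER_SAMPLE)
unpacked_uint8_gb = gigabytes(N_SAMPLES * BITS_PER_SAMPLE * 1)    # 1 byte per value
unpacked_float32_gb = gigabytes(N_SAMPLES * BITS_PER_SAMPLE * 4)  # hypothetical cast

print(f"packed:             {packed_gb:.1f} GB")
print(f"unpacked (uint8):   {unpacked_uint8_gb:.1f} GB")
print(f"unpacked (float32): {unpacked_float32_gb:.1f} GB")
```

Under these assumptions the unpacked uint8 data is 8 times the size of the packed data, which is consistent with the packed set fitting in memory while a cache placed after the unpacking map does not.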