Dataset of Datasets pipeline

In my project I have several different classes of videos, with thousands of videos in each class. The videos are too large to all be loaded into memory. My model loads batches of videos from each class and creates synthetic videos by applying a function to a batch of videos drawn from different classes. My problem is how to load the data efficiently.

I created a generator that samples some class ids, loads videos from each class, and creates the synthetic video. This works, but even with the recommended dataset pipeline, data loading and manipulation still take 90% of the training time. I also tried creating a separate dataset for each class and sampling from them in parallel with a zip dataset, which also works and is faster. However, I'm doing distributed training across multiple GPUs and therefore need a distributed dataset, so how can I create a distributed dataset of datasets? The goal is a dataset that takes in a batch of video batches and applies a function to that batch to produce a meta-batch.
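To make the question concrete, here is a minimal sketch of the pattern I mean (simplified, not my actual code): one tf.data pipeline per class, zipped so each element is a tuple of per-class batches, then distributed with `distribute_datasets_from_function`. The class count, batch sizes, the `videos/class_<id>/` file layout, and the stub `load_video` / `make_synthetic` functions below are all placeholders.

```python
import tensorflow as tf

NUM_CLASSES = 4        # placeholder: number of video classes
VIDEOS_PER_CLASS = 2   # placeholder: videos drawn per class for one synthetic sample
GLOBAL_BATCH = 8       # placeholder: global batch size across all replicas


def load_video(path):
    # Stub: real code would decode the file at `path` into frames.
    # Fixed shape keeps the sketch simple: [frames, height, width, channels].
    del path
    return tf.zeros([16, 64, 64, 3], tf.float32)


def make_synthetic(*class_batches):
    # Placeholder mixing function: combines one small batch per class
    # into a single synthetic video. Replace with the real op.
    return tf.reduce_mean(tf.concat(class_batches, axis=0), axis=0)


def per_class_dataset(class_id):
    # One independent pipeline per class; the file layout is an assumption.
    files = tf.data.Dataset.list_files(f"videos/class_{class_id}/*", shuffle=True)
    return (files
            .repeat()
            .map(load_video, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(VIDEOS_PER_CLASS))


def dataset_fn(input_context):
    # Zip the per-class datasets so each element is a tuple of per-class
    # batches, then build one synthetic video from each tuple.
    zipped = tf.data.Dataset.zip(
        tuple(per_class_dataset(c) for c in range(NUM_CLASSES)))
    per_replica_batch = input_context.get_per_replica_batch_size(GLOBAL_BATCH)
    return (zipped
            .shard(input_context.num_input_pipelines,
                   input_context.input_pipeline_id)
            .map(make_synthetic, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(per_replica_batch)
            .prefetch(tf.data.AUTOTUNE))


strategy = tf.distribute.MirroredStrategy()
# On TF < 2.4 this is strategy.experimental_distribute_datasets_from_function.
dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)
```

The distributed dataset is then iterated inside the usual `strategy.run` training step, one per-replica batch at a time.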


@Rafael_Rafailov have you considered doing the pre-processing offline? It sounds like your pipeline is all in tf.data. There may be an opportunity to trade space for time by computing the synthetic videos ahead of time and parallelizing that work with Apache Beam/Dataflow through TensorFlow Transform.
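As a rough sketch of that offline route (assuming the synthetic videos can be precomputed): a plain Apache Beam pipeline that combines each group of videos and writes the results as TFRecords, so the training-time tf.data pipeline only has to read them back. The group construction, `combine_group`, and the `"video"` feature name here are hypothetical, and TensorFlow Transform would replace the hand-rolled serialization if you need a full preprocessing graph.

```python
import apache_beam as beam
import numpy as np
import tensorflow as tf


def combine_group(videos):
    # Hypothetical mixing function; replace with the real synthetic-video op.
    return np.mean(np.stack(videos), axis=0)


def to_example(video):
    # Serialize one synthetic video into a tf.train.Example.
    feature = {
        "video": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[video.astype(np.float32).tobytes()]))
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()


def run(groups, output_prefix):
    # `groups` is an iterable of lists of videos (numpy arrays), one list per
    # synthetic sample; producing those groups is application-specific.
    with beam.Pipeline() as p:
        (p
         | "CreateGroups" >> beam.Create(groups)
         | "Combine" >> beam.Map(combine_group)
         | "ToExample" >> beam.Map(to_example)
         | "WriteTFRecord" >> beam.io.WriteToTFRecord(output_prefix))
```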