Complex (NumPy) transformations possible in tf.data.Dataset pipeline?

Hi,

I want to use tf.data.Dataset as the main building block of my data pipeline for training a neural network with TensorFlow on time series data, ideally without resorting to custom dataloader classes.

Question: How do you perform processing that requires more than what TensorFlow can express, i.e. operations that require NumPy input, for example, and therefore cannot be integrated into the TensorFlow graph?

Example: Given time series data, I would like to resample the data to be able to use time series data from different sources in a single training dataset. How can that be achieved?

The reasoning behind pipeline-integrated transformations is that they only take around 10 min on the whole dataset that I use. Hence, I am happy to perform them prior to training instead of deriving a dedicated dataset once.

I am aware of similar questions here (like this). I am also aware of TensorFlow Transform and Keras preprocessing layers. Neither of those options allows, for example, interpolation. There is a TF implementation of interpolation, but unfortunately it only works on an equidistant grid. An interesting implementation of interpolation in TensorFlow is this one; however, I would much prefer to use existing implementations in SciPy or NumPy.

What is your workflow for implementing preprocessing steps that are easy with NumPy and the like when one-time performance is not crucial? Maybe using a custom dataloader is in fact easier than relying on tf.data.Dataset for those preprocessing steps?

Thanks and best wishes!

Hi @MaL,

Welcome to the TensorFlow Forum!

You can convert a NumPy dataset into the tf.data.Dataset format using either tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices().

import numpy as np
import tensorflow as tf

# Each element of the NumPy array becomes one element of the dataset.
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10000))
dataset

Then define a preprocessing function that meets your requirements and map that function over the entire dataset. Please refer to these resampling and time series techniques for handling different imbalanced datasets during preprocessing.
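For the NumPy-only case raised in the question, one option is to wrap the NumPy function with tf.numpy_function inside Dataset.map. The following is only a minimal sketch under assumptions: TARGET_LEN, resample_np, resample_tf and make_series are hypothetical names, the toy sine data is made up for illustration, and np.interp (linear interpolation) stands in for whatever SciPy or NumPy routine you actually need.

```python
import numpy as np
import tensorflow as tf

TARGET_LEN = 50  # hypothetical: number of samples on the common time grid


def resample_np(t, x):
    """Plain NumPy: linearly interpolate (t, x) onto a uniform time grid."""
    t_new = np.linspace(t[0], t[-1], TARGET_LEN)
    x_new = np.interp(t_new, t, x)
    return t_new.astype(np.float32), x_new.astype(np.float32)


def resample_tf(t, x):
    # tf.numpy_function calls resample_np with NumPy arrays from within the pipeline.
    t_new, x_new = tf.numpy_function(resample_np, [t, x], [tf.float32, tf.float32])
    # Static shapes are lost across the NumPy boundary, so restore them here.
    t_new.set_shape([TARGET_LEN])
    x_new.set_shape([TARGET_LEN])
    return t_new, x_new


def make_series():
    # Hypothetical toy data: irregularly sampled series of different lengths.
    rng = np.random.default_rng(0)
    for n in (30, 80, 120):
        t = np.sort(rng.uniform(0.0, 10.0, size=n)).astype(np.float32)
        yield t, np.sin(t).astype(np.float32)


dataset = tf.data.Dataset.from_generator(
    make_series,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)
# After resampling, all series share one length and can be batched together.
dataset = dataset.map(resample_tf, num_parallel_calls=tf.data.AUTOTUNE).batch(3)
```

Keep in mind that the wrapped function runs as ordinary Python, so it is not serialized with the graph and does not benefit from graph optimizations. For a one-off preprocessing cost of a few minutes, that trade-off may be acceptable.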
