TF Dataset.window() not returning useful Dataset objects

Hi knowledgable people,

I have one-hot encoded biological data of different lengths as features and a simple float as the label. Because of the length differences I'm building a ragged tensor from my feature array and combining this tensor and the label numpy array into one dataset object.

X = tf.ragged.constant(X_np, dtype=tf.int8, ragged_rank=1, row_splits_dtype=tf.int32)
train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
train_dataset.element_spec
(TensorSpec(shape=(None, 4), dtype=tf.int8, name=None),
 TensorSpec(shape=(), dtype=tf.float64, name=None))

I want to use this dataset for a k-fold cross-validation run. My idea was to use the dataset.window() method to split the dataset into multiple pieces, use one as the validation set, concatenate the others to form the training set, and repeat k times. The documentation states that .window() returns a dataset of datasets that one can loop over. The simple example given works like a charm, but with my own data it does not, and so far I can't figure out why.

This code creates the pieces, but trying to inspect the element_spec or to access the dataset method .concatenate results in an error.

for w in train_dataset.window(math.ceil(len(train_dataset) / num_splits)):
    print(w)
    w.element_spec
(<_VariantDataset element_spec=TensorSpec(shape=(None, 4), dtype=tf.int8, name=None)>, <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.float64, name=None)>)
...
AttributeError: 'tuple' object has no attribute 'element_spec'

The error message is clear: a tuple doesn't have an element_spec attribute. But why is it a tuple and not the expected dataset object?

I'm probably overlooking something simple, or maybe I'm approaching this problem from the wrong angle altogether, so I would highly appreciate it if somebody could give me some pointers or point me in a direction that works.

THANKS!
Richard

PS: I'm using a dataset because, due to the inhomogeneous feature array, I need the padded_batch function for model training. I could of course use numpy arrays up to that point; however, building the ragged tensor from the numpy array is a very time-consuming task, so if possible I would prefer to do it only once.

Hi @Richard_Neuboeck, if you create a dataset with

dataset = tf.data.Dataset.from_tensor_slices((x,y))

then the objects yielded by .window() will be of type tuple:

for w in dataset.window(3):
  print(type(w))
#output
<class 'tuple'>
<class 'tuple'>

If you create the dataset using

dataset = tf.data.Dataset.from_tensor_slices([x,y])

then the objects yielded by .window() will be of type <class 'tensorflow.python.data.ops.dataset_ops._VariantDataset'>:

for w in dataset.window(3):
  print(type(w))
#output
<class 'tensorflow.python.data.ops.dataset_ops._VariantDataset'>

If the window object is of type <class 'tensorflow.python.data.ops.dataset_ops._VariantDataset'>,
then you can use w.element_spec.
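Even in the tuple case you can still read each component's element_spec, because each element of the tuple is itself a dataset. A small sketch with toy tensors:

```python
import tensorflow as tf

# A dataset built from a tuple, mirroring the original setup:
# .window() then yields one tuple of _VariantDataset objects per window.
x = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]])
y = tf.constant([0.1, 0.2, 0.3, 0.4], dtype=tf.float64)
dataset = tf.data.Dataset.from_tensor_slices((x, y))

for xs, ys in dataset.window(2):
    # Unpacking the tuple gives two datasets, each with its own spec.
    print(xs.element_spec)  # TensorSpec(shape=(2,), dtype=tf.int32, ...)
    print(ys.element_spec)  # TensorSpec(shape=(), dtype=tf.float64, ...)
```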

Please refer to this gist for a code example. Thank you!

Hi @Kiran_Sai_Ramineni,

thank you very much for your feedback!

The approach of supplying .from_tensor_slices() with a list, .from_tensor_slices([X, y]), instead of a tuple fails with the error "TypeError: object of type 'RaggedTensor' has no len()".

I can't change the fact that this tensor is ragged due to the inhomogeneous data. But digging a bit deeper I found a workaround: I take the tuple returned from .window() and zip it into a new dataset, which I can then use to call .padded_batch.

for w in train_all_dataset.window(math.ceil(len(train_all_dataset) / num_splits)):
    print(w)
    print(type(w))
    tmp = tf.data.Dataset.zip(w)
    print(type(tmp))
    tmp = tmp.padded_batch(batch_size)

Output:

(<_VariantDataset element_spec=TensorSpec(shape=(None, 4), dtype=tf.int8, name=None)>, <_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.float64, name=None)>)
<class 'tuple'>
<class 'tensorflow.python.data.ops.zip_op._ZipDataset'>

I'm not sure if that's the "correct" way to do this, and it seems ugly, but if no other issues come up it looks to be more efficient than rebuilding the ragged tensor from a previously split numpy array for each fold.

Thanks :slight_smile: