Dataset from generator shuffling

Hi,

I have created a dataset from a generator like this:

ds_series = tf.data.Dataset.from_generator(
    trim_size, args=[data_input_tot_EqLen, trimmed_lbl, seq_len, max_len_per],
    output_types=(tf.float32, tf.int32),
    output_shapes=((5511, 101, 3), (1,)))

Then I shuffle the dataset and split it into training and validation sets:

ds_series = ds_series.shuffle(buffer_size=16)
ds_train = ds_series.take(train_smpls)
ds_valid = ds_series.skip(train_smpls)

I’d like to count the number of samples in each class, so I’d like to see which labels end up in the training and validation datasets.

I run the following command:

_, lbl_train = ds_train

This takes a lot of time (I understand this is because trim_size, which I defined above, is pretty heavy), but my question is about the messages it shows:

I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 1 of 16

So it counts the filling of the buffer from 1 to 16. However, this does not fit with what the documentation says about the shuffle buffer size:

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle

As I understand it, shuffle is supposed to draw random samples from a 16-sample buffer, which would mean that the randomization process is not limited to 16 samples.

Am I wrong here?

Hi @Afshin_Samani

Welcome to the TensorFlow Forum!

Yes, buffer_size is the number of elements kept in memory for shuffling. shuffle() first fills this buffer, then randomly samples an element from it and replaces the selected element with the next element fetched from the dataset, as described in the dataset.shuffle() documentation. Because the buffer is topped up after every draw, it stays full, and the process repeats so that elements are randomly shuffled before being yielded.

Please see the example description:

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
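The buffer behaviour described above can be imitated in a few lines of plain Python. This is only a rough sketch of the idea, not TensorFlow's actual implementation; the function name buffered_shuffle and the toy input range(100) are made up for illustration. It shows that with a 16-element buffer, an element can never be emitted more than 15 positions earlier than its original position, although it may be emitted much later.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Initial fill: this is the phase the "Filling up shuffle buffer"
            # log message reports on.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new one in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in the buffer in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


order = list(buffered_shuffle(range(100), buffer_size=16, seed=0))

# Every element appears exactly once.
print(sorted(order) == list(range(100)))
# The k-th output can only come from the first k + 16 inputs.
print(all(value <= k + 15 for k, value in enumerate(order)))
```

So a small buffer_size gives only a local shuffle; the documentation notes that a perfect shuffle needs a buffer at least as large as the dataset.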

Regarding the log message (the leading "I" marks it as informational, not an error): I tensorflow/core/kernels/data/shuffle_dataset_op.cc:175] Filling up shuffle buffer (this may take a while): 1 of 16

This informational message reports the progress of the initial buffer fill in memory. Once the buffer is full, these messages stop and the dataset starts yielding shuffled elements from the buffer.
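As for counting the samples per class: a statement like `_, lbl_train = ds_train` tries to unpack the dataset itself rather than its elements, so it is usually easier to iterate over the dataset and tally the labels. Below is a minimal sketch, assuming a small toy dataset built with from_tensor_slices in place of your generator; the feature values, labels, train_smpls=7 and the helper count_labels are all made up for illustration. It also assumes you want the same shuffled order on every pass, hence seed and reshuffle_each_iteration=False; without that, take() and skip() may see different orders on separate iterations.

```python
from collections import Counter

import numpy as np
import tensorflow as tf

# Toy stand-in for the generator-based dataset: 10 samples, labels of shape (1,).
features = np.zeros((10, 4), dtype=np.float32)
labels = np.array([[0], [1], [0], [2], [1], [0], [2], [1], [0], [0]],
                  dtype=np.int32)
ds_series = tf.data.Dataset.from_tensor_slices((features, labels))

train_smpls = 7  # hypothetical split size

# Fix the order so that take() and skip() split one and the same permutation.
ds_series = ds_series.shuffle(buffer_size=16, seed=0,
                              reshuffle_each_iteration=False)
ds_train = ds_series.take(train_smpls)
ds_valid = ds_series.skip(train_smpls)


def count_labels(ds):
    """Iterate once over the dataset and count how often each label occurs."""
    counts = Counter()
    for _, lbl in ds.as_numpy_iterator():
        counts[int(lbl[0])] += 1
    return counts


train_counts = count_labels(ds_train)
valid_counts = count_labels(ds_valid)
print("train:", dict(train_counts))
print("valid:", dict(valid_counts))
```

Note that every pass over ds_train or ds_valid runs the generator again, so with a heavy trim_size it may be worth counting the labels only once.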

Thank you.