How to sort a tf.data.Dataset?

I have a tf.data.Dataset that contains features and a probability. (I created it by zipping my test dataset with the probabilities predicted by my binary classification model, thereby adding a probability “column” to the test dataset.)

I want to sort this dataset in descending order by probability. Can I do so directly, without converting the dataset to NumPy or a pandas DataFrame?


If you want to do visualisation, I’d suggest doing it with NumPy (using something like dataset.as_numpy_iterator()); that will be the easiest path.

Thanks, but is there really no way of working more directly on a tf.data.Dataset, thereby maintaining the lazy evaluation, caching, and consistency that datasets afford?

Are your sorting needs for balancing like:

My needs resemble those described in the thread you referenced but are more centered around sorting. Using a pandas dataframe, I can do:

top_scored_test_data = scored_test_data.sort_values(by='prediction', ascending=False)[:10]

Being able to do something similar with a tf.data.dataset would be convenient and potentially not require loading all the data into memory at the same time (or not require loading and sorting it until it is actually used).


Hi, were you able to find a solution to this requirement? If so, can you please share it?
Thanks.

Hi @Sudh_Kumar, once you have created a dataset with features and probabilities using tf.data.Dataset.zip, you can sort it by probability with the line below. Note that sorted() iterates over the entire dataset and returns a Python list, not a tf.data.Dataset.

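# Replace index 1 with the position of the probability in each dataset element.
# reverse=True gives descending order; float() turns the scalar tensor into a plain number.
sorted_dataset = sorted(dataset, key=lambda x: float(x[1]), reverse=True)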

Please refer to this gist for a working code example. Thank you.
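
If only the highest-scoring rows are needed (like the top 10 in the pandas example earlier in the thread), one option is to stream the dataset through Python's heapq.nlargest, which keeps at most k elements in memory instead of sorting everything. The following is only a sketch under assumptions: each element is taken to be a (features, probability) pair, and fake_scored_stream is a hypothetical stand-in for dataset.as_numpy_iterator() so that the snippet runs without TensorFlow installed.

```python
import heapq
import random


def fake_scored_stream(n, seed=0):
    """Hypothetical stand-in for dataset.as_numpy_iterator().

    Lazily yields (features, probability) pairs one at a time.
    """
    rng = random.Random(seed)
    for i in range(n):
        yield {"id": i}, rng.random()


def top_k_by_probability(stream, k):
    # nlargest consumes the iterable once and holds at most k items in memory.
    # The returned list is ordered from highest to lowest probability.
    return heapq.nlargest(k, stream, key=lambda pair: float(pair[1]))


top_scored = top_k_by_probability(fake_scored_stream(1000), k=10)
for features, probability in top_scored:
    print(features["id"], round(probability, 4))
```

With a real dataset, the call would presumably look like top_k_by_probability(dataset.as_numpy_iterator(), k=10). The result is a plain Python list, so this trades the laziness of tf.data for a bounded memory footprint.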

Hi Kiran, thanks for the reply.
But I am unable to use this because my TensorFlow dataset is a PrefetchDataset: I used the make_csv_dataset() function to read the CSV directly into a tf.data.Dataset, since it is a large dataset and I did not want to convert it to pandas, NumPy, or a list.
Is there a way I can sort my prefetched TensorFlow dataset?

Thanks
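
A possible direction for the make_csv_dataset() case, offered as a sketch rather than a confirmed solution: a PrefetchDataset is still an ordinary tf.data.Dataset and can be iterated, but make_csv_dataset() yields batches in which each column name maps to a batch of values, and by default it shuffles and repeats indefinitely (num_epochs=None), so num_epochs=1 would need to be set before iterating to the end. The snippet below flattens such column-oriented batches into rows lazily and keeps only the top k rows. fake_csv_batches and the column names "id" and "probability" are hypothetical stand-ins for dataset.as_numpy_iterator() and the real columns.

```python
import heapq


def fake_csv_batches():
    """Hypothetical stand-in for iterating a batched CSV dataset.

    Each element is one batch: a dict mapping column name -> sequence of values.
    """
    yield {"id": [0, 1, 2], "probability": [0.20, 0.95, 0.40]}
    yield {"id": [3, 4, 5], "probability": [0.70, 0.10, 0.99]}
    yield {"id": [6], "probability": [0.55]}  # the last batch may be smaller


def iter_rows(batches):
    # Lazily turn column-oriented batches into one dict per row.
    for batch in batches:
        columns = list(batch)
        for values in zip(*(batch[c] for c in columns)):
            yield dict(zip(columns, values))


def top_k_rows(batches, k, column="probability"):
    # Besides the current batch, only k rows are held in memory at a time.
    return heapq.nlargest(k, iter_rows(batches), key=lambda row: float(row[column]))


top_rows = top_k_rows(fake_csv_batches(), k=3)
for row in top_rows:
    print(row)
```

This covers the "top N by probability" case without building a list of the whole file. A full descending sort of a dataset that does not fit in memory would need a different approach, such as sorting the CSV outside TensorFlow before reading it.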
