How to get the label distribution of a `tf.data.Dataset` efficiently?

Sayak_Paul · March 8, 2022, 2:43am

The naive option is to use something like this:

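import collections

import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
data_np = np.random.choice(num_classes, num_samples)

# Assumed construction: the snippet as posted never defines `dataset`.
# Each element is a (label, placeholder) pair so the unpacking below works.
dataset = tf.data.Dataset.from_tensor_slices((data_np, data_np))

# Count labels by iterating over the whole dataset in Python.
y = collections.defaultdict(int)
for cls, _ in dataset:
  y[int(cls.numpy())] += 1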

Bhack · March 8, 2022, 2:44pm

If you are looking for a non-numpy solution, there was an API request at:

github.com/tensorflow/datasets

Count duplicates… e.g., `tf.data.experimental.unique` + `tf.data.experimental.cardinality`

opened 05:00AM - 01 Jan 21 UTC · SamuelMarks · enhancement

**Is your feature request related to a problem? Please describe.**
My NN results are good. Too good. So I'm trying to figure out why it's so good. Maybe there are duplicates somewhere?

**Describe the solution you'd like**
Would be great to just count the duplicates.

**Describe alternatives you've considered**
I'm sure I'll end up hashing everything matrix into strings or [more efficiently… assuming it fits] scalars. Then deduplicate that list.

**Additional context**
Might be good to have a proper summary, combining https://stackoverflow.com/a/60877708 to get the counts per label, with this solution (unique counts per label, counts per split, and total counts for each).

EDIT: Just found https://www.tensorflow.org/tfx/guide/tfdv, https://www.tensorflow.org/tfx/data_validation/get_started and reading https://pair-code.github.io/facets/ now…

Bhack · March 8, 2022, 2:51pm

With numpy you could use many solutions like:
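The linked example is not preserved in this thread. As a minimal sketch of the NumPy route, assuming the labels are already available as an integer array (as with `data_np` in the question), `np.unique` with `return_counts=True` or `np.bincount` both give per-class counts. The variable names below are illustrative only.

```python
import numpy as np

num_classes = 2
num_samples = 10000
# Assumption: labels are non-negative integers held in memory as one array.
labels_np = np.random.choice(num_classes, num_samples)

# Option 1: works for any label dtype; returns the sorted unique labels
# together with how often each one occurs.
classes, counts = np.unique(labels_np, return_counts=True)
distribution = dict(zip(classes.tolist(), counts.tolist()))

# Option 2: for non-negative integer labels only; index i holds the count
# of class i, and minlength keeps a slot for classes that never occur.
bincounts = np.bincount(labels_np, minlength=num_classes)

print(distribution)
print(bincounts)
```

Both calls need the full label array in memory, so they fit best when the labels can be extracted cheaply from the dataset.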

Sayak_Paul · March 8, 2022, 3:10pm

Not sure how these methods would scale.

I would give this one a try:
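The link from this post is not preserved here. Since the reply that follows points at the reduce kernel, one plausible reading is a count built on `tf.data.Dataset.reduce`. The sketch below is written under that assumption and is not necessarily the approach that was linked; the dataset layout, the batch size, and the helper name `add_batch_counts` are all hypothetical.

```python
import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
labels_np = np.random.choice(num_classes, num_samples)

# Assumed layout: (label, placeholder) pairs, mirroring `cls, _ = i` in the question.
dataset = tf.data.Dataset.from_tensor_slices((labels_np, labels_np))

# Keep only the labels and batch them so each reduce step handles many examples.
label_batches = dataset.map(lambda cls, _: cls).batch(1024)


def add_batch_counts(state, batch):
    # One-hot encode the batch and sum over the batch axis to get
    # per-class counts for this batch, then add them to the running total.
    one_hot = tf.one_hot(batch, depth=num_classes, dtype=tf.int64)
    return state + tf.reduce_sum(one_hot, axis=0)


initial_state = tf.zeros([num_classes], dtype=tf.int64)
counts = label_batches.reduce(initial_state, add_batch_counts)

print(counts.numpy())
```

This still visits every element once, but the loop runs inside the TensorFlow runtime rather than in Python, and the running state is a fixed-size vector of `num_classes` integers.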

Bhack · March 8, 2022, 3:21pm

It does something similar: it still iterates over the full dataset, but in C++:

github.com

tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/reduce_dataset_op.cc#L89-L102

// Iterate through the input dataset.
while (true) {
  if (ctx->cancellation_manager()->IsCancelled()) {
    return errors::Cancelled("Operation was cancelled");
  }
  std::vector<Tensor> next_input_element;
  bool end_of_input;
  TF_RETURN_IF_ERROR(
      iterator->GetNext(&iter_ctx, &next_input_element, &end_of_input));
  if (end_of_input) {
    break;
  }

  // Run the reduce function to update the current state.