The naive option is to iterate the dataset eagerly and count labels in Python:

import collections

import numpy as np
import tensorflow as tf

num_classes = 2
num_samples = 10000
data_np = np.random.choice(num_classes, num_samples)

# The original snippet assumed `dataset` already exists; one way to build a
# (label, feature) dataset from the labels above:
dataset = tf.data.Dataset.from_tensor_slices((data_np, np.zeros(num_samples)))

y = collections.defaultdict(int)
for cls, _ in dataset:
    y[cls.numpy()] += 1
Bhack
March 8, 2022, 2:44pm
#3
If you are looking for a non-numpy solution, there was an API request (opened 01 Jan 2021, labeled enhancement) at:
**Is your feature request related to a problem? Please describe.**
My NN results are good. Too good. So I'm trying to figure out why it's so good. Maybe there are duplicates somewhere?
**Describe the solution you'd like**
Would be great to just count the duplicates.
**Describe alternatives you've considered**
I'm sure I'll end up hashing every matrix into strings or [more efficiently, assuming it fits] scalars, then deduplicating that list.
**Additional context**
Might be good to have a proper summary, combining https://stackoverflow.com/a/60877708 to get the counts per label, with this solution (unique counts per label, counts per split, and total counts for each).
EDIT: Just found https://www.tensorflow.org/tfx/guide/tfdv, https://www.tensorflow.org/tfx/data_validation/get_started and reading https://pair-code.github.io/facets/ now…
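The hashing idea described in that issue can be sketched in plain NumPy (an illustrative sketch with toy data, not the issue author's actual code; in practice you would hash the serialized tensors of your real dataset):

```python
import collections

import numpy as np

# Toy dataset: 1000 small binary matrices, so collisions are guaranteed.
samples = np.random.randint(0, 2, size=(1000, 2, 2))

# Hash each matrix by its raw bytes and count occurrences.
counts = collections.Counter(arr.tobytes() for arr in samples)

# Extra copies beyond the first occurrence of each distinct matrix.
num_duplicates = sum(c - 1 for c in counts.values())
```

`tobytes()` is only a stand-in for a proper hash; it works here because the matrices are small and share a dtype and shape.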
Bhack
March 8, 2022, 2:51pm
#4
With numpy you could use many solutions like:
Not sure how these methods would scale.
I would give this one a try:
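The links from the original post are not preserved here, but a typical NumPy approach to per-class counts looks like this (a generic sketch, not necessarily the linked solution):

```python
import numpy as np

labels = np.random.choice(2, 10000)

# Unique labels together with how often each occurs.
classes, counts = np.unique(labels, return_counts=True)

# For non-negative integer labels, np.bincount gives the same histogram.
bin_counts = np.bincount(labels, minlength=2)
```

Both are vectorized, so they should scale far better than a Python-side loop over the dataset.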
Bhack
March 8, 2022, 3:21pm
#6
It does something similar, iterating over the full dataset, but in C++:
// Iterate through the input dataset.
while (true) {
  if (ctx->cancellation_manager()->IsCancelled()) {
    return errors::Cancelled("Operation was cancelled");
  }
  std::vector<Tensor> next_input_element;
  bool end_of_input;
  TF_RETURN_IF_ERROR(
      iterator->GetNext(&iter_ctx, &next_input_element, &end_of_input));
  if (end_of_input) {
    break;
  }
  // Run the reduce function to update the current state.
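That C++ loop is the machinery behind `tf.data.Dataset.reduce`, so the class counts from the first post can also be accumulated inside the pipeline rather than in a Python loop (a sketch with made-up toy data):

```python
import numpy as np
import tensorflow as tf

num_classes = 2
labels = np.random.choice(num_classes, 10000)
dataset = tf.data.Dataset.from_tensor_slices(labels)

# Add a one-hot vector per element; the final state is the class histogram.
counts = dataset.reduce(
    tf.zeros(num_classes, dtype=tf.int64),
    lambda state, label: state + tf.one_hot(label, num_classes, dtype=tf.int64))
```

Whether this scales better than the NumPy approaches above depends on where the data already lives; its advantage is that it works on any dataset, not just one backed by an in-memory array.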