Randomly sampling points to ensure an equal number per class

Hi folks.

I currently have a requirement that each batch of data should contain an equal number of samples from each of the given classes.

I am implementing it in a naive way for CIFAR10:

def support_sampler():
    # Collect 16 randomly chosen indices for each of the 10 classes.
    idx_dict = dict()
    for class_id in np.arange(0, 10):
        class_idx = np.where(sampled_labels == class_id)[0]
        random_sampled = np.random.choice(class_idx, 16)
        idx_dict[class_id] = random_sampled
    return np.concatenate(list(idx_dict.values()))

def get_support_ds():
    random_balanced_idx = support_sampler()
    temp_train, temp_labels = sampled_train[random_balanced_idx],\
        sampled_labels[random_balanced_idx]
    support_ds = tf.data.Dataset.from_tensor_slices((temp_train, temp_labels))
    support_ds = (
        support_ds
        .shuffle(BATCH_SIZE * 1000)
        .map(augmentation, num_parallel_calls=AUTO)
        .batch(BATCH_SIZE)
    )
    return support_ds

Is there a better way? Particularly using pure TF ops with tf.data?

1 Like

Here, the approach I used was to make a dataset for each class and then merge them.

I used sample_from_datasets, so the balance is only approximate. But you could also zip the datasets and then .map a function to stack all the zipped tensors.
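
Roughly, a minimal sketch of that idea (assuming the sampled_train / sampled_labels arrays from the post above; the names and the 10-class setup are just illustrative):

import tensorflow as tf

NUM_CLASSES = 10

full_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))

def dataset_for_class(class_id):
    # Keep only the examples of one class and repeat them indefinitely.
    return full_ds.filter(lambda image, label: label == class_id).repeat()

per_class = [dataset_for_class(i) for i in range(NUM_CLASSES)]

# Draw from each per-class dataset with equal probability. The balance is only
# approximate because each element's class is sampled independently.
balanced_ds = tf.data.experimental.sample_from_datasets(
    per_class, weights=[1.0 / NUM_CLASSES] * NUM_CLASSES)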

3 Likes

Thanks, Mark. I later revisited that tutorial and found out about that neat method. It solved my purpose.

I think having a separate sampler utility for tf.data pipelines might be better from a usability standpoint.

2 Likes

There is also this “rejection resample” function, tf.data.experimental.rejection_resample:

4 Likes

Oh my. This is really neat. Thanks for sharing.

I need to extend the example for my use case.

1 Like

@markdaoust here’s what I tried:

from collections import Counter

def class_func(image, label):
    return label

SUPPORT_BATCH_SIZE = 640

(x_train, y_train), (_, _) = tf.keras.datasets.cifar10.load_data()

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()
sampled_labels = sampled_labels.astype("int32")
support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))

distribution = Counter(sampled_labels)
# Order the counts by class id so initial_dist lines up with class_func's output.
counts = np.array([distribution[i] for i in range(10)])
fractions = counts/counts.sum().astype("float64")

target_distribution = np.array([0.1] * 10).astype("float64")
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=target_distribution, initial_dist=fractions)
support_ds = support_ds.apply(resampler).batch(SUPPORT_BATCH_SIZE)

Here’s the root error:

TypeError: Input 'y' of 'Less' Op has type float64 that does not match type float32 of argument 'x'.

Any idea what I might have missed? Here’s the Colab if you want to give it a shot.

1 Like

To get your code to work, replace:

fractions = counts/counts.sum().astype("float64")

target_distribution = np.array([0.1] * 10).astype("float64")

with:

fractions = counts/counts.sum()
fractions = fractions.astype("float32")

target_distribution = np.array([0.1] * 10).astype("float32")

The implementation is just being a bit careless with the dtypes.

Here it does a random_ops.random_uniform([], seed=seed) < p.

That uniform random returns a float32, so p needs to be float32, or it should say random_ops.random_uniform([], seed=seed, dtype=p.dtype).

Or it should assert that all those arguments are float32, or cast them to float32.
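
A minimal standalone reproduction of that mismatch (illustrative only, not the rejection_resample internals themselves):

import tensorflow as tf

p64 = tf.constant(0.1, dtype=tf.float64)

# tf.random.uniform defaults to float32, so comparing it against a float64
# tensor triggers the same kind of dtype error as above.
try:
    _ = tf.random.uniform([]) < p64
except Exception as e:
    print(type(e).__name__, e)

# Casting the probability to float32 (as in the fix above) avoids it.
print(tf.random.uniform([]) < tf.cast(p64, tf.float32))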

3 Likes

That worked. Thank you.

Indeed, the dtype part was confusing to understand.

1 Like

Although the code works fine, the distribution is not what I would expect (the expectation here is a uniform distribution across the labels). Here’s a batch-wise summary:

Counter({6: 73, 1: 72, 7: 71, 5: 67, 0: 65, 8: 64, 9: 63, 4: 57, 3: 55, 2: 53})
Counter({9: 74, 0: 70, 4: 70, 2: 69, 3: 68, 1: 66, 7: 62, 6: 56, 5: 53, 8: 52})
Counter({0: 75, 3: 71, 6: 70, 1: 69, 8: 64, 9: 63, 4: 63, 2: 60, 7: 55, 5: 50})
Counter({4: 74, 0: 72, 7: 72, 1: 67, 5: 66, 6: 65, 3: 63, 9: 59, 2: 52, 8: 50})
Counter({2: 78, 7: 78, 6: 75, 1: 68, 4: 62, 5: 62, 9: 56, 0: 56, 3: 55, 8: 50})

With 640 samples in each batch, I would expect 64 per class.

1 Like

I tried another approach:

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()
sampled_labels = sampled_labels.astype("int32")
support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))

ds = []

for i in np.arange(0, 10):
    ds_label = (
        support_ds
        .filter(lambda image, label: label == i)
        .repeat())
    ds.append(ds_label)

balanced_ds = tf.data.experimental.sample_from_datasets(
    ds, [0.1] * 10).batch(SUPPORT_BATCH_SIZE)

But here also when I do:

for samples, labels in balanced_ds.take(10):
    print(Counter(labels.numpy()))

the distribution does not come out as expected:

Counter({9: 74, 0: 73, 3: 71, 8: 70, 1: 70, 5: 67, 7: 64, 2: 55, 6: 51, 4: 45})
Counter({2: 76, 3: 70, 4: 68, 1: 67, 6: 64, 0: 62, 7: 62, 8: 60, 9: 56, 5: 55})
Counter({1: 78, 2: 75, 7: 74, 0: 68, 9: 67, 3: 61, 5: 58, 8: 55, 4: 54, 6: 50})
Counter({6: 82, 9: 69, 5: 68, 4: 64, 1: 63, 3: 62, 7: 62, 8: 61, 2: 56, 0: 53})
Counter({6: 76, 2: 69, 5: 69, 8: 68, 4: 67, 0: 66, 1: 59, 3: 59, 9: 55, 7: 52})
Counter({8: 77, 9: 71, 4: 68, 0: 66, 2: 66, 6: 66, 7: 64, 5: 62, 1: 60, 3: 40})
Counter({8: 86, 9: 66, 4: 65, 1: 64, 2: 62, 5: 61, 0: 60, 6: 60, 3: 58, 7: 58})
Counter({7: 75, 8: 73, 6: 70, 5: 70, 3: 68, 9: 64, 4: 61, 0: 55, 2: 53, 1: 51})
Counter({6: 78, 1: 70, 5: 67, 0: 66, 2: 66, 4: 64, 8: 60, 3: 58, 9: 56, 7: 55})
Counter({9: 75, 7: 70, 8: 69, 3: 67, 4: 65, 5: 63, 2: 62, 1: 57, 0: 57, 6: 55})

@markdaoust

1 Like

Don’t trust a person’s ability to evaluate a probability distribution at a glance.

Here’s an independent implementation that gets equivalent results:

import numpy as np

for _ in range(10):
  d = np.zeros(10)
  for n in range(640):
    d[np.random.randint(10)] += 1
  print(sorted(d, reverse=True))
[79.0, 71.0, 69.0, 68.0, 65.0, 61.0, 60.0, 59.0, 58.0, 50.0]
[78.0, 70.0, 70.0, 68.0, 67.0, 64.0, 62.0, 57.0, 56.0, 48.0]
[78.0, 73.0, 70.0, 69.0, 67.0, 62.0, 59.0, 57.0, 53.0, 52.0]
[74.0, 71.0, 70.0, 68.0, 66.0, 61.0, 61.0, 60.0, 56.0, 53.0]
[77.0, 70.0, 67.0, 65.0, 65.0, 63.0, 62.0, 60.0, 57.0, 54.0]
[76.0, 73.0, 68.0, 67.0, 66.0, 61.0, 59.0, 58.0, 56.0, 56.0]
[74.0, 74.0, 70.0, 69.0, 68.0, 67.0, 65.0, 59.0, 48.0, 46.0]
[85.0, 69.0, 68.0, 66.0, 62.0, 61.0, 61.0, 59.0, 56.0, 53.0]
[73.0, 71.0, 67.0, 67.0, 65.0, 63.0, 61.0, 58.0, 58.0, 57.0]
[72.0, 70.0, 68.0, 67.0, 65.0, 63.0, 62.0, 60.0, 59.0, 54.0]

I’m not sure what the right statistical test is (something Dirichlet), but use a bigger sample size and you’ll see that it’s converging. With 1e6 samples everything’s within 1%:

d = np.random.randint(10, size=int(1e6))
counts, _ = np.histogram(d, bins=range(11))
counts
array([100254,  99351, 100098, 100162,  99747, 100369,  99793, 100247,
       100039,  99940])
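
For what it’s worth, a chi-square goodness-of-fit test against the uniform distribution would be one reasonable check here (a sketch using scipy.stats.chisquare; treat it as illustrative):

import numpy as np
from scipy.stats import chisquare

d = np.random.randint(10, size=int(1e6))
counts, _ = np.histogram(d, bins=range(11))

# Null hypothesis: the counts come from a uniform distribution over 10 classes.
# chisquare defaults to uniform expected frequencies when f_exp is omitted.
stat, p_value = chisquare(counts)
print(stat, p_value)  # a large p-value means no evidence against uniformity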

If you want to force exact balance, then with one dataset per class you can:

import tensorflow as tf

datasets = tuple(tf.data.Dataset.from_tensors(n).repeat() for n in range(10))
zipped = tf.data.Dataset.zip(datasets)
stacked = zipped.map(lambda *args: tf.stack(args, axis=0))

stacked.element_spec
TensorSpec(shape=(10,), dtype=tf.int32, name=None)
tf.data.experimental.get_single_element(stacked.take(1))
<tf.Tensor: shape=(10,), dtype=int32, 
  numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>
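
Applied to the (image, label) datasets from earlier in the thread, the same zip-and-stack idea might look roughly like this (a sketch assuming ds is the list of per-class datasets built in the for-loop above; the batch size is illustrative):

import tensorflow as tf

# Each element of the zipped dataset is a tuple of 10 (image, label) pairs,
# one drawn from each class dataset.
zipped = tf.data.Dataset.zip(tuple(ds))

def stack_classes(*pairs):
    images = tf.stack([image for image, label in pairs], axis=0)
    labels = tf.stack([label for image, label in pairs], axis=0)
    return images, labels

# Every element now holds exactly one example per class; unbatching and
# re-batching by 640 gives exactly 64 examples per class in every batch.
balanced_ds = zipped.map(stack_classes).unbatch().batch(640)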
3 Likes

Thanks for the pointers.

1 Like

Also:

2 Likes

@markdaoust it just keeps getting interesting:

Exactly what I wanted:

Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})
Counter({0: 64, 1: 64, 2: 64, 3: 64, 4: 64, 5: 64, 6: 64, 7: 64, 8: 64, 9: 64})

Crux of the code:

def dataset_for_class(i):
    i = tf.cast(i, tf.uint8)
    return support_ds.filter(lambda image, label: label == i).repeat()

sampled_idx = np.random.choice(len(x_train), 4000)
sampled_train, sampled_labels = x_train[sampled_idx], y_train[sampled_idx].squeeze()

support_ds = tf.data.Dataset.from_tensor_slices((sampled_train, sampled_labels))
stratified_ds = tf.data.Dataset.range(10).interleave(dataset_for_class, cycle_length=10) 
stratified_ds = stratified_ds.batch(640)

Notes:

  • Dataset is CIFAR10.
  • I made sure that the images getting batched are different, as you will notice in the notebook provided above.
2 Likes

Yeah, that interleave is basically equivalent to the zip.

def dataset_for_class(i):
    i = tf.cast(i, tf.uint8)
    return support_ds.filter(lambda image, label: label == i).repeat()

Just remember that if you’re splitting a dataset like that, the dataset for each class loads the whole dataset and throws out all but 1/n of it. So if you have a larger dataset with a larger number of classes, you’ll probably want to cache each of the class datasets (but there might also be a way to fix it with queues).
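
For example, caching each filtered dataset (a sketch; cache() keeps the filtered examples in memory after the first pass, or on disk if you pass it a filename):

def dataset_for_class(i):
    i = tf.cast(i, tf.uint8)
    return (
        support_ds
        .filter(lambda image, label: label == i)
        # Materialize the filtered examples once instead of re-reading and
        # re-filtering the full dataset on every repeat.
        .cache()
        .repeat())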

2 Likes

True that. Let’s just continue putting together our hacks and benchmarking them. Who knows, future readers may find these incredibly useful.

On a slightly related note, as you may already know, this kind of stratified sampling is pretty common for few-shot classification tasks (particularly for models like Prototypical Networks). It might be a good idea to work on a tutorial covering this topic.

2 Likes

@Sayak_Paul @markdaoust
Thanks for this insightful discussion and the working workarounds mentioned here. It’s really helpful.


I’m trying to get similar output from the tf.data API, especially while working with the tf-similarity data sampler. For the TFRecord format, it also adopts similar functions (interleave) from tf.data, here. But those samplers additionally require each class to appear contiguously. For example, in a batch with the number of repeated samples = 4:

[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, ...]

Is there any convenient function in the tf.data API to achieve this ordering when batching the training pairs? The following approach might be the way, but I’m hoping for something more optimized after the interleave. Any tips?

batch_size = 240
num_classes = 10

dataset_encode = tf.data.Dataset.range(num_classes)
dataset_encode = dataset_encode.interleave(dataset_for_class, 
                                           cycle_length=10)
dataset_encode = dataset_encode.batch(batch_size) 


dataset0 = tuple(
    dataset_encode.filter(lambda x, y: tf.equal(y[n], n))
    for n in range(num_classes)
)
...
zipped = tf.data.Dataset.zip(dataset0)
...

Update

One possible solution.

dataset = tf.data.Dataset.range(1, 6)  
dataset = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(5),
    cycle_length=1, 
    block_length=3,
)
list(dataset.as_numpy_iterator())

[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 
3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5]
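
Applied to the per-class datasets, the same block_length idea might look like this (a sketch assuming dataset_for_class from earlier in the thread; the 4-repeat, 40-element batch sizes are illustrative):

num_classes = 10
repeats_per_class = 4

# block_length=4 pulls 4 consecutive examples from each class dataset before
# moving on to the next class, so each batch reads
# [0, 0, 0, 0, 1, 1, 1, 1, ..., 9, 9, 9, 9].
contiguous_ds = (
    tf.data.Dataset.range(num_classes)
    .interleave(dataset_for_class,
                cycle_length=num_classes,
                block_length=repeats_per_class)
    .batch(num_classes * repeats_per_class)
)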

I opened a ticket here about the slowness of this approach in the tf.data API.

It is also hard to create equal splits in TF datasets: