How to do minority class sampling using TensorFlow?

Hi, I’m new to this field and I’m trying to do minority class sampling.
I have about 754975 cropped CT images, each of size 19 × 19 × 19, saved as .npy files on my local disk.

The ground truth is saved as a .csv file with the state of each image, non-nodule or nodule (0 or 1). The data is imbalanced: 1186 images are labelled 1 and all the rest are labelled 0.

I need to do minority class sampling as follows:
2000 images for the validation set (700 nodule, 1300 non-nodule).
752975 images for the training set (486 nodule, 752489 non-nodule).

I tried to do it with the following code, but the problem is that the allocated memory exceeds my PC's RAM (32 GB):

import os
import gc
import sys

import numpy as np
import pandas as pd

nodules_path = os.path.expanduser("~/cropped_nodules/")
nodules_csv = pd.read_csv(os.path.expanduser("~/cropped_nodules_2.csv"))

positive = 0
negative = 0
x_val = []
x_train = []
y_train = []
y_val = []

# itertuples() yields namedtuples, so nodule.state and nodule.SN work
# (iterrows() yields (index, Series) pairs, so nodule.state would fail).
for nodule in nodules_csv.itertuples():
    if nodule.state == 1 and positive < 700 and len(x_val) < 2000:
        positive += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    elif nodule.state == 0 and negative < 1300 and len(x_val) < 2000:
        negative += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    else:
        if len(x_train) % 10000 == 0:
            gc.collect()
            print("gc done")
        x_train_img = str(nodule.SN) + ".npy"
        x_train.append(np.load(os.path.join(nodules_path, x_train_img)))
        y_train.append(nodule.state)
        print("x_train len=", len(x_train))
        # sys.getsizeof only measures the list of references, not the arrays
        # it points to, so this number stays misleadingly small.
        print("Size of x_train list:", sys.getsizeof(x_train), "bytes")

I tried many things to stop filling up the memory, but I think the real solution is not to load the whole dataset into memory at all; I need a different approach.

This post on Stack Overflow summarizes my problem and my attempts to solve the memory issue.

I couldn't figure out how to load the data properly using tf.data (or any other method).
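
One pattern that avoids loading everything is to keep only the file paths and labels in memory and let tf.data read each .npy from disk on demand. A minimal sketch of that idea, assuming the SN and state columns described above (the helper names load_npy / tf_load_npy are placeholders, and the train/validation split would be applied to the path and label lists before building the dataset):

import os

import numpy as np
import pandas as pd
import tensorflow as tf

nodules_path = os.path.expanduser("~/cropped_nodules/")
nodules_csv = pd.read_csv(os.path.expanduser("~/cropped_nodules_2.csv"))

# Keep only file paths and integer labels in memory; the arrays stay on disk.
file_paths = [os.path.join(nodules_path, str(sn) + ".npy") for sn in nodules_csv["SN"]]
labels = nodules_csv["state"].astype("int32").tolist()

def load_npy(path, label):
    # Runs as plain Python/NumPy inside the pipeline; path arrives as bytes.
    volume = np.load(path.decode("utf-8")).astype(np.float32)
    return volume[..., np.newaxis], label          # add a channel axis: (19, 19, 19, 1)

def tf_load_npy(path, label):
    volume, label = tf.numpy_function(load_npy, [path, label], (tf.float32, tf.int32))
    volume.set_shape((19, 19, 19, 1))              # restore the shape lost by numpy_function
    label.set_shape(())
    return volume, label

samples = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(len(file_paths))               # shuffling file names is cheap
           .map(tf_load_npy, num_parallel_calls=tf.data.AUTOTUNE))

train_ds = samples.batch(32).prefetch(tf.data.AUTOTUNE)

This way only the batches currently in flight are ever held in memory, regardless of how many .npy files there are.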

I know the data is really imbalanced; I'll try several things to deal with the imbalance
(minority class sampling, data augmentation, minority oversampling, and a weighted loss such as weighted binary cross-entropy, sketched below).
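
For the weighted-loss part, Keras can apply per-class weights through the class_weight argument of model.fit. A rough sketch using the counts above; the tiny 3D CNN is only a placeholder, and the fit call is commented out because train_ds / val_ds depend on whatever input pipeline ends up being used:

import tensorflow as tf

# Counts from the post: 1186 nodule images (label 1) out of 754975 total.
n_total, n_pos = 754975, 1186
n_neg = n_total - n_pos

# Weight each class inversely to its frequency so the rare class
# contributes comparably to the loss.
class_weight = {0: n_total / (2.0 * n_neg), 1: n_total / (2.0 * n_pos)}

# Placeholder 3D CNN for the 19x19x19 crops (with a channel axis).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(19, 19, 19, 1)),
    tf.keras.layers.Conv3D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.AUC(name="auc")])

# train_ds / val_ds: any pipeline yielding (volume, label) batches.
# model.fit(train_ds, validation_data=val_ds, epochs=10, class_weight=class_weight)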

Any help will be appreciated, thanks in advance.


See if this works for you:

Earlier this summer I implemented a stratified sampler with tf.data that you could refer to as well:

The script is a bit involved so please feel free to ask questions as needed.
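
Whatever the exact script looks like, stratified resampling with tf.data usually comes down to tf.data.Dataset.sample_from_datasets: one lazy pipeline per class, resampled at a chosen ratio. A rough sketch that reuses file_paths, labels, and tf_load_npy from the earlier sketch (the 0.3 / 0.7 weights are arbitrary, not taken from the referenced script):

import tensorflow as tf

# Split the path/label lists per class (cheap: only strings and ints in memory).
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

def make_class_ds(idx):
    ds = tf.data.Dataset.from_tensor_slices(
        ([file_paths[i] for i in idx], [labels[i] for i in idx]))
    # Shuffle file names, load volumes lazily, and repeat forever.
    return ds.shuffle(len(idx)).map(tf_load_npy,
                                    num_parallel_calls=tf.data.AUTOTUNE).repeat()

# Draw roughly 30% positives per batch instead of the natural ~0.16%.
# (tf.data.Dataset.sample_from_datasets needs TF >= 2.7; older versions
#  expose it as tf.data.experimental.sample_from_datasets.)
balanced = tf.data.Dataset.sample_from_datasets(
    [make_class_ds(pos_idx), make_class_ds(neg_idx)], weights=[0.3, 0.7])

balanced_train_ds = balanced.batch(32).prefetch(tf.data.AUTOTUNE)
# The per-class streams repeat forever, so pass steps_per_epoch to model.fit.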
