How to do minority class sampling using TensorFlow?

Hi, I’m new to this field and I’m trying to do minority class sampling.
I have about 754975 cropped CT images, each of size 19 × 19 × 19, saved as .npy files on my local disk.

The ground truth is saved as a .csv file with the state of each image, non-nodule or nodule (0 or 1). The data is imbalanced: 1186 images are labelled 1 and all the rest are labelled 0.

I need to do minority class sampling as follows:
2000 images for the validation set (700 nodule, 1300 non-nodule).
752975 images for the training set (486 nodule, 752489 non-nodule).

I tried to do it with the following code, but the problem is that the allocated memory exceeds my PC's RAM (32 GB):

import os
import gc
import sys

import numpy as np
import pandas as pd

nodules_path = os.path.expanduser("~/cropped_nodules/")
nodules_csv = pd.read_csv(os.path.expanduser("~/cropped_nodules_2.csv"))

positive = 0
negative = 0
x_val = []
x_train = []
y_train = []
y_val = []

# itertuples() yields namedtuples, so nodule.state and nodule.SN work
# (iterrows() yields (index, Series) pairs, so nodule.state would fail).
for nodule in nodules_csv.itertuples():
    if nodule.state == 1 and positive < 700 and len(x_val) < 2000:
        positive += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    elif nodule.state == 0 and negative < 1300 and len(x_val) < 2000:
        negative += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    else:
        if len(x_train) % 10000 == 0:
            gc.collect()
            print("gc done")
        x_train_img = str(nodule.SN) + ".npy"
        x_train.append(np.load(os.path.join(nodules_path, x_train_img)))
        y_train.append(nodule.state)
        print("x_train len=", len(x_train))
        # sys.getsizeof only measures the list of references, not the arrays
        # it points to, so this number stays misleadingly small.
        print("Size of x_train list:", sys.getsizeof(x_train), "bytes")

I tried many things to stop filling up the memory, but I think the real solution is not to load the whole dataset into memory at all; I need a different approach.

This post on Stack Overflow summarizes my problem and my attempts to solve the memory issue.

I couldn't figure out how to load the data properly using tf.data (or any other method).
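
One pattern that avoids loading everything is to keep only the file paths and labels in memory and let tf.data read each .npy from disk on demand. A minimal sketch of that idea, assuming the SN and state columns described above (the helper names load_npy / tf_load_npy are placeholders, and the train/validation split would be applied to the path and label lists before building the dataset):

import os

import numpy as np
import pandas as pd
import tensorflow as tf

nodules_path = os.path.expanduser("~/cropped_nodules/")
nodules_csv = pd.read_csv(os.path.expanduser("~/cropped_nodules_2.csv"))

# Keep only file paths and integer labels in memory; the arrays stay on disk.
file_paths = [os.path.join(nodules_path, str(sn) + ".npy") for sn in nodules_csv["SN"]]
labels = nodules_csv["state"].astype("int32").tolist()

def load_npy(path, label):
    # Runs as plain Python/NumPy inside the pipeline; path arrives as bytes.
    volume = np.load(path.decode("utf-8")).astype(np.float32)
    return volume[..., np.newaxis], label          # add a channel axis: (19, 19, 19, 1)

def tf_load_npy(path, label):
    volume, label = tf.numpy_function(load_npy, [path, label], (tf.float32, tf.int32))
    volume.set_shape((19, 19, 19, 1))              # restore the shape lost by numpy_function
    label.set_shape(())
    return volume, label

samples = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(len(file_paths))               # shuffling file names is cheap
           .map(tf_load_npy, num_parallel_calls=tf.data.AUTOTUNE))

train_ds = samples.batch(32).prefetch(tf.data.AUTOTUNE)

This way only the batches currently in flight are ever held in memory, regardless of how many .npy files there are.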

I know the data is really imbalanced; I'll try several things to deal with the imbalance
(minority class sampling, data augmentation, minority oversampling, and a weighted loss such as weighted binary cross-entropy, sketched below).
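
For the weighted-loss part, Keras can apply per-class weights through the class_weight argument of model.fit. A rough sketch using the counts above; the tiny 3D CNN is only a placeholder, and the fit call is commented out because train_ds / val_ds depend on whatever input pipeline ends up being used:

import tensorflow as tf

# Counts from the post: 1186 nodule images (label 1) out of 754975 total.
n_total, n_pos = 754975, 1186
n_neg = n_total - n_pos

# Weight each class inversely to its frequency so the rare class
# contributes comparably to the loss.
class_weight = {0: n_total / (2.0 * n_neg), 1: n_total / (2.0 * n_pos)}

# Placeholder 3D CNN for the 19x19x19 crops (with a channel axis).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(19, 19, 19, 1)),
    tf.keras.layers.Conv3D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.AUC(name="auc")])

# train_ds / val_ds: any pipeline yielding (volume, label) batches.
# model.fit(train_ds, validation_data=val_ds, epochs=10, class_weight=class_weight)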

Any help will be appreciated, thanks in advance.


See if this works for you:

Earlier this summer I implemented a stratified sampler with tf.data that you could refer to as well:

The script is a bit involved so please feel free to ask questions as needed.
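
Whatever the exact script looks like, stratified resampling with tf.data usually comes down to tf.data.Dataset.sample_from_datasets: one lazy pipeline per class, resampled at a chosen ratio. A rough sketch that reuses file_paths, labels, and tf_load_npy from the earlier sketch (the 0.3 / 0.7 weights are arbitrary, not taken from the referenced script):

import tensorflow as tf

# Split the path/label lists per class (cheap: only strings and ints in memory).
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

def make_class_ds(idx):
    ds = tf.data.Dataset.from_tensor_slices(
        ([file_paths[i] for i in idx], [labels[i] for i in idx]))
    # Shuffle file names, load volumes lazily, and repeat forever.
    return ds.shuffle(len(idx)).map(tf_load_npy,
                                    num_parallel_calls=tf.data.AUTOTUNE).repeat()

# Draw roughly 30% positives per batch instead of the natural ~0.16%.
# (tf.data.Dataset.sample_from_datasets needs TF >= 2.7; older versions
#  expose it as tf.data.experimental.sample_from_datasets.)
balanced = tf.data.Dataset.sample_from_datasets(
    [make_class_ds(pos_idx), make_class_ds(neg_idx)], weights=[0.3, 0.7])

balanced_train_ds = balanced.batch(32).prefetch(tf.data.AUTOTUNE)
# The per-class streams repeat forever, so pass steps_per_epoch to model.fit.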
