Managing a dataset with unbalanced labels

sollted_sollted · December 26, 2023, 1:28am

I am training on a dataset that has three classes. Class, number of labels:

0, 3132
1, 492
-1, 12

As you can see, there is a huge imbalance here, so I wanted to fix it using class weights (maybe there is a better way). So I created a dict: {-1: 85.27777777777777, 0: 0.3919315715562364, 1: 2.2893363161819535}

and passed it to .fit: model.fit(self.Xs, self.ys, epochs=epoch, batch_size=batch, class_weight=class_weight_dict)

Error: ValueError: Expected class_weight to be a dict with keys from 0 to one less than the number of classes, found {-1: 85.27777777777777, 0: 0.3919315715562364, 1: 2.2893363161819535}

So I changed class_weight_dict to {0: 85.27777777777777, 1: 0.3919315715562364, 2: 2.2893363161819535}. It feels wrong, because I don't know how Keras is supposed to know which index is for which label, and I still get an error (it gets further, up to epoch 1/15):

2 root error(s) found.
(0) INVALID_ARGUMENT: indices[49] = -1 is not in [0, 3)
[[{{node GatherV2}}]]
[[IteratorGetNext]]
[[Cast/_16]]
(1) INVALID_ARGUMENT: indices[49] = -1 is not in [0, 3)
[[{{node GatherV2}}]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_5684]

This is my output layer:
model.add(Dense(3, activation='softmax'))
I wanted to use Dense(1, activation='tanh'), but ChatGPT said that is not a good idea and was not able to explain why. Maybe you could shed some light on that?

Compilation of the model:

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Thanks in advance for any explanations/solutions/ideas.

astle_dsa · December 26, 2023, 5:52am

Using softmax is actually much better, and it saves a lot of computation time.
The way the model can know which weight belongs to which class is if you have one-hot encoded the labels as:
-1: [0, 0, 1]
1: [1, 0, 0]
0: [0, 1, 0]

This clearly indicates the assignment of weights. Another way (if you can) is to lump the 1 and -1 classes together and perform a simple binary classification, which would be much simpler. If not, try using these weights:

Class weights = {
0: 0.342,
-1: 9.754,
1: 2.457
}

Calculation: Weight for class i = (1 / number of samples in class i) * (total number of samples / total number of classes)
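As a rough check of the formula above, the sketch below plugs in the label counts from the first post. The helper name balanced_class_weights is made up for illustration. With these counts the formula gives values that differ from the ones quoted in this thread, so the quoted numbers were presumably computed from different counts.

```python
# Label counts taken from the first post (original labels -1, 0, 1).
counts = {0: 3132, 1: 492, -1: 12}


def balanced_class_weights(counts):
    """Weight for class i = (1 / n_i) * (total / number_of_classes)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)
for label, w in sorted(weights.items()):
    print(f"class {label:>2}: {w:.3f}")
```

For the 12-sample class this comes to 3636 / (3 * 12) = 101. The keys here are still the original labels, so they would need to be re-keyed to 0, 1, 2 before being passed as class_weight, as the ValueError quoted in the first post requires.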

Kiran_Sai_Ramineni · December 26, 2023, 6:49am

Hi @sollted_sollted. As you have 3 labels, this is a multi-class classification task, where the last dense layer should have as many neurons as there are labels. So you should use
model.add(Dense(3, activation='softmax'))

The tanh activation can be used for binary classification tasks where you have only 2 classes to be predicted. Thank you.
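To illustrate the point about the output layer, here is a small NumPy sketch (not Keras code) of what a 3-unit softmax produces: one probability per class, summing to 1, with the prediction being the index of the largest one. The logits and the index_to_label mapping are made-up assumptions, not values from the thread.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Hypothetical raw outputs (logits) of a 3-unit layer for two samples.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)

# Assumed mapping from output index back to the original labels.
index_to_label = np.array([-1, 0, 1])
predicted = index_to_label[np.argmax(probs, axis=-1)]

print(probs.sum(axis=-1))  # each row of probabilities sums to 1
print(predicted)
```

A single tanh unit, by contrast, gives one number between -1 and 1 per sample, so the three classes would have to share one scale instead of each getting its own probability.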

sollted_sollted · December 26, 2023, 7:29am

I really appreciate your quick answer and the explanation of the output layers, all clear now.
But one-hot encoding? Am I supposed to pass it to the compilation of the model or to the fitting of it?
Or are the actual labels supposed to be a 2D array, with ONE label looking like this: “[0, 0, 1]”?
Unfortunately I can't merge any classes.
If I give the model this:
Class weights = {
0: 0.342,
-1: 9.754,
1: 2.457
}
I get: Expected class_weight to be a dict with keys from 0 to one less than the number of classes, found {0: 0.342, -1: 9.754, 1: 2.457}

Thanks again for your time.

astle_dsa · December 26, 2023, 11:50am

One-hot encode your labels; you can use tf.keras.utils.to_categorical(labels, num_classes=3) for this purpose. It’ll convert them to vectors like the ones I mentioned above.
You can try the original dict, but with the weights I provided:

class_weights = { 0: 2.457, 1: 9.754, 2: 0.342 }
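One way to keep labels and weights aligned is to fix an explicit label-to-index mapping and derive both the encoded labels and the class_weight keys from it. The sketch below is a minimal NumPy illustration under assumptions: the sample labels and the mapping (-1 to 0, 0 to 1, 1 to 2) are made up, and np.eye stands in for a one-hot helper such as to_categorical. Any fixed mapping should work, as long as the weights use the same one.

```python
import numpy as np

# Hypothetical labels using the thread's original encoding (-1, 0, 1).
ys = np.array([0, 1, -1, 0, 0, 1])

# Assumed mapping: -1 -> 0, 0 -> 1, 1 -> 2, so every index is in [0, 3).
label_to_index = {-1: 0, 0: 1, 1: 2}
ys_index = np.array([label_to_index[int(y)] for y in ys])

# One-hot encoding: row i has a 1 in column ys_index[i].
ys_onehot = np.eye(len(label_to_index), dtype="float32")[ys_index]

# Weights quoted in the thread, keyed by original label...
weights_by_label = {0: 0.342, -1: 9.754, 1: 2.457}
# ...re-keyed by output index so the keys run from 0 to 2.
class_weight = {label_to_index[label]: w for label, w in weights_by_label.items()}

print(ys_index)
print(ys_onehot)
print(class_weight)
```

With integer targets like ys_index, the sparse_categorical_crossentropy loss from the first post can stay as it is. With one-hot targets like ys_onehot, the loss would be categorical_crossentropy instead. In both cases the re-keyed class_weight dict is what would be passed to model.fit.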