Improvements to the Tutorial: Classification on imbalanced data

Hi there
I’m new to this forum and don’t know where best to address the following topic.
The tutorial Classification on imbalanced data first uses a simple sequential net with sigmoid activation. It then proceeds with class weights and resampling techniques. But the last two plots of the tutorial, ROC and precision-recall, clearly show that, (almost) no matter which threshold is chosen, the first model outperforms the resampled/reweighted models on all metrics on the test set. So I have 3 questions:

  1. What is the justification for reweighting and resampling, given that they do not result in better models? The stats community does not seem to find a good reason either, see https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he.
  2. Would it make sense to split the logic of the tutorial into modelling class probabilities (implied by the sigmoid activation) and making a decision, i.e. choosing a threshold and predicting classes instead of class probabilities? (A sketch of this separation follows the list.)
  3. Would it make sense to emphasize the cross-entropy / log loss a bit more? Reasoning: the tutorial states that accuracy is not a helpful metric for imbalanced data, but it does not say which metric to prefer instead. Cross entropy, as a proper scoring rule, is a good metric for comparing models and finding out which one gives the best predictions of the class probabilities.
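To illustrate questions 2 and 3, here is a minimal sketch of that two-step separation using scikit-learn. The arrays `y_test`, `p_baseline` and `p_resampled` are hypothetical placeholders standing in for real labels and model outputs, not names from the tutorial:

```python
import numpy as np
from sklearn.metrics import log_loss, f1_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 1000)        # placeholder labels
p_baseline = rng.uniform(size=1000)      # placeholder probabilities, baseline model
p_resampled = rng.uniform(size=1000)     # placeholder probabilities, resampled model

# Step 1: compare the models as probability estimators with a proper
# scoring rule (cross entropy / log loss); lower is better.
for name, p in [("baseline", p_baseline), ("resampled", p_resampled)]:
    print(f"{name}: log loss = {log_loss(y_test, p):.4f}")

# Step 2: only afterwards turn probabilities into class decisions by
# choosing a threshold, e.g. the one maximising F1 (in practice this
# should be done on a validation set, not the test set).
thresholds = np.linspace(0.01, 0.99, 99)
best = max(thresholds, key=lambda t: f1_score(y_test, p_baseline >= t))
print(f"F1-optimal threshold for the baseline model: {best:.2f}")
```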

Hi @Christian_Lorentzen, in the tutorial they resample the dataset by oversampling the minority class, which increases the number of samples of that class. The oversampled data provides a smoother gradient signal: instead of each positive example being shown in one batch with a large weight, it is shown in many different batches, each time with a small weight. This makes it easier to train the model.
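A minimal sketch of this oversampling idea with the tf.data API; the `pos_`/`neg_` arrays are hypothetical placeholders, and the 50/50 sampling weights and batch size of 256 are assumptions, not values taken from the tutorial:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: a small positive (minority) class, a large negative one.
pos_features = np.random.normal(size=(100, 5)).astype("float32")
pos_labels = np.ones(100, dtype="float32")
neg_features = np.random.normal(size=(10_000, 5)).astype("float32")
neg_labels = np.zeros(10_000, dtype="float32")

# Build one dataset per class and repeat them indefinitely.
pos_ds = tf.data.Dataset.from_tensor_slices(
    (pos_features, pos_labels)).shuffle(10_000).repeat()
neg_ds = tf.data.Dataset.from_tensor_slices(
    (neg_features, neg_labels)).shuffle(10_000).repeat()

# Draw from both classes with equal probability, so every batch is roughly
# balanced: each positive example appears in many batches instead of once
# in a single batch with a large weight.
resampled_ds = tf.data.Dataset.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5]).batch(256)
```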

In such cases, the F1 score and ROC AUC (area under the receiver operating characteristic curve) are preferred metrics. Thank you.
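For example, both a threshold-free ranking metric and a threshold-dependent one can be computed with scikit-learn; `y_test` and `p_test` are again hypothetical placeholders for real labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 1000)    # placeholder labels
p_test = rng.uniform(size=1000)      # placeholder predicted probabilities

print("ROC AUC:", roc_auc_score(y_test, p_test))          # threshold-free
print("PR AUC :", average_precision_score(y_test, p_test)) # threshold-free
print("F1 at threshold 0.5:", f1_score(y_test, p_test >= 0.5))
```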