What is the effect of unbalanced training data?

I have read in a few places that it is important to have equal-sized sample sets. For example, if you were training an image-recognition network to differentiate between cats and dogs, what would be the effect of having 1,000 cat samples but only 100 dog samples?

Is there an (at least approximate) measure of how unbalanced sample sets affect a model?

If one sample set had more variation (let's say there were 10x more species of dogs than of cats), would you want to have more dog samples, or would it be better to keep the counts the same?


With a balanced dataset the model sees both classes equally often, so it learns the features of cats and dogs without a built-in bias toward either class and generalizes better to unseen data.

The effect of unbalanced sample sets varies with the task and the model, but the typical symptom is a bias toward the majority class: overall accuracy can look high while recall on the minority class is poor. That is also why per-class metrics (recall, F1, AUC) give a better picture of the damage than accuracy alone.
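To make that concrete, here is a small illustrative sketch (not from the original answer, and the 95:5 split and scikit-learn classifier are my own choices): train a simple model on synthetic imbalanced data with and without class weighting, and compare recall on the rare class.

```python
# Illustration: class imbalance hurts minority-class recall,
# and class weighting can mitigate it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95:5 two-class problem; class 1 is the minority.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the rare class: the fraction of true minority samples caught.
print("plain   :", recall_score(y_te, plain.predict(X_te)))
print("weighted:", recall_score(y_te, weighted.predict(X_te)))
```

The weighted model trades a little majority-class accuracy for substantially better minority-class recall, which is usually the trade-off you want when the classes matter equally.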

Having more samples from the class with more variation helps the model cover the different species of that class. If you then balance by downsampling the dog set to the cat count, you throw that extra variation away; it is usually better to keep the full dog set and correct the imbalance another way, for example by oversampling the cat samples or weighting the classes.
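As a hypothetical sketch of the oversampling option (the 1,000/100 counts and index layout are assumptions for illustration): balance the training set by randomly resampling the minority class with replacement, so the majority class's variation is preserved.

```python
# Balance a 1,000-dog / 100-cat index set by oversampling the
# minority (cat) class instead of downsampling the dogs.
import numpy as np

rng = np.random.default_rng(0)
dog_idx = np.arange(1000)           # majority-class sample indices
cat_idx = np.arange(1000, 1100)     # minority-class sample indices

# Draw cat indices with replacement until they match the dog count.
cat_over = rng.choice(cat_idx, size=len(dog_idx), replace=True)
balanced = np.concatenate([dog_idx, cat_over])
rng.shuffle(balanced)

print(len(balanced))  # 2000 training indices, 1000 per class
```

Random oversampling repeats minority samples, so pair it with augmentation where possible to avoid overfitting to the repeated cats.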
