In each epoch, is it necessary to go through every sample in the training data at least once?

Hi, new here. I'm trying to write a custom training loop from scratch, and I have a general question about epochs.

Let’s say I have a function that randomly extracts a batch of a given size from the training data. Something like this:

import numpy as np

def random_batch(X, y, batch_size=32):
    # draw batch_size random indices (with replacement) and return that batch
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

Let’s say I will train for 50 epochs, and within each epoch I will call the above function 1,000 times, because the training set has 32,000 samples and the batch size is 32.
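For reference, here is a minimal sketch of the loop I have in mind (train_step is just a placeholder for whatever forward pass / gradient update I end up using):

n_epochs = 50
batch_size = 32
steps_per_epoch = len(X) // batch_size   # 32,000 / 32 = 1,000 steps

for epoch in range(n_epochs):
    for step in range(steps_per_epoch):
        X_batch, y_batch = random_batch(X, y, batch_size)
        train_step(X_batch, y_batch)   # placeholder: compute loss, apply gradients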

My question is: for each epoch, is it necessary to ensure that EVERY sample in the training data goes through the network once? Or is it okay to just randomly select 32 samples at each step? In other words, do I need an additional step in the code that drops the 32 samples from the training data after they have gone through the network, so that in later steps those samples won’t be drawn again? That would ensure every sample is seen once per epoch. Or is this not necessary?
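To make the alternative concrete, here is a rough sketch of what I mean by "dropping" samples, i.e. sampling without replacement within an epoch (shuffling the indices once per epoch rather than actually deleting rows; train_step is again a placeholder):

def epoch_batches(X, y, batch_size=32):
    # shuffle the indices once, then walk through them in order,
    # so every sample appears exactly once per epoch
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = idx[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

for epoch in range(50):
    for X_batch, y_batch in epoch_batches(X, y, batch_size=32):
        train_step(X_batch, y_batch)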

Thanks and sorry for the newbie question!


Well, it’s not a newbie question at all. I am sharing relevant snippets from the ML Design Patterns book. See if that makes sense.

[screenshots of the relevant passages from the book]


In general, sampling with or without replacement is still quite an open topic for neural networks trained with SGD or other optimizers.

See also:


Thanks for the links – they are very helpful. I also found this paper that studied this question, and from the abstract:

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled with replacement. In contrast, sampling without replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better.
