How to fit a large dataset to a model?

When I have a small dataset, I fit my model like this:

model.fit(
    train,
    train_labels,
    epochs=200,
    validation_split=0.2,
    batch_size=100,
    callbacks=[es],
    use_multiprocessing=True
)

but now I can't load the whole training set into memory at once because it's too large. How can I fit the training set to the model part by part?
(I can only load train_part_1, train_part_2, and train_part_3 from disk separately.)

Hi Ashley,

What you are trying to do is essentially use batch_size properly.

If you build your data pipeline with tf.data.Dataset (tf.data.Dataset | TensorFlow Core v2.8.0), it will load the data from disk for you and feed it to the model in chunks that fit in memory. The size of those chunks is, of course, up to you to define.

This tutorial gives more insight: Better performance with the tf.data API | TensorFlow Core
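
For example, here is a rough sketch of streaming the data from your part files with tf.data.Dataset.from_generator, so only one chunk is in memory at a time (the file names, image shape, and dtypes below are placeholders, not taken from your setup):

import numpy as np
import tensorflow as tf

# Placeholder part files; substitute your own paths
part_files = ['train_part_1.npy', 'train_part_2.npy', 'train_part_3.npy']
label_files = ['labels_part_1.npy', 'labels_part_2.npy', 'labels_part_3.npy']

def generate_samples():
    # Load one part at a time so only a single chunk sits in memory
    for feature_path, label_path in zip(part_files, label_files):
        features = np.load(feature_path)
        labels = np.load(label_path)
        for x, y in zip(features, labels):
            yield x, y

dataset = tf.data.Dataset.from_generator(
    generate_samples,
    output_signature=(
        tf.TensorSpec(shape=(224, 224, 3), dtype=tf.float32),  # assumed image shape
        tf.TensorSpec(shape=(), dtype=tf.int32),                # assumed scalar label
    ))

# Batch and prefetch so the model receives memory-sized chunks
dataset = dataset.batch(100).prefetch(tf.data.AUTOTUNE)
model.fit(dataset, epochs=200, callbacks=[es])

With this pattern, model.fit just iterates over the dataset, and the chunk size is controlled by the .batch() call instead of the batch_size argument.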

Thanks for your reply!! I tried several ways to load my data with tf.data.Dataset, but no luck 😿
I have all my resized images saved as .npy files, and this is what I was trying:

import glob
import numpy as np
import tensorflow as tf

def map_func(feature_path):
    # Load a single preprocessed image from its .npy file
    feature = np.load(feature_path)
    return feature

feature_paths = glob.glob('./*.np[yz]')

dataset = tf.data.Dataset.from_tensor_slices(feature_paths)

# Use map to load the numpy files in parallel
dataset = dataset.map(
    lambda item: tf.numpy_function(map_func, [item], tf.float16),
    num_parallel_calls=tf.data.AUTOTUNE)

print(dataset)

but I can't really understand how I would fit such a dataset to the model. I have labels for each .npy file in a separate array, but as I understand it, the labels should be included in the dataset somehow(?), because when I try to pass them the usual way it throws an error: ValueError: y argument is not supported when using dataset as input.

and without labels I get ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0'].

Could you please advise on how to add labels to the dataset properly?
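
This is roughly the direction I've been trying (the labels array and the helper below are just placeholders for my real label array, not my actual code), but I'm not sure it's the right way to pair them up:

# Placeholder labels, one per .npy file, in the same order as feature_paths
labels = np.zeros(len(feature_paths), dtype=np.int32)

def map_func_with_label(feature_path, label):
    feature = np.load(feature_path)
    return feature, label

# Slice paths and labels together so each element is a (path, label) pair
dataset = tf.data.Dataset.from_tensor_slices((feature_paths, labels))
dataset = dataset.map(
    lambda path, label: tf.numpy_function(
        map_func_with_label, [path, label], (tf.float16, tf.int32)),
    num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(100)

# The dataset now yields (features, labels) together, so model.fit(dataset, ...)
# shouldn't need a separate y argument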

What I'd do is look at how a simple model handles this, like this one: Load and preprocess images | TensorFlow Core

Maybe structure the folders so that files from a specific class go into a directory with that name, like the flowers dataset does, and then use the same strategy.
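
Something along these lines, with the labels inferred from the directory names (the directory path and image size here are just examples, not from your setup):

import tensorflow as tf

# Assumed layout: ./images/class_a/..., ./images/class_b/..., etc.
train_ds = tf.keras.utils.image_dataset_from_directory(
    './images',
    validation_split=0.2,
    subset='training',
    seed=42,
    image_size=(224, 224),
    batch_size=100)

val_ds = tf.keras.utils.image_dataset_from_directory(
    './images',
    validation_split=0.2,
    subset='validation',
    seed=42,
    image_size=(224, 224),
    batch_size=100)

model.fit(train_ds, validation_data=val_ds, epochs=200, callbacks=[es])

Note that this expects regular image files (JPEG, PNG, etc.) rather than .npy arrays, so it fits best if you keep the images in their original format.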


I tried this (though I couldn't find a way to load my .zip file of images from local disk, so I had to upload it to Google Drive to use the get_file function), but it only allowed me to download the archive, not to unzip and load it (extract=True doesn't work in my case).

Can you try uploading just a portion of the data raw (not zipped), just to test the pipeline?


Yes, thanks! It started working after moving the files into per-class directories.


Perfect, glad that it worked and that this was helpful.
