TensorFlow Dataset Input Pipelines

I am trying to train on a dataset in TensorFlow, but the dataset is too heavy for my RAM/GPU memory, so I need to apply a tf.data pipeline to it and load it from an HDF5 file per batch. However, train_test_split does not accept the HDF5 format directly, so I need to convert it to NumPy first. But when I try to convert the whole dataset (which is two arrays, trainX and trainY) to NumPy so that train_test_split can read it, it is still too heavy for my RAM.
So what is the correct syntax and code for reading the data from a file (such as HDF5) so that the model reads it per batch instead of loading it all into RAM, then splitting it into train and test sets, making tensor slices from it, and applying repeat, shuffle, and batching to it?
The data is saved in an HDF5 file with two arrays in it, trainX and trainY (i.e. the data and its ground-truth values). I want to apply some pipelining techniques to it so that it reads data per batch and then performs the following operations. I have read about TFRecordDataset but still can't figure out how to implement it in my case:
"
trX, teX, trY, teY = train_test_split(trainX , trainY,
test_size = .1, random_state = 42)
train_data = tf.data.Dataset.from_tensor_slices((trX, trY))
train_data = train_data.repeat().shuffle(buffer_size=500,
seed= 8).batch(batch_size).prefetch(1)
"

These are the steps I am currently applying to the whole dataset; how do I pipeline them so the data is read per batch?


Hi @Saran_Zeb

You can load the .hdf5 dataset as shown below and use it in a tf.data input pipeline for model training. (Here, I have taken the MNIST dataset as the example.)

import h5py
import tensorflow as tf

# Path to the HDF5 file
test_filename = "/content/test.hdf5"

# Define function to read data from HDF5 file in batches
def load_h5py_data(filename, batch_size=32):
  with h5py.File(filename, 'r') as f:
    testX = f['image'][:]  # note: [:] reads the full array into memory as NumPy
    testY = f['label'][:]
    print(testX.shape)

    # Calculate number of batches
    num_batches = testX.shape[0] // batch_size
    print(num_batches)

    # Define a dataset for efficient batching
    dataset = tf.data.Dataset.from_tensor_slices((testX, testY))
    dataset = dataset.repeat()
    dataset = dataset.shuffle(buffer_size=500, seed=42)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(1)

    return dataset, num_batches

# Build the batched dataset from the HDF5 file
test_dataset, num_batches = load_h5py_data(test_filename)

Output:

(10000, 28, 28)
312
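
A possible follow-up, not part of the answer above: the code above still pulls the full arrays into memory with [:]. If even a single copy of the arrays does not fit in RAM, one option is to stream rows straight from the HDF5 file with tf.data.Dataset.from_generator and split into train/test by row index instead of with train_test_split. Below is a minimal sketch, assuming a hypothetical /content/train.hdf5 path and the same 'image'/'label' keys as the MNIST example (swap in trainX/trainY for your own file):

import h5py
import numpy as np
import tensorflow as tf

train_filename = "/content/train.hdf5"  # hypothetical path, adjust to your file
batch_size = 32

# Peek at the file once to get shapes/dtypes and build a train/test split
# of row indices, so the arrays themselves never have to fit in RAM
with h5py.File(train_filename, 'r') as f:
  num_samples = f['image'].shape[0]
  x_spec = tf.TensorSpec(shape=f['image'].shape[1:], dtype=tf.as_dtype(f['image'].dtype))
  y_spec = tf.TensorSpec(shape=f['label'].shape[1:], dtype=tf.as_dtype(f['label'].dtype))

rng = np.random.default_rng(42)
indices = rng.permutation(num_samples)
split = int(num_samples * 0.9)        # 90% train / 10% test
train_idx = np.sort(indices[:split])  # sorted indices keep the disk reads sequential
test_idx = np.sort(indices[split:])

def make_generator(filename, idx):
  # Reopen the file inside the generator so the dataset can be iterated
  # repeatedly; each call yields one (x, y) pair read from disk
  def gen():
    with h5py.File(filename, 'r') as f:
      for i in idx:
        yield f['image'][i], f['label'][i]
  return gen

train_data = (
    tf.data.Dataset.from_generator(make_generator(train_filename, train_idx),
                                   output_signature=(x_spec, y_spec))
    .repeat()
    .shuffle(buffer_size=500, seed=8)
    .batch(batch_size)
    .prefetch(1)
)

test_data = (
    tf.data.Dataset.from_generator(make_generator(train_filename, test_idx),
                                   output_signature=(x_spec, y_spec))
    .batch(batch_size)
)

Reading one row at a time from HDF5 is slower than slicing in-memory arrays, so this trades throughput for memory; yielding contiguous chunks of rows per step is a common middle ground.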