TensorFlow project problem

Hello, I have been stuck on my project for more than a month. I hope I can get some guidance from experienced people.

I. My hardware
CPU: Intel Xeon E5-2678 v3 2.50GHz
RAM: 256GB
GPU: 2 x GeForce RTX 2080 Ti 11GB

II. Overview model

           x1                               x2
        backbone                         backbone
             --------- concatenation --------
                            |
                   Fully Connected Layer
                            |
                          output

III. My code
Data preparation:

import os
import platform

import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator


class Data:
    data = []
    init = False
    datagen = ImageDataGenerator(rescale=1./255.)

    # initialize
    def __init__(self, path, img_size=(320, 320)):
        all_file = os.listdir(path)  # one sub-folder per image pair
        # load the image pairs
        data1 = []
        data2 = []
        label = []
        for i in all_file:
            # skip hidden entries such as .DS_Store on macOS
            if platform.system() == 'Darwin' and i.startswith('.'):
                continue
            pair_dir = os.path.join(path, i)
            # sort so the two images are always read in the same order
            temp_path = sorted(os.listdir(pair_dir))
            temp_path.remove('label.txt')
            # close the label file once it has been read
            with open(os.path.join(pair_dir, 'label.txt'), "r") as f:
                label.append(int(f.read()))
            data1.append(cv2.resize(cv2.imread(os.path.join(pair_dir, temp_path[0])), img_size))
            data2.append(cv2.resize(cv2.imread(os.path.join(pair_dir, temp_path[1])), img_size))

        self.data = np.array([data1, data2])
        self.label = np.array(label)
        self.init = True

    def load_data_generator(self, b_size):
        if not self.init :
            raise Exception('Data need to be initialized first')
        # print(np.shape(self.data))
        # generator = self.datagen.flow(x = part_data,y = part_label, batch_size=8)
        
        genX1 = self.datagen.flow(x=self.data[0],
                                  y=self.label,
                                  batch_size=b_size,
                                  shuffle=False,
                                  seed=7)

        genX2 = self.datagen.flow(x=self.data[1],
                                  y=self.label,
                                  batch_size=b_size,
                                  shuffle=False,
                                  seed=7)
        while True:
            X1i = next(genX1)
            X2i = next(genX2)
            yield ([X1i[0], X2i[0]], X2i[1])

I have experience with ImageDataGenerator for a single input. However, with 2 inputs I am still confused about how to prepare the data for this model. I hope to receive some advice on this problem.
Model
My python: 3.11.4
My tensorflow: 2.13 (WSL: Ubuntu)
I installed TensorFlow following https://www.tensorflow.org/install/pip

My code of model:

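import tensorflow as tf
from tensorflow.keras.applications import ResNet101
from tensorflow.keras.layers import Dense, Flatten, concatenate
from tensorflow.keras.models import Model

if __name__ == '__main__':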
    print(tf.__version__)
    strategy = tf.distribute.MirroredStrategy()
    print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

    with strategy.scope():
        resnet_1 = ResNet101(input_shape = (320, 320, 3), 
                                        include_top = False, 
                                        weights = None)
        resnet_2 = ResNet101(input_shape = (320, 320, 3), 
                                        include_top = False, 
                                        weights = None)
        x = resnet_1.layers[-2].output
        y = resnet_2.layers[-2].output
        #fix duplicate name
        for layer in resnet_1.layers :
            layer._name = layer.name + str('_1')
        for layer in resnet_2.layers :
            layer._name = layer.name + str('_2')
        # combine the output of the two branches
        combined = concatenate([x, y])
        # apply a FC layer and then a regression prediction on the
        # combined outputs
        z = Flatten()(combined)
        z = Dense(8, activation="relu")(z)
        z = Dense(1, activation="sigmoid")(z)
        # our model will accept the inputs of the two branches and
        # then output a single value
        model = Model(inputs=[resnet_1.input, resnet_2.input], outputs=z)
        model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer='adam')

        # `data` is a Data instance created earlier (construction not shown)
        model.fit(data.load_data_generator(100), batch_size = 16,
                        epochs=20)

It throws the following errors:

3 root error(s) found.
  (0) RESOURCE_EXHAUSTED:  OOM when allocating tensor with shape[50,256,80,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/conv2_block3_3_conv_2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

         [[update_0/AssignAddVariableOp/_927]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

         [[div_no_nan/ReadVariableOp_3/_912]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED:  OOM when allocating tensor with shape[50,256,80,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/conv2_block3_3_conv_2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

         [[update_0/AssignAddVariableOp/_927]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (2) RESOURCE_EXHAUSTED:  OOM when allocating tensor with shape[50,256,80,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model/conv2_block3_3_conv_2/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_217696]
2023-09-14 01:54:14.566063: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]

I hope I can get some help. This project is important to me. Thanks for reading my post.

An OOM error means your GPU doesn’t have enough memory to train this model.

To get around this, you can try reducing the batch size (8, 4 or even 2), upgrading your GPU, or creating a VM on a cloud platform specifically for heavy machine learning workflows.

I don’t know the specifics of ResNet101 very well, but a quick Google search indicated that 8 GB of memory should be enough to “run” a single ResNet. However, you are using 2 ResNets and you want to train them, which typically requires much more memory than prediction because the activations have to be kept for backprop. So it seems logical that your model simply doesn’t fit on your GPU with that batch size.
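
To put a rough number on it, here is a small sketch that computes the size of the one tensor named in the error message, shape[50,256,80,80]. I am assuming that "type float" in the log means float32 (4 bytes per element); `tensor_mib` is just a helper name I made up. The leading 50 also suggests that each of the two GPUs received half of the generator's batch of 100, not the 16 passed to fit, but that is my reading of the log and not something I have verified.

```python
# Back-of-the-envelope size of the tensor named in the OOM message.
# Assumption: "type float" in the log means float32 (4 bytes per element).
from math import prod

BYTES_PER_FLOAT32 = 4


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size of a dense tensor with the given shape, in MiB."""
    return prod(shape) * bytes_per_element / 2**20


# Shape reported in the log: 50 samples per GPU, 256 channels, 80 x 80 map.
oom_shape = (50, 256, 80, 80)
print(f"{oom_shape}: {tensor_mib(oom_shape):.1f} MiB")

# The same activation for the smaller per-GPU batch sizes suggested above.
for per_gpu_batch in (8, 4, 2):
    shape = (per_gpu_batch, 256, 80, 80)
    print(f"{shape}: {tensor_mib(shape):.1f} MiB")
```

This is only a single activation tensor. Training keeps many of them alive for backprop, in both ResNet branches, so the total is far larger than this one number, and it shrinks roughly in proportion to the batch size.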

Hi @Thien_Tan ,

It seems like you’re encountering “Out of Memory” (OOM) errors when training your model. An OOM error occurs when the GPU does not have enough free memory for the model, the batch of data and the intermediate tensors of the computation.

Here are a few suggestions to address this issue:

  1. Reduce Batch Size
  2. Resize Images
  3. Use a Subset of Data for training
  4. Clear GPU Memory: Make sure to clear the GPU memory after each training epoch to release any unnecessary memory.
  5. Use Mixed Precision: Utilize mixed precision training to reduce memory requirements.
  6. Reduce Model Complexity: If possible, simplify your model architecture to reduce the number of parameters and memory usage.
  7. Limit Data Loading: Load and preprocess data on-the-fly within your data generator instead of loading all the data into memory at once.
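
As a rough sketch of point 7 for a two-input model, a plain Python generator can load each image pair only when its batch is requested. The names `paired_batches`, `load_image` and `fake_loader` are hypothetical and not part of Keras. The stand-in loader is only there so the sketch runs without image files; in practice it could be a wrapper around cv2.imread and cv2.resize.

```python
# Sketch: load each image pair only when its batch is requested.
# `paired_batches` and `load_image` are hypothetical names, not Keras APIs.
import numpy as np


def paired_batches(pairs, labels, batch_size, load_image):
    """Yield ((x1, x2), y) batches forever, loading images lazily.

    pairs:      list of (path_1, path_2) tuples, one per sample
    labels:     list of integer labels, aligned with `pairs`
    load_image: callable mapping a path to an HxWx3 uint8 array
    """
    n = len(pairs)
    while True:
        for start in range(0, n, batch_size):
            chunk = pairs[start:start + batch_size]
            # Scale to [0, 1], mirroring rescale=1./255. in the question.
            x1 = np.stack([load_image(a) for a, _ in chunk]).astype("float32") / 255.0
            x2 = np.stack([load_image(b) for _, b in chunk]).astype("float32") / 255.0
            y = np.asarray(labels[start:start + batch_size], dtype="float32")
            yield (x1, x2), y


# Stand-in loader so the sketch runs without any image files on disk.
def fake_loader(path):
    return np.full((320, 320, 3), 255, dtype=np.uint8)


gen = paired_batches([("a1", "a2"), ("b1", "b2"), ("c1", "c2")], [0, 1, 1], 2, fake_loader)
(x1, x2), y = next(gen)
print(x1.shape, x2.shape, y.shape)
```

Only one batch of images is held in memory at a time. Since the generator never ends on its own, it would need to be passed to model.fit together with steps_per_epoch.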

I hope the above tips help you solve the error.

Thanks.
