Multi-GPU doesn't work for model(inputs) nor when computing the gradients


When using multiple GPUs to perform inference on a model (e.g. the call method: model(inputs)) and calculate its gradients, the machine only uses one GPU, leaving the rest idle.

For example, in the code snippet below:

import tensorflow as tf
import numpy as np
import os

# Make the tf-data
path_filename_records = 'your_path_to_records'
bs = 128

dataset =
dataset = (dataset

# Load model trained using MirroredStrategy
path_to_resnet = 'your_path_to_resnet'
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    resnet50 = tf.keras.models.load_model(path_to_resnet)

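for pre_images, true_label in dataset:
    with tf.GradientTape() as tape:
        # Inputs are not variables, so the tape must watch them explicitly
        tape.watch(pre_images)
        outputs = resnet50(pre_images)
    # Compute the gradients after leaving the tape context
    grads = tape.gradient(outputs, pre_images)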

Only one GPU is used; you can check GPU utilization with nvidia-smi. I don’t know whether this is the intended behavior, i.e. that neither model(inputs) nor tape.gradient has multi-GPU support. If it is, that’s a big problem: with a large dataset where you need the gradients with respect to the inputs (e.g. for interpretability purposes), it might take days on a single GPU.
Another thing I tried was using model.predict(), but this isn’t possible with tf.GradientTape.

Here’s a working example of it. I ask whoever has more than one GPU to try it and check whether the issue persists.

Notebook: Working Example.ipynb

Saved Model:

And the Saved Format version of it. For some reason the forum doesn’t allow for new users to add more than 2 links…

Solution found! keras - Tensorflow - Multi-GPU doesn’t work for model(inputs) nor when computing the gradients - Stack Overflow

I would also suggest checking this out:

Ideally, you would want to encapsulate your loop inside a subclassed model (tf.keras.Model) by overriding the train_step() method.

Consider the GAN example here. See how the loop is implemented inside train_step(). You could simply initialize the subclassed model within the scope and make use of multiple GPUs. I have done this on multiple occasions and it has only simplified my workflow.

Hi @Sayak_Paul, thanks for sharing the links!

The problem is at inference time. Sure, there is a lot of good documentation, like the TensorFlow Distributed Training or the Keras ones that you linked above, but all of it demonstrates how to make use of multiple GPUs at training time.

One of the things I tried was to create a @tf.function and use the run method of the MirroredStrategy, but this didn’t work out: I was getting lots of Nones for reasons I couldn’t understand.
What the StackOverflow solution points out is the use of the strategy’s gather method (strategy.gather), which joins the per-replica outputs back together along the first dimension, as Laplace Ricky points out.
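
To illustrate what gathering along the first dimension means, here is a framework-free sketch using NumPy. It is only an analogy for what the strategy does: the two "replicas", the input_gradients helper, and the toy function sum(x**2) are made up for illustration and are not from the notebook.

```python
import numpy as np


def input_gradients(batch):
    """Gradient of sum(x**2) with respect to x, which is 2 * x per example."""
    return 2.0 * batch


rng = np.random.default_rng(0)
global_batch = rng.normal(size=(8, 3))  # 8 examples, 3 features each

# Split the global batch into one shard per (hypothetical) replica.
num_replicas = 2
shards = np.split(global_batch, num_replicas, axis=0)

# Each replica computes gradients for its own shard only.
per_replica_grads = [input_gradients(shard) for shard in shards]

# "Gathering" concatenates the per-replica results along axis 0,
# restoring one gradient row per example of the global batch.
gathered = np.concatenate(per_replica_grads, axis=0)

# Same result as computing the gradients on the whole batch at once.
same_as_single_device = np.allclose(gathered, input_gradients(global_batch))
```

Because the gradient of each example depends only on that example, splitting the batch and concatenating the results gives the same array as a single-device run.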

As far as I know, this is the first solution I could find that works at inference time.

But why would you want to compute gradients during inference?

If your test_step() and predict() are implemented correctly, you should be able to do:

with strategy.scope():
    preds = model.predict(images)

And this should be utilizing multiple GPUs. I can confirm this from experience.

Also, the second link I provided walks you through the steps of how to simplify the process of writing training and inference loops when customizations are required.

I work on machine learning interpretability, and most of the methods rely on computing gradients, whether with respect to the inputs or to some feature maps [1, 2].

To calculate gradients in TensorFlow you need to use tf.GradientTape, and inside the tape you have to run the model's forward pass through __call__, i.e. model(inputs).

This is 100% true :smiley:
The issue is that you can’t make use of model.predict inside the GradientTape.
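
A minimal sketch of the difference, assuming a toy Dense model in place of the ResNet (the model, shapes, and variable names are illustrative and not from the thread). Calling the model directly inside the tape returns a tensor that the tape can differentiate, whereas model.predict returns a NumPy array, which carries no link back to the inputs.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the real model (assumption for illustration only).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
x = tf.random.normal((8, 4))

with tf.GradientTape() as tape:
    # x is a plain tensor, not a variable, so it must be watched explicitly.
    tape.watch(x)
    y_call = model(x, training=False)  # forward pass recorded by the tape

# Gradients of the outputs with respect to the inputs.
grads = tape.gradient(y_call, x)

# predict() runs its own batched loop and hands back a NumPy array,
# so there is nothing for a GradientTape to differentiate through.
y_pred = model.predict(x, verbose=0)
```

Here grads has the same shape as x, while y_pred is a plain np.ndarray.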

That being said, the second link, which covers inference loops with customizations, really isn’t applicable in this scenario (I think).

Nevertheless, I appreciate all the help and sharing those links. I didn’t know it was possible to do such things in TensorFlow and they will certainly be useful in the future :smile:

Oh, I see. This now has the full context.

What you can maybe do is implement a separate function within your subclassed model that returns the desired gradients and have it invoked within the strategy scope. I haven’t tried this method myself so cannot confirm its effectiveness but I still think it’s worth trying.

To distribute your computation, you will need to use the strategy's run method; here is the snippet (from your Colab notebook) that shows this. It will use both GPUs to run the computation.

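def _inference_step(data):
    images, labels = data
    with tf.GradientTape() as tape:
        # Watch the inputs so the tape records operations on them
        tape.watch(images)
        outputs = model(images, training=False)
    # Gradients of the outputs with respect to the input images
    grads = tape.gradient(outputs, images)
    return grads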

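def _distributed_inference_step(data):
    # Run the step on each replica (assumed to be strategy.run, per the run method mentioned earlier)
    grads = strategy.run(_inference_step, args=(data,))
    # gather values from all replicas
    return strategy.gather(grads, axis=0)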

for pre_images, true_label in eval_dataset:
    grads = _distributed_inference_step((pre_images, true_label))
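
For anyone who wants to try the pattern end to end, here is a self-contained sketch of it. It assumes a toy Dense model and random data in place of the ResNet and the TFRecords, and it assumes the dataset is distributed with strategy.experimental_distribute_dataset; the sizes and names such as dist_dataset are illustrative. On a machine without GPUs, MirroredStrategy falls back to a single device, so the code still runs but nothing is split.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # all visible GPUs, or one CPU device

with strategy.scope():
    # Toy stand-in for the trained model (assumption for illustration only).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(3),
    ])

# Random data standing in for the real images and labels.
global_batch_size = 8
features = tf.random.normal((32, 4))
labels = tf.zeros((32,), dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.batch(global_batch_size, drop_remainder=True)

# Each global batch is split across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)


def _inference_step(data):
    images, _ = data
    with tf.GradientTape() as tape:
        tape.watch(images)  # inputs are not variables, so watch them
        outputs = model(images, training=False)
    return tape.gradient(outputs, images)


@tf.function
def _distributed_inference_step(data):
    # Run the step on every replica, then concatenate along the batch axis.
    per_replica_grads = strategy.run(_inference_step, args=(data,))
    return strategy.gather(per_replica_grads, axis=0)


all_grads = [_distributed_inference_step(batch) for batch in dist_dataset]
grads = tf.concat(all_grads, axis=0)  # one gradient row per input example
```

With two or more GPUs, nvidia-smi should show activity on each of them while the loop runs.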