Multi-GPU inference - am I doing it right?

Hi everyone. My first post here :slight_smile:

So I’ve found a way to run my inference on my 2 GPUs, but it takes as much time as running it on 1. I am quite new to TF and multi-GPU setups, so I reckon I need some help to:

  • understand whether I am going in the right direction
  • possibly improve what’s been done so far

My inputs are 3D volumes, reshaped to (16, 16, 16, 64, 64, 64). Here is the code:

#multiGPU try 1
model_path = pathLogDir+'/'+folder_name+ "/"+weights_name

json_file = open(model_path+".json", 'r')
#json_file = open(newpath+'model.json', 'r')
loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

new_patches = np.reshape(patches, (-1, number_patchify, number_patchify, number_patchify))

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

outputs = []
new_dimension = 3

print("patches shape: " + str(patches.shape))
dataset =
distributed_dataset = strategy.experimental_distribute_dataset(dataset)

print("number of batches: " + str(

def inference_step(inputs):
    # Run the forward pass on the model
    predicted =, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, predicted, axis=None)

total_time = 0.0
begin = time.time()

for batch in distributed_dataset:
    tensors_tuple = batch.values
    batch_array = tf.identity(tensors_tuple)
    batch_array = np.reshape(batch_array, (-1, number_patchify, number_patchify, number_patchify)) 
    input_array_with_new_dim = batch_array[..., np.newaxis]
    output_array = np.repeat(input_array_with_new_dim, 3, axis=-1)
    predicted = inference_step(output_array.astype(np.float32))
    predicted = predicted >= 0.5

total_time += time.time() - begin

print('Total time on 2 GPUs: ', total_time)

#outputs = np.concatenate(outputs)
outputs = np.array(outputs)

# Reshape the concatenated outputs to match the original image

outputs_reshaped = np.reshape(outputs, (patches.shape[0],patches.shape[1],patches.shape[2],number_patchify,number_patchify,number_patchify))
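The reshaping steps in the snippet above can be checked in isolation with NumPy before any GPU is involved. This is a minimal sketch, not the original pipeline: the small `grid` and `number_patchify` values and the `fake_pred` stand-in for model output are assumptions made to keep it fast. It also collects per-batch results with `np.concatenate`, since the posted loop never appends to `outputs`.

```python
import numpy as np

number_patchify = 8   # hypothetical small patch edge; the post uses 64
grid = (2, 2, 2)      # hypothetical patch grid; the post uses (16, 16, 16)

# Stand-in for the patchified volume: (gz, gy, gx, p, p, p)
patches = np.random.rand(*grid, number_patchify, number_patchify,
                         number_patchify).astype(np.float32)

# Flatten the patch grid into one batch axis: (N, p, p, p)
flat = patches.reshape(-1, number_patchify, number_patchify, number_patchify)

# Add a channel axis and repeat it 3 times, as in the loop above: (N, p, p, p, 3)
model_input = np.repeat(flat[..., np.newaxis], 3, axis=-1)

# Stand-in for thresholded predictions, one value per voxel: (N, p, p, p)
fake_pred = model_input[..., 0] >= 0.5

# Collect results batch by batch, then restore the original 6-D layout
batch_size = 4
outputs = [fake_pred[i:i + batch_size] for i in range(0, len(fake_pred), batch_size)]
outputs_reshaped = np.concatenate(outputs).reshape(patches.shape)
```

`np.concatenate` joins batches along the first axis, so it also works when the last batch is smaller than the others, which `np.array(outputs)` would not handle.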

Predictions are working: I get my volumes predicted and both my GPUs are used, but it takes as much time as if only 1 was used. Also, my GPUs are running at 40-50% utilization at most (this depends heavily on the batch size, of course). I am really reaching the end of my knowledge here, and can’t seem to find anything online after weeks of searching.

Thanks in advance, everyone :slight_smile:

Hi @FloFive ,

Please try the following steps:

  1. Experiment with different batch sizes
    1.1. Use mixed precision
    1.2. Use a larger model
  2. Profile your code
  3. Experiment with different distribution strategies
  4. Use a data format that is optimized for GPUs
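The first suggestion can be tried with a small timing loop. The sketch below is only illustrative: the tiny Conv3D model, the random `data`, and the batch sizes are all assumptions standing in for the real network and volumes.

```python
import time
import numpy as np
import tensorflow as tf

# Hypothetical tiny 3D model standing in for the real network
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 8, 3)),
    tf.keras.layers.Conv3D(4, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv3D(1, 1, activation="sigmoid"),
])

# Hypothetical input: 32 patches of 8x8x8 voxels with 3 channels
data = np.random.rand(32, 8, 8, 8, 3).astype(np.float32)

timings = {}
for batch_size in (2, 8, 32):
    # Warm-up call so one-time graph tracing is not included in the timing
    model.predict(data[:batch_size], batch_size=batch_size, verbose=0)
    start = time.perf_counter()
    model.predict(data, batch_size=batch_size, verbose=0)
    timings[batch_size] = time.perf_counter() - start

for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size}: {seconds:.3f} s")
```

Mixed precision can be tried by calling `tf.keras.mixed_precision.set_global_policy("mixed_float16")` before the model is built; it mainly pays off on GPUs with Tensor Cores.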

It’s difficult to say exactly why your current implementation is not achieving the desired performance without more information about your specific use case and system setup.

Please let me know if it helps you.


Hi and thank you for the kind reply !

1.1 and 1.2 have been tested. They don’t affect the prediction time much (and increasing the batch size gets me an OOM error on the very large volumes, which makes sense).

I will plan on doing 2. Any recommendation on how to do it correctly?
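For reference, one way to profile an inference loop is the TensorFlow profiler API. This is a minimal sketch under assumptions: the tiny model, the random `data`, and the temporary `logdir` are placeholders, and viewing the trace requires TensorBoard with the profiler plugin installed.

```python
import tempfile
import numpy as np
import tensorflow as tf

logdir = tempfile.mkdtemp()  # hypothetical location for the profile logs

# Hypothetical tiny 3D model standing in for the real network
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 8, 3)),
    tf.keras.layers.Conv3D(4, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv3D(1, 1, activation="sigmoid"),
])

@tf.function
def predict_step(x):
    # Forward pass only, compiled into a graph
    return model(x, training=False)

data = np.random.rand(16, 8, 8, 8, 3).astype(np.float32)

# Warm-up so graph tracing does not dominate the profile
predict_step(data[:4])

tf.profiler.experimental.start(logdir)
for step, i in enumerate(range(0, len(data), 4)):
    # Mark each batch as one step so it shows up separately in the trace
    with tf.profiler.experimental.Trace("inference", step_num=step, _r=1):
        predict_step(data[i:i + 4])
tf.profiler.experimental.stop()

print("Profile written to", logdir)
```

Running `tensorboard --logdir <logdir>` and opening the Profile tab should then show how much time goes to GPU kernels versus input handling on the host.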

I think MirroredStrategy is the one that I need. Again, if you’ve got a good tutorial, I’d take it. Maybe I got confused about how strategies work and which one is best suited to my case.
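For reference, here is a minimal sketch of inference with MirroredStrategy; it is not the original code, and the tiny model, random `data`, and batch sizes are assumptions. Judging only by the posted loop, the batch is converted back to a single NumPy array before the forward pass; a plain array passed to `strategy.run` is handed to every replica in full, so each GPU may be processing the whole batch. Likewise, `ReduceOp.MEAN` averages the per-replica results instead of concatenating them. The sketch passes the distributed batch straight to `strategy.run` and gathers the per-replica outputs.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync

# Build (or load) the model inside the strategy scope so its weights are mirrored
with strategy.scope():
    # Hypothetical tiny 3D model standing in for the real network
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8, 8, 8, 3)),
        tf.keras.layers.Conv3D(4, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv3D(1, 1, activation="sigmoid"),
    ])

# Hypothetical input: 16 patches, already expanded to 3 channels
data = np.random.rand(16, 8, 8, 8, 3).astype(np.float32)

# The dataset is batched with the GLOBAL batch size; each replica gets a slice
global_batch_size = 4 * num_replicas
dataset = tf.data.Dataset.from_tensor_slices(data).batch(global_batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

def replica_fn(x):
    # Runs once per replica on that replica's slice of the batch
    return model(x, training=False)

@tf.function
def inference_step(dist_batch):
    # dist_batch is passed through untouched so it stays sharded
    return strategy.run(replica_fn, args=(dist_batch,))

outputs = []
for dist_batch in dist_dataset:
    per_replica = inference_step(dist_batch)
    # One tensor per replica; concatenate rather than average
    local = strategy.experimental_local_results(per_replica)
    outputs.append(tf.concat(local, axis=0).numpy())

predicted = np.concatenate(outputs) >= 0.5
```

Since the patches have to be stitched back into a volume, it is worth checking on the actual setup that the concatenated order matches the input order, for example by comparing against a single-GPU run on a small volume.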

As for 4, what do you mean by optimized data format ?
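The suggestion does not say which format is meant. One possible reading is to keep the data as float32 tensors in a `tf.data` pipeline, in the channels-last layout that Keras Conv3D layers use by default, instead of converting each batch to NumPy inside the loop. A minimal sketch under that assumption, with hypothetical shapes and batch size:

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: 16 single-channel patches of 8x8x8 voxels
patches = np.random.rand(16, 8, 8, 8).astype(np.float32)

def to_three_channels(patch):
    # (p, p, p) -> (p, p, p, 1) -> (p, p, p, 3), channels last
    patch = tf.expand_dims(patch, axis=-1)
    return tf.repeat(patch, repeats=3, axis=-1)

dataset = (
    tf.data.Dataset.from_tensor_slices(patches)
    .map(to_three_channels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    # Prepare the next batch while the current one is being processed
    .prefetch(tf.data.AUTOTUNE)
)

first_batch = next(iter(dataset))
print(first_batch.shape, first_batch.dtype)
```

With preprocessing inside the pipeline, the host can prepare the next batch while the GPUs are busy, which may help with low GPU utilization.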

I guess it’s hard to understand the objective of my code by just looking at the inference function. All I know here is that it takes as much time as if I was using the tf.saved_model.load function + tensorflow_graph.signatures. Maybe this makes sense?

What I do know is that when I run this piece of code, both GPUs are used. Maybe not correctly ^^

Thanks for all the help !