Simple benchmarking of model prediction times


I’m new to TensorFlow and have been trying to get a simple benchmark running to measure the prediction latency of a model. My source code is as follows.

import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.config.optimizer.set_jit(True) # Enable XLA

def run(model, inputs, iters):
    total_time = 0
    for i in range(iters):
        start = time.perf_counter()
        model(*inputs, training=False)
        end = time.perf_counter()
        print(1000*(end - start))
        total_time += end - start
    return (1000 * total_time) / iters

batch_size = 32
head_size = 64
num_heads = 8
model_dim = 512
max_len = 512

inputs = layers.Input(shape=(max_len, model_dim))
outputs = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size, value_dim=head_size, use_bias=False)(inputs, inputs)

model = keras.Model(inputs=inputs, outputs=outputs)

x = tf.random.uniform((batch_size, max_len, model_dim), minval=0, maxval=1)
run(model, [x], 100)
print(run(model, [x], 100))

I observe that the execution latency is low for the first few iterations and then rises to plateau at a higher value (I’ve attached a plot of execution time vs. iterations below). I suspect this is due to memory management, but I’m not sure. How do I perform this kind of benchmarking correctly? My environment is as follows:

Software: TensorFlow 2.5.0, Python 3.8.10, running in a Docker container with CUDA 11.
Hardware: GeForce GTX TitanX


Have you tried to analyze this with our profiler?

Thanks for that suggestion! It looks to me like this is happening because kernels execute asynchronously on the GPU. When I run the same program on a CPU, the variation in run times I described above vanishes, and the trace the profiler shows seems to suggest the same thing. Does that seem right to you? If so, is there a way to synchronize kernel execution in TensorFlow, say after every call to model() in my code above?

Have you tried accessing the output?

I have, but wouldn’t that also include data transfer times? I’m hoping to measure only the execution time.
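One way to keep the copy small while still forcing the GPU work to finish is to reduce the output to a scalar on the device and read back only that value. The sketch below is an illustration, not a confirmed recommendation from this thread: `benchmark` and `tf_example` are hypothetical helper names, the `Dense` model is a stand-in for the attention model above, and the extra `tf.reduce_sum` kernel adds a small amount of time of its own. It also runs untimed warm-up iterations so that one-off costs stay out of the average.

```python
import statistics
import time


def benchmark(step, sync, warmup=10, iters=100):
    """Return (mean, median) latency of `step()` in milliseconds.

    `sync(result)` is called before the clock stops so that any
    asynchronously launched work is finished before timing ends.
    """
    # Warm-up iterations absorb one-off costs (tracing, XLA compilation,
    # memory allocation) and are not timed.
    for _ in range(warmup):
        sync(step())
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        sync(step())  # block until the result is actually available
        times_ms.append(1000 * (time.perf_counter() - start))
    return statistics.mean(times_ms), statistics.median(times_ms)


def tf_example():
    """Hypothetical usage with a small stand-in model (not the one from the post)."""
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(512)])
    x = tf.random.uniform((32, 512, 512))
    # Reducing on the device and reading back one float forces the pending
    # kernels to finish while copying only a scalar to the host.
    return benchmark(
        step=lambda: model(x, training=False),
        sync=lambda out: tf.reduce_sum(out).numpy(),
    )
```

Calling `tf_example()` returns the mean and median in milliseconds. Comparing the two gives a quick check on whether a few slow iterations are skewing the average.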

With the profiler’s trace details, you can exclude the GPU-to-CPU transfer time.
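A minimal sketch of capturing such a trace programmatically, assuming the `tf.profiler.experimental` API available in TensorFlow 2.x. The helper name `profile_steps`, the trace name "predict", and the log directory are hypothetical choices for illustration. The resulting trace can be opened in TensorBoard's Profile tab, where kernel execution and memory copies appear as separate events.

```python
def profile_steps(step, logdir, iters=10):
    """Run `step()` `iters` times under the TensorFlow profiler.

    `step` is any zero-argument callable, for example
    `lambda: model(x, training=False)`. The trace is written to `logdir`.
    """
    import tensorflow as tf

    tf.profiler.experimental.start(logdir)
    try:
        for i in range(iters):
            # Mark each iteration as a step so it is grouped in the trace viewer.
            with tf.profiler.experimental.Trace("predict", step_num=i, _r=1):
                step()
    finally:
        # Always stop the profiler so the trace is flushed to disk.
        tf.profiler.experimental.stop()
```

Running a few untimed warm-up calls before `profile_steps` keeps compilation and allocation out of the captured steps.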