Simple benchmarking of model prediction times


I’m new to TensorFlow and have been trying to get a simple benchmark running to measure the prediction latency of a model. My source code is as follows.

import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.config.optimizer.set_jit(True) # Enable XLA

def run(model, inputs, iters):
    total_time = 0
    for i in range(iters):
        start = time.perf_counter()
        model(*inputs, training=False)
        end = time.perf_counter()
        print(1000*(end - start))
        total_time += end - start
    return (1000 * total_time) / iters

batch_size = 32
head_size = 64
num_heads = 8
model_dim = 512
max_len = 512

inputs = layers.Input(shape=(max_len, model_dim))
outputs = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size, value_dim=head_size, use_bias=False)(inputs, inputs)

model = keras.Model(inputs=inputs, outputs=outputs)

x = tf.random.uniform((batch_size, max_len, model_dim), minval=0, maxval=1)
run(model, [x], 100)
print(run(model, [x], 100))

I observe that the execution latency is low for the first few iterations and then rises to plateau at a higher value (I’ve attached a plot of execution time vs. iterations below). I suspect this is due to memory management, but I’m not sure. How do I go about performing such benchmarking correctly? My environment is as follows:

Software: TensorFlow 2.5.0, Python 3.8.10. I’m using a docker container running CUDA 11.
Hardware: GeForce GTX TitanX


Have you tried to analyze this with our profiler?

Thanks for that suggestion! It looks to me like this is happening due to asynchronous execution of kernels on the GPU. When I run the same program on a CPU, the variation in run times described above vanishes. The trace the profiler shows also seems to suggest the same thing. Does that seem right to you? If yes, is there a way to synchronize kernel execution in TensorFlow, say after every call to model() in my code above?

Have you tried accessing the output?
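One way to read this suggestion, as a minimal sketch rather than an official recipe: force each call to finish before stopping the clock by reading its result back. The helper names `benchmark` and `summarize` are hypothetical, and the `sync` callback is an assumption about how you choose to block. In eager TensorFlow, calling `.numpy()` on the output waits until the value is ready, but the measurement then also includes the device-to-host copy.

```python
import statistics
import time


def benchmark(fn, sync, warmup=10, iters=100):
    """Return per-call latencies in milliseconds.

    fn   -- zero-argument callable that runs one prediction
    sync -- callable that receives fn's result and blocks until it is
            actually computed (hypothetical hook; see the usage comment below)
    """
    # Warm-up calls are executed but not timed, so one-off costs
    # (tracing, compilation, allocation) do not skew the numbers.
    for _ in range(warmup):
        sync(fn())

    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        result = fn()
        sync(result)  # stop the clock only after the work has finished
        times_ms.append(1000 * (time.perf_counter() - start))
    return times_ms


def summarize(times_ms):
    """Collapse a list of latencies into a few robust statistics."""
    return {
        "mean": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }


# Usage with the model and x from the original post (assumes eager TensorFlow):
#   times = benchmark(lambda: model(x, training=False),
#                     sync=lambda out: out.numpy())
#   print(summarize(times))
```

Reporting the median alongside the mean makes a few slow outlier iterations easier to spot than a single average would.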

I have, but wouldn’t that also include data transfer times? I’m hoping to measure only the execution time.

With the profiler details you could exclude the GPU->CPU transfer.
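A rough sketch of capturing such a trace programmatically, assuming the `tf.profiler.experimental` API (`start`, `stop`, and the `Trace` context manager) that ships with TensorFlow 2.x. The helper `trace_steps`, the event name "predict", and the log directory are hypothetical. The `profiler` parameter is there only so the sketch can be exercised without TensorFlow installed.

```python
def trace_steps(step_fn, logdir, steps=10, profiler=None):
    """Run step_fn() `steps` times inside a profiler session.

    step_fn  -- zero-argument callable, e.g. one model call
    logdir   -- directory where the profile is written (hypothetical path)
    profiler -- object exposing start/stop/Trace; defaults to
                tf.profiler.experimental (assumption: TensorFlow 2.x)
    """
    if profiler is None:
        import tensorflow as tf  # imported lazily so the helper loads without TF
        profiler = tf.profiler.experimental

    profiler.start(logdir)
    try:
        for step in range(steps):
            # Mark each call as its own step so it shows up separately
            # in the trace.
            with profiler.Trace("predict", step_num=step, _r=1):
                step_fn()
    finally:
        # Always close the session, even if step_fn raises.
        profiler.stop()


# Usage with the model and x from the original post:
#   trace_steps(lambda: model(x, training=False), "/tmp/tf_profile", steps=20)
```

The resulting profile can then be opened in TensorBoard's Profile tab, where the trace viewer typically lists kernel execution and device-to-host copies as separate events, so the copy time can be read off and excluded.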