I’m new to TensorFlow and have been trying to get a simple benchmark running to measure the predication latency for a model. My source code is as follows.
import time import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers tf.config.optimizer.set_jit(True) # Enable XLA def run(model, inputs, iters): total_time = 0 for i in range(iters): start = time.perf_counter() model(*inputs, training=False) end = time.perf_counter() print(1000*(end - start)) total_time += end - start return (1000 * total_time) / iters batch_size = 32 head_size = 64 num_heads = 8 model_dim = 512 max_len = 512 inputs = layers.Input(shape=(max_len, model_dim)) outputs = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size, value_dim=head_size, use_bias=False)(inputs, inputs) model = keras.Model(inputs=inputs, outputs=outputs) x = tf.random.uniform((batch_size, max_len, model_dim), minval=0, maxval=1) run(model, [x], 100) print(run(model, [x], 100))
I observe that the execution latency is low for the first few iterations and then it rises to plateau at a different, but higher value (I’ve attached a plot of the execution time vs iterations below). I suspect this is due to memory management but I’m not sure. How do I go about performing such benchmarking correctly? My environment is as follows:
Software: TensorFlow 2.5.0, Python 3.8.10. I’m using a docker container running CUDA 11.
Hardware: GeForce GTX TitanX