Pruning doesn't reduce inference time

I have a model trained on the GTSRB dataset that I want to prune. I applied the following pruning schedule:

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.9,   # high target sparsity (exact values illustrative)
        begin_step=0,
        end_step=end_step),
}

and ran the pruning fit method for 100 epochs, with the final goal of demonstrating that a highly pruned model has a faster inference time. I converted both the initial trained model and the final pruned model to TensorFlow Lite, loaded them with the TFLite Interpreter, and recorded the average inference time over 100 executions:

import tensorflow as tf
from datetime import datetime

x_test_norm, y_test = load_data()

# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="simplified.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one test image (keep the batch dimension).
interpreter.set_tensor(input_details[0]['index'], x_test_norm[:1])
deltas = []

for _ in range(100):
    begin = datetime.now()
    interpreter.invoke()
    end = datetime.now()
    deltas.append(end - begin)

deltas = [x.total_seconds() for x in deltas]

Unfortunately the recorded times are almost identical. My questions are: am I doing something wrong during the pruning phase? Shouldn't the pruned model use sparse math operations and thus be much faster than the non-pruned model? Is there a way to force TensorFlow Lite to use sparse operators to decrease the inference time?

The two TensorFlow Lite models are run on a Jetson Nano with TensorFlow 2.4.1, which is not the latest version but is fully supported.
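As an aside, the timing loop above can be made more reliable with time.perf_counter (a monotonic, high-resolution clock) and a warm-up phase; this is a self-contained sketch, with benchmark and its arguments being my own naming:

```python
import time

def benchmark(fn, warmup=10, runs=100):
    """Return the mean wall-clock time of fn() in seconds over `runs` calls."""
    # Warm-up calls so one-time costs (allocation, caches) are excluded.
    for _ in range(warmup):
        fn()
    deltas = []
    for _ in range(runs):
        begin = time.perf_counter()
        fn()
        deltas.append(time.perf_counter() - begin)
    return sum(deltas) / len(deltas)

# With a TFLite interpreter this would be used as:
# mean_s = benchmark(interpreter.invoke)
```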

It seems you are using random unstructured pruning, which I don’t think provides any acceleration. You could use structural (e.g. 2:4) pruning:

They mention:

Compared to random sparsity, structured sparsity generally has lower accuracy due to the restrictive structure; however, it can reduce inference time significantly on the supported hardware.

Also, I am not sure if CPUs can benefit from sparse inference. The supported hardware I know of is CUDA GPUs:

I tried applying structural pruning:

batch_size = BATCH_SIZE
epochs = 50

pruning_params_2_by_4 = {
    'sparsity_m_by_n': (2, 4),
}

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params_2_by_4)

But the execution time still doesn't improve: on average 2.01986 ms for the pruned model versus 2.07721 ms for the regular one.

Regarding cuSPARSELt, the problem is that I don't want to use CUDA directly; I would like to use TensorFlow on my GPU.
But I am wondering whether either TF or TFLite exploits the sparsity of the model to improve execution times, or whether this is not a behaviour that I should expect.
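One concrete thing worth trying: the TFLite converter has an experimental sparsity optimization, tf.lite.Optimize.EXPERIMENTAL_SPARSITY, which stores weights in a sparse layout; note that whether a sparse kernel is actually dispatched still depends on the op and the hardware, so speedups are not guaranteed. A minimal sketch, assuming a Keras model is available (the helper name and the tiny model are mine):

```python
import tensorflow as tf

def convert_with_sparsity(model):
    # Convert a Keras model to TFLite, asking the converter to store
    # weight tensors in a sparse format where the weights are sparse.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
    return converter.convert()

# Example with a tiny placeholder model:
tiny = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
tflite_bytes = convert_with_sparsity(tiny)
```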