Pruning doesn't reduce inference time

I have a model trained on the GTSRB dataset that I want to prune. I applied the following pruning schedule:

import tensorflow_model_optimization as tfmot

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0,
                                                             final_sparsity=0.99,
                                                             begin_step=0,
                                                             end_step=end_step,
                                                             frequency=1)
}

and ran the pruning fit for 100 epochs, with the final goal of demonstrating that a highly pruned model has a faster execution time at inference. I then converted both the initial trained model and the final pruned model to TensorFlow Lite, loaded each one with the tf.lite.Interpreter, and recorded the average inference time over 100 executions.
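The conversion step is roughly the following (a minimal sketch; pruned_model stands for my pruned Keras model, and strip_pruning removes the pruning wrappers before converting):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Remove the pruning wrappers so only the final (sparse) weights are exported.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Convert the stripped Keras model to a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
tflite_model = converter.convert()

with open("simplified.tflite", "wb") as f:
    f.write(tflite_model)

The timing loop for each .tflite file is shown below: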

import datetime

import tensorflow as tf

x_test_norm, y_test = load_data()

# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="simplified.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Test model on input data.
interpreter.set_tensor(input_details[0]['index'], x_test_norm[0])

# Time 100 invocations and average them.
deltas = []
for _ in range(100):
    begin = datetime.datetime.now()
    interpreter.invoke()
    end = datetime.datetime.now()
    deltas.append(end - begin)

deltas = [d.total_seconds() for d in deltas]
print(sum(deltas) / len(deltas))

Unfortunately the recorded times are almost identical. My questions are: am I doing something wrong during the pruning phase? Shouldn't the pruned model use sparse math operations and thus be much faster than the non-pruned model? Is there a way to force TensorFlow Lite to use sparse operators to decrease the inference time?

The two TensorFlow Lite models are run on a Jetson Nano with TensorFlow 2.4.1, which is not the latest version but is fully supported on that platform.

It seems you are using random (unstructured) pruning, which I don't think provides any acceleration on its own. You could try structural (e.g. 2:4) pruning instead.

The TensorFlow Model Optimization guide on structural pruning mentions:

Compare to the random sparsity, the structured sparsity generally has lower accuracy due to restrictive structure, however, it can reduce inference time significantly on the supported hardware.

Also, I am not sure whether CPUs can benefit from sparse inference. The supported hardware I know of is CUDA GPUs with Sparse Tensor Cores, accessed through libraries such as cuSPARSELt.
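That said, if you want to see whether TFLite can exploit the zeros at all, you could re-run your conversion with the converter's experimental sparsity optimization enabled. A minimal sketch, assuming your pruned Keras model after tfmot.sparsity.keras.strip_pruning is called stripped_model and that your TF build exposes tf.lite.Optimize.EXPERIMENTAL_SPARSITY:

import tensorflow as tf

# Ask the converter to take the sparsity of the weights into account when
# encoding the model; any speed-up still depends on the kernels available on
# the target (e.g. an XNNPACK build with sparse kernels).
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
sparse_tflite_model = converter.convert()

with open("sparse.tflite", "wb") as f:
    f.write(sparse_tflite_model)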

I tried applying structural (2:4) pruning:

import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

batch_size = BATCH_SIZE
epochs = 50

# 2:4 structural sparsity: in every block of four weights, two are forced to zero.
pruning_params_2_by_4 = {
    'sparsity_m_by_n': (2, 4),
}
model_for_pruning = prune_low_magnitude(model, **pruning_params_2_by_4)
model_for_pruning.compile(optimizer='adam',
                          loss=custom_loss,
                          metrics=['accuracy'])
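The rest of the pipeline is roughly the following (a sketch; x_train and y_train stand for my GTSRB training data):

import tensorflow_model_optimization as tfmot

# The pruning wrappers only update their masks when this callback runs.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]

model_for_pruning.fit(x_train, y_train,
                      batch_size=batch_size,
                      epochs=epochs,
                      callbacks=callbacks)

# Strip the wrappers before converting, so the exported weights are the sparse ones.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)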

But the execution time still doesn’t improve, with an average of 2.01986 ms for the pruned model and 2.07721 ms for the regular one.

Regarding cuSPARSELt, the problem is that I don't want to use CUDA directly; I would like to keep using TensorFlow on my GPU.
But I am still wondering whether either TF or TFLite actually exploits the sparsity of the model to improve execution times, or whether this is simply not a behaviour I should expect.