How to properly deploy Keras models for inference in Python?

I’m deploying a big Keras model to production and it’s not very clear to me if I should do anything to it to make it more efficient for inference. In tf-v1 I used to prepare frozen graphs, but this is being deprecated in tf-v2 (as far as I understand).

Right now, I’m just loading the model and using the .predict() method to perform inference on .tfrec files.

from tensorflow import keras
model = keras.models.load_model('path/to/location')
model.predict(get_dataset(tfrecord_list, batch_size), verbose=0)

This model won’t be trained anymore, so I’d like to know whether there are any production-specific steps I should take to improve the model’s computational performance (if possible).


Hi @apcamargo, please refer to this documentation to learn how to serve your model using TensorFlow Serving.

Hi @Kiran_Sai_Ramineni. TensorFlow Serving is aimed at web services, right? My model will be distributed within a Python package, so I’m not sure that’s the way to go.

Other than optimizing the model itself, you can try to jit_compile your model:
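The link that followed here did not survive, so as a rough sketch of what this could look like: recent TF 2.x releases accept a jit_compile argument in Model.compile, which asks Keras to run its step functions through XLA. The small Sequential model and random input below are stand-ins I made up for illustration; in your case the model would come from keras.models.load_model.

```python
import numpy as np
from tensorflow import keras

# Small stand-in model (hypothetical); in the thread's setting this would be
# model = keras.models.load_model('path/to/location')
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])

# Random stand-in input, only used to exercise the model.
x = np.random.rand(32, 8).astype("float32")

# Plain forward pass, kept as a reference to compare against.
reference = model(x, training=False).numpy()

# Ask Keras to compile its step functions with XLA, then predict as usual.
model.compile(jit_compile=True)
preds = model.predict(x, batch_size=16, verbose=0)
```

Whether this actually speeds things up depends on the model and hardware, and the first call pays a compilation cost, so it is worth timing both variants on realistic batches.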

Thanks, @Bhack! jit_compile doesn’t work with my model. I still haven’t figured out why.

Generally there is a message in the log saying that something is not supported.

I’m getting an InvalidArgumentError.

InvalidArgumentError: Graph execution error:

Detected at node 'StatefulPartitionedCall' defined at (most recent call last):

I’m not posting the whole log here because that would be off topic. I’m using a complex custom layer that is probably causing this. I’ll try to find the root of the problem.

You can limit the compilation scope on critical functions with @tf.function(jit_compile=True):
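As a minimal sketch of that idea, assuming the expensive part of the model can be isolated in its own function: only dense_block below is compiled with XLA, while the surrounding forward function stays a regular tf.function. Both function names and the toy tensors are hypothetical, chosen just for illustration.

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_block(x, w, b):
    # Hypothetical hot spot: only this function is compiled with XLA.
    return tf.nn.relu(tf.matmul(x, w) + b)

@tf.function
def forward(x, w, b):
    # Regular graph function; ops that XLA cannot handle could live here,
    # outside the compiled scope.
    h = dense_block(x, w, b)
    return tf.reduce_sum(h, axis=-1)

x = tf.ones((4, 3))
w = tf.ones((3, 2))
b = tf.zeros((2,))
out = forward(x, w, b)  # shape (4,)
```

With a custom layer that breaks full-model compilation, the same pattern lets you keep that layer outside the jit-compiled functions and still compile the parts that XLA does support.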

I also suggest profiling your model to understand what is happening:
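One possible way to do this, sketched under the assumption that you use the built-in TensorFlow profiler: wrap a few predict calls between tf.profiler.experimental.start and stop, then inspect the trace in TensorBoard (viewing it needs the separate profiler plugin). The model, input, and temporary log directory below are stand-ins for illustration.

```python
import tempfile

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Stand-in model and data (hypothetical); replace with the loaded model
# and the real tf.data pipeline.
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
x = np.random.rand(64, 8).astype("float32")

# Warm-up call so one-time tracing cost does not dominate the profile.
model.predict(x, batch_size=16, verbose=0)

# Directory where the profiler writes its trace (temporary, for the sketch).
logdir = tempfile.mkdtemp(prefix="tf_profile_")

tf.profiler.experimental.start(logdir)
try:
    preds = model.predict(x, batch_size=16, verbose=0)
finally:
    # Stop the profiler even if predict raises, so the session is closed.
    tf.profiler.experimental.stop()
```

The profile should show whether time goes into the model ops themselves or into the input pipeline (for example, reading and parsing the .tfrec files), which tells you where optimization effort is worth spending.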