I’m trying to load a model quickly to make predictions in a REST API. Loading with `tf.keras.models.load_model` takes ~1 s, which is too slow for what I’m trying to do.
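For reference, the loading I’m timing looks roughly like this (the tiny model built here is just a placeholder so the snippet is self-contained; in my real code the model is loaded from disk at request time):

```python
import time
import tensorflow as tf

# Build and save a tiny placeholder model so the timing is reproducible.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.save("tiny_model.h5")

# This is the call that is too slow for per-request use.
t0 = time.perf_counter()
loaded = tf.keras.models.load_model("tiny_model.h5")
elapsed = time.perf_counter() - t0
print(f"load_model took {elapsed:.3f}s")
```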
What is the fastest way to load a model for inference only?
I know TensorFlow Serving (TFX) exists to do exactly this efficiently, but I already have a REST API that handles other things, and setting up a specialised server just for predictions feels like overkill. How does the TFX server handle this?
Thanks in advance,