Fastest way to load_model for inference

Hi there,

I’m trying to quickly load a model to make predictions in a REST API. The tf.keras.models.load_model call takes ~1s, which is too slow for what I’m trying to do.

What is the fastest way to load a model for inference only?

I know there is a TFX serving server to do exactly this efficiently, but I already have a REST API for other things. Setting up a specialised server just for predictions feels like overkill. How does the TFX server handle this?

Thanks in advance,
Joan

Setting up a specialised server just for predictions feels like overkill.

I don’t think it is overkill, and the alternatives are probably not much simpler than TensorFlow Serving with Docker.
E.g. see:
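For what it’s worth, once the Serving container is running it keeps the model loaded in memory, and an existing REST API only needs to forward requests to it over HTTP. A minimal sketch, assuming a model exported under the name `my_model` and Serving’s default REST port 8501 (the model name, URL, and input shape are placeholders):

```python
import requests  # assumes the `requests` package is installed

# TensorFlow Serving exposes a REST predict endpoint at
#   POST http://<host>:8501/v1/models/<model_name>:predict
# "my_model" and the example input below are placeholders for illustration.
SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

def predict(instances):
    """Forward a batch of inputs to the Serving container and return predictions."""
    response = requests.post(SERVING_URL, json={"instances": instances})
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    print(predict([[1.0, 2.0, 3.0]]))  # input shape depends on your model
```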

Yes, there might be no other way. However, I’m not sure whether the TFX server loads the model from disk on every request?

What I’m trying to achieve is either a very quick way to load a model from disk, or to keep the model in memory somehow so it doesn’t need to be loaded on every request.

I also tried caching, but pickle deserialisation is very expensive and adds ~1.2s. I suspect the built-in load_model does some sort of deserialisation too, which seems to be the killer.

I think you want to use TensorFlow Serving. If your model is small enough to keep in memory, it will keep it there. You can also do SavedModel Warmup.
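For reference, a rough sketch of writing a SavedModel warmup file: it goes under the exported model’s `assets.extra/` directory, and Serving replays the recorded requests right after loading the model, before real traffic arrives. The path, model name, signature, and input tensor below are made-up placeholders, and it needs the `tensorflow-serving-api` package:

```python
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# Write <export_dir>/assets.extra/tf_serving_warmup_requests so that Serving
# runs these requests at load time and the first real request is already "warm".
warmup_path = "/models/my_model/1/assets.extra/tf_serving_warmup_requests"  # placeholder

with tf.io.TFRecordWriter(warmup_path) as writer:
    request = predict_pb2.PredictRequest(
        model_spec=model_pb2.ModelSpec(name="my_model", signature_name="serving_default"),
        inputs={"input_1": tf.make_tensor_proto([[1.0, 2.0, 3.0]])},  # dummy input
    )
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    writer.write(log.SerializeToString())
```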

Hey Robert, thanks for your reply. Yes, I ended up loading the model into memory at server initialisation. It’s a small model, so it works well this way.
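In case it helps anyone landing here, this is roughly the pattern I mean: load the model once when the server process starts and reuse it for every request. A minimal sketch with Flask; the model path, route, and input handling are placeholders:

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at server startup, not per request.
# "model/" is a placeholder for wherever your SavedModel or .keras file lives.
MODEL = tf.keras.models.load_model("model/")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"instances": [[...], [...]]}; adapt to your inputs.
    instances = np.array(request.get_json()["instances"])
    predictions = MODEL.predict(instances)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```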