How to deploy TF Serving for maximum inference throughput on bare metal and Kubernetes?

What’s the proper/best server-side architecture for serving deep-learning models, written mostly in tf.keras, with TensorFlow Serving? Is it advisable to use a specific web framework (e.g., FastAPI, Flask)? Should I run TF Serving alongside a WSGI server (e.g., gunicorn) with some particular configuration (like worker type)? And is it advisable to put TF Serving (or gunicorn with TF Serving, or gunicorn with FastAPI with TF Serving) behind a web server or reverse proxy like nginx (or an nginx API gateway)?
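For concreteness, the kind of layered setup I have in mind is a thin Python web layer that just forwards prediction requests to TF Serving's REST API. Here's a rough sketch using only the standard library (the model name `my_model` is made up; 8501 is TF Serving's default REST port, but the URL is an assumption about my own deployment):

```python
# Sketch of a thin proxy layer in front of TF Serving, assuming a
# TF Serving container is already running and exposing its REST API
# (e.g. tensorflow/serving on port 8501). Model name is hypothetical.
import json
import urllib.request

TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

def build_predict_payload(instances):
    # TF Serving's REST predict endpoint expects a JSON body of the
    # form {"instances": [...]}, one entry per input example.
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances):
    # Forward the request to TF Serving and return its decoded
    # JSON response ({"predictions": [...]} on success).
    req = urllib.request.Request(
        TF_SERVING_URL,
        data=build_predict_payload(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A FastAPI/Flask handler would essentially wrap `predict()` per request, which is why I'm unsure whether that extra hop is worth it versus exposing TF Serving directly behind nginx.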

I would like to know what kinds of configurations have worked for most people, so I can make my own deployment decisions accordingly.

I need to serve multiple deep-learning models at the same time, some on AWS EC2 and some on Kubernetes via AWS EKS. The models change pretty quickly (some need to be replaced as often as every week, while others stay in place for months or even a couple of years). Some models will be accessed by hundreds of thousands of people every second, while others can be served from a single machine. So TF model usage is relatively atypical where I need to deploy.
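To give a sense of the multi-model, fast-changing setup I'm describing, this is roughly the kind of model config I'd expect to pass to TF Serving via `--model_config_file` (model names and base paths below are made up for illustration; `--model_config_file_poll_wait_seconds` lets TF Serving pick up config changes without a restart, which matters for the weekly model swaps):

```
model_config_list {
  config {
    name: "fast_changing_model"
    base_path: "/models/fast_changing_model"
    model_platform: "tensorflow"
    # Keep the two most recent versions loaded so a rollback is instant.
    model_version_policy { latest { num_versions: 2 } }
  }
  config {
    name: "stable_model"
    base_path: "/models/stable_model"
    model_platform: "tensorflow"
  }
}
```

Given that TF Serving can already hot-reload new model versions dropped into `base_path`, my open question is mostly about what sits in front of it and how it's scaled out on EKS.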