Parallel inferencing - image classification model

I have built an image classification model using TF2 and need to do batch prediction. I'm using a p3.8xlarge EC2 instance, which has 4 GPUs and 32 vCPU cores. My model is 40 MB.

My question: is it possible to attach the model to all 4 GPUs? That is, replicate the model onto each of the 4 GPUs and run inference in parallel.

Say I pass 16 samples to the prediction endpoint, and on that machine the model has been loaded onto all 4 GPUs. Can I split the 16 samples into 4 batches of 4 and run the predictions at the same time?
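For context, what I'm imagining is something like the sketch below with `tf.distribute.MirroredStrategy`, which replicates a model across all visible GPUs and splits each batch between the replicas (the tiny model and input shapes here are placeholders, not my actual classifier):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU.
# On a machine without GPUs it falls back to a single replica,
# so this sketch also runs on CPU-only boxes.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for the real 40 MB classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# With 4 GPUs and a global batch of 16, Keras splits the batch
# into 4 sub-batches of 4, runs one per replica, and gathers the
# predictions back in the original order.
samples = np.random.rand(16, 32, 32, 3).astype("float32")
preds = model.predict(samples, batch_size=16)
print(preds.shape)  # (16, 10)
```

If I understand the docs correctly, `predict` handles the splitting and gathering automatically under the strategy scope, so the caller never sees the 4x4 division.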

Out of the box, MLServer supports offloading inference workloads to a pool of workers running in separate processes.
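If you go that route, enabling the worker pool is (as far as I can tell) a matter of setting `parallel_workers` in MLServer's `settings.json`; the value below assumes one worker per GPU on your p3.8xlarge:

```json
{
    "parallel_workers": 4
}
```

Note this gives you process-level parallelism on the serving side; pinning each worker to a specific GPU is a separate concern (typically handled via the environment each worker process sees).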