Parallel inferencing - image classification model

I have built an image classification model using TF2 and need to do batch prediction. I'm using a p3.8xlarge EC2 instance, which has 4 GPUs and 32 vCPU cores. My model is 40 MB.

My question: is it possible to attach the model to all 4 GPUs? That is, replicate the model onto each of the 4 GPUs and run inference in parallel.

Say I pass 16 samples to the prediction endpoint, and on that machine the model has been loaded onto all 4 GPUs. Can I split the 16 samples into 4 batches of 4 and run the predictions at the same time?
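For context, what I'm imagining is something like the sketch below with `tf.distribute.MirroredStrategy`, which replicates a model across all visible GPUs and splits each batch between the replicas (the tiny model and input shapes here are placeholders, not my actual classifier):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU.
# On a machine without GPUs it falls back to a single replica,
# so this sketch also runs on CPU-only boxes.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for the real 40 MB classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# With 4 GPUs and a global batch of 16, Keras splits the batch
# into 4 sub-batches of 4, runs one per replica, and gathers the
# predictions back in the original order.
samples = np.random.rand(16, 32, 32, 3).astype("float32")
preds = model.predict(samples, batch_size=16)
print(preds.shape)  # (16, 10)
```

If I understand the docs correctly, `predict` handles the splitting and gathering automatically under the strategy scope, so the caller never sees the 4x4 division.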

Out of the box, MLServer supports offloading inference workloads to a pool of workers running in separate processes.
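If you go that route, enabling the worker pool is (as far as I can tell) a matter of setting `parallel_workers` in MLServer's `settings.json`; the value below assumes one worker per GPU on your p3.8xlarge:

```json
{
    "parallel_workers": 4
}
```

Note this gives you process-level parallelism on the serving side; pinning each worker to a specific GPU is a separate concern (typically handled via the environment each worker process sees).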