Efficiency of inference from multiple threads on a single GPU

NotBatman · October 27, 2023, 6:02pm

If I have multiple threads running inference on a system with a single GPU, does the GPU efficiently process the requests from multiple TFJS threads in parallel, or is there an overhead cost to switching between and scheduling the different threads that may cause performance bottlenecks?

I am working specifically with TensorFlow.js and the WebGPU backend.

CharlesVelazquez · October 30, 2023, 4:47am

I’d love to know this too. My gut feeling is no, but I’m hoping someone with definitive technical information answers this. I’m thinking no because of my novice understanding of how threading on the CPU differs from how an NVIDIA GPU puts its many CUDA and Tensor cores to work.

Lin_Chen · October 30, 2023, 5:56pm

I haven’t run any experiments for this, but, IIUC, scheduling GPU work from multiple TFJS threads gives no performance gain over scheduling it from a single TFJS thread, because WebGPU itself has a single queue that collects the GPU work from all threads.

Ben_Arnao · October 31, 2023, 4:18am

From my experiments, at least when the batch/model is small enough, the cost of preparing tensors and the like to send to the GPU seems to outweigh the parallelism (if there is even any true GPU-level parallelism).

NotBatman · October 31, 2023, 7:19am

Thanks for the replies. I wanted to share my experiences in case they might be helpful.

I tried spinning up several threads, each running its own TFJS instance for inference, and quickly ran into constant “GPU lost context” errors, which I believe are out-of-memory issues. I’m not positive about that, but the errors were regular enough to make the approach unworkable.

I then switched to having many threads send requests to a single thread that is responsible for all inference. This has eliminated all the errors… Yay! But I bought a 13900K for this project and I am only using a fraction of it, due to what I suspect are GPU bottlenecks.
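
For anyone wanting to try the same funnel pattern, here is a minimal sketch of the idea, not the actual project code. It assumes a hypothetical runBatch function standing in for the real model call (for example, stacking the inputs and calling predict once), and the names InferenceFunnel, enqueue, and flush are made up for illustration. Requests from the other threads are queued on the one inference thread and run together as a batch, so the GPU sees fewer, larger submissions.

```typescript
// Hypothetical stand-in for the real model call: takes N inputs and
// returns N outputs in the same order.
type BatchFn<I, O> = (inputs: I[]) => O[];

interface Pending<I, O> {
  input: I;
  onResult: (output: O) => void;
}

// Lives on the single inference thread. Other threads hand it requests
// (e.g. from a worker message handler) instead of touching the GPU.
class InferenceFunnel<I, O> {
  private readonly runBatch: BatchFn<I, O>;
  private readonly maxBatch: number;
  private pending: Pending<I, O>[] = [];

  constructor(runBatch: BatchFn<I, O>, maxBatch: number) {
    if (maxBatch < 1) throw new Error("maxBatch must be at least 1");
    this.runBatch = runBatch;
    this.maxBatch = maxBatch;
  }

  // Queue one request; run the batch as soon as it is full.
  enqueue(input: I, onResult: (output: O) => void): void {
    this.pending.push({ input, onResult });
    if (this.pending.length >= this.maxBatch) this.flush();
  }

  // Run everything queued so far as one batch and deliver each result
  // to its requester. Returns the number of requests processed.
  flush(): number {
    if (this.pending.length === 0) return 0;
    const batch = this.pending;
    this.pending = [];
    const outputs = this.runBatch(batch.map((p) => p.input));
    if (outputs.length !== batch.length) {
      throw new Error("runBatch must return one output per input");
    }
    outputs.forEach((output, i) => {
      const request = batch[i];
      if (request !== undefined) request.onResult(output);
    });
    return batch.length;
  }
}
```

In a real setup, flush would also need to be called on a short timer so a partially filled batch doesn't wait forever, and the batch size would have to be tuned against available GPU memory.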

I was hoping to up my inference threads to 2 or 3 and maybe find a sweet spot that wouldn’t run into the memory issues and would better utilize my CPU, but Lin_Chen’s comment leads me to believe this would probably be a waste of time to try.