Help with GPU performance

The TensorBoard profiler recommendation for my trainer says:
"39.6 % of the total step time sampled is spent on ‘Kernel Launch’. It could be due to CPU contention with tf.data. In this case, you may try to set the environment variable TF_GPU_THREAD_MODE=gpu_private."

I tried running with that variable set, but it didn’t improve performance.
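
In case the way the variable gets set matters, here is a minimal sketch of doing it from Python. It assumes the variable has to be in the process environment before TensorFlow is imported and the GPU is initialized. The helper name `enable_gpu_private_threads` and the thread count of 2 are only illustrative; setting `TF_GPU_THREAD_COUNT` is optional.

```python
import os


def enable_gpu_private_threads(thread_count=2):
    """Request dedicated GPU kernel-launch threads (hypothetical helper).

    Assumption: this must run before `import tensorflow`, because the
    variables are read when the GPU device is set up.
    """
    os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
    # Optional: number of private threads per GPU (illustrative value).
    os.environ["TF_GPU_THREAD_COUNT"] = str(thread_count)


enable_gpu_private_threads()

# Import TensorFlow only after the environment is configured, e.g.:
# import tensorflow as tf
```

Exporting the variable in the shell before launching the trainer (`TF_GPU_THREAD_MODE=gpu_private python train.py`, with `train.py` standing in for the actual script) should have the same effect.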

I’ve also run the trainer under NVIDIA’s profiler, and I see a lot of gaps in the timeline where the GPU is idle (20 ms or more each). The trainer also has some lookup operations that run on the CPU (as far as I understand), but I am not sure that explains the gaps.

Any guidance or assistance is more than welcome.