Help with GPU performance

Hello,
The TensorBoard profiler's recommendation for my trainer says:
" * 39.6 % of the total step time sampled is spent on ‘Kernel Launch’. It could be due to CPU contention with tf.data. In this case, you may try to set the environment variable TF_GPU_THREAD_MODE=gpu_private."

I tried running with that variable set, but performance did not improve.
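
For reference, here is a minimal sketch of one way to set the variable from Python rather than from the shell. It assumes the variable has to be in the environment before TensorFlow is imported. The helper name `enable_gpu_private_threads` is hypothetical and not part of any library.

```python
import os
import sys


def enable_gpu_private_threads():
    """Set TF_GPU_THREAD_MODE=gpu_private for this process.

    Returns True if TensorFlow had not been imported yet when the
    variable was set. Assumption: the setting only takes effect if it
    is present before TensorFlow is imported.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
    return not already_imported


set_in_time = enable_gpu_private_threads()
if not set_in_time:
    print("Warning: TensorFlow was imported before TF_GPU_THREAD_MODE was set.")

# import tensorflow as tf  # import only after the variable is set
```

Exporting the variable in the shell before launching the trainer should be equivalent. The check above only guards against setting it too late inside the script.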

I’ve also run the trainer under NVIDIA’s profiler, and the timeline shows many gaps of 20 ms or more where the GPU is idle. The trainer also has some lookup operations that, as far as I understand, run on the CPU, but I am not sure that explains the gaps.

Any guidance or assistance is more than welcome.