Slow (2x30s) model load on VM pass-through GPU with NVLink

I’m running TensorFlow on NVLink’ed GPUs that are passed through individually to VMs. I only run them in this configuration some of the time, but I’d like to solve an issue that keeps cropping up with it, since I want to use other frameworks in this setup as well.

It seems that when the second GPU of an NVLink’ed pair (each passed through individually, at the moment) is initialized (with PyTorch, this happens at the torch.cuda() call), there are two ioctls to /dev/nvidia, each of which times out at 30 seconds, one after the other.
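
To illustrate, here is a rough timing sketch (mine, not from the original run) that reproduces the symptom with PyTorch; the cuda:0/cuda:1 device indices are an assumption about how the two passed-through GPUs are enumerated:

import time
import torch

for idx in range(torch.cuda.device_count()):
    start = time.monotonic()
    # A tiny allocation forces CUDA context creation on that device;
    # on the slow setup the second device is where the 2x30s timeouts show up.
    torch.ones(1, device=f"cuda:{idx}")
    print(f"cuda:{idx} initialized in {time.monotonic() - start:.1f}s")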

I noticed that installing a more recent version of TensorFlow fixed this symptom, so I’m curious:

  • What might have changed on the NVIDIA side between versions: the libraries used, the drivers loaded, or the timeouts configured?
  • Is this likely addressed in TensorFlow itself, or in an upstream dependency: nvcc arguments or defaults, cuDNN usage or compile-time options, or something else? (See the build-info check just below.)
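
One way I compare the two installs (a sketch; the exact keys in the returned dict can vary by build) is to dump TensorFlow’s build-time dependency versions, which shows which CUDA and cuDNN each wheel was compiled against:

from pprint import pprint
import tensorflow as tf

# Build-time info such as cuda_version and cudnn_version; comparing this
# between the slow and fast installs narrows down which upstream piece changed.
pprint(dict(tf.sysconfig.get_build_info()))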

Slow, before:

ubuntu@host:~$ python -c 'from pprint import pprint as print; import tensorflow as tf; print([(k,v) for k,v in tf.version.__dict__.items() if "VERSION" in k])'
[('COMPILER_VERSION', '9.3.0'),
 ('GIT_VERSION', 'unknown'),
 ('GRAPH_DEF_VERSION', 808),
 ('GRAPH_DEF_VERSION_MIN_CONSUMER', 0),
 ('GRAPH_DEF_VERSION_MIN_PRODUCER', 0),
 ('VERSION', '2.6.0')]

Fast, after:

(tf2) ubuntu@host:~$ python -c 'from pprint import pprint as print; import tensorflow as tf; print([(k,v) for k,v in tf.version.__dict__.items() if "VERSION" in k])'
[('COMPILER_VERSION', '7.3.1 20180303'),
 ('GIT_VERSION', 'v2.7.0-rc1-69-gc256c071bb2'),
 ('GRAPH_DEF_VERSION', 898),
 ('GRAPH_DEF_VERSION_MIN_CONSUMER', 0),
 ('GRAPH_DEF_VERSION_MIN_PRODUCER', 0),
 ('VERSION', '2.7.0')]
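
For completeness, this is the rough check I use to time GPU initialization on each install (a measurement sketch only, using standard TF 2.x config APIs); on the slow install this is where the two 30-second timeouts appear:

import time
import tensorflow as tf

start = time.monotonic()
physical = tf.config.list_physical_devices("GPU")   # driver enumeration
logical = tf.config.list_logical_devices("GPU")     # forces runtime/context creation per GPU
print(f"{len(physical)} physical / {len(logical)} logical GPUs in {time.monotonic() - start:.1f}s")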