JIT Compilation failure with cudatoolkit installed via conda

limadev · July 12, 2022, 3:42am

Hi everyone,
I am investigating a problem that seems to be caused by a mismatch between tensorflow and cudatoolkit installed via conda. I was trying to run the code from the official SimCLR repository. I installed tensorflow in a conda environment created following the official documentation. When I tried to run a training job as specified here in the pretraining section, using a single-GPU configuration, I started to see a JIT compilation error and a failure related to libdevice not being found, as shown below:

error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
Traceback (most recent call last):
  File "/home/matheus/development/simclr/tf2/run.py", line 671, in <module>
    app.run(main)
  File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/matheus/development/simclr/tf2/run.py", line 647, in main
    train_multiple_steps(iterator)
  File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'mod' defined at (most recent call last):
    File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 930, in _bootstrap
      self._bootstrap_inner()
    File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
      self.run()
    File "/home/matheus/development/simclr/tf2/run.py", line 572, in single_step
      should_record = tf.equal((optimizer.iterations + 1) % steps_per_loop, 0)
Node: 'mod'
Detected at node 'mod' defined at (most recent call last):
    File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 930, in _bootstrap
      self._bootstrap_inner()
    File "/home/matheus/miniconda3/envs/simclr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
      self.run()
    File "/home/matheus/development/simclr/tf2/run.py", line 572, in single_step
      should_record = tf.equal((optimizer.iterations + 1) % steps_per_loop, 0)
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[Func/while/body/_1/image/write_summary/summary_cond/then/_894/input/_907/_26]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_multiple_steps_17859]

After a lot of back and forth, it looks like the problem is with the conda package of the CUDA toolkit. It looks like tensorflow looks for a library called libdevice in the directory ${CUDA_DIR}/nvvm/libdevice, as we can see here. The main problem is that tensorflow seems to look for CUDA at /usr/local/cuda, according to this file.
If that is true, how is tensorflow able to find a cudatoolkit installed using conda, since the binaries are stored in a different path, like ~/miniconda/envs/{my_env}/lib?
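One mechanism that may be relevant to this question: XLA reads the XLA_FLAGS environment variable, and its --xla_gpu_cuda_data_dir flag names the directory in which to look for nvvm/libdevice. The sketch below only builds that environment value. The helper names with_cuda_data_dir and configure_xla are hypothetical, and whether the flag resolves the error in this particular setup is an assumption that has not been verified here.

```python
import os

FLAG = "--xla_gpu_cuda_data_dir="


def with_cuda_data_dir(cuda_dir, existing_flags=""):
    """Return an XLA_FLAGS value that points XLA at cuda_dir.

    Any earlier --xla_gpu_cuda_data_dir entry is dropped so the new one wins.
    Assumes individual flags contain no spaces.
    """
    kept = [f for f in existing_flags.split() if not f.startswith(FLAG)]
    kept.append(FLAG + cuda_dir)
    return " ".join(kept)


def configure_xla(cuda_dir, environ=os.environ):
    """Write the merged flags into environ (hypothetical helper).

    Call this before tensorflow is imported so the value is in place
    when XLA reads it.
    """
    environ["XLA_FLAGS"] = with_cuda_data_dir(cuda_dir, environ.get("XLA_FLAGS", ""))
    return environ["XLA_FLAGS"]
```

The directory passed in would be one that contains nvvm/libdevice, for example the conda environment prefix if the files are laid out that way there.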
Also, I was taking a look at the conda-forge cudatoolkit repository and found something interesting. It looks like the package copies the libdevice file from /nvvm/libdevice directly into the lib folder of cudatoolkit, and tensorflow is not able to find it later because the nvvm/libdevice folder structure is not preserved. Does that make sense?
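A quick way to check this hypothesis on a given machine is sketched below: search the environment for libdevice bitcode files, test whether a directory has the nvvm/libdevice layout from the error message, and, if not, copy a file into that layout. The function names are hypothetical, and the libdevice*.bc filename pattern is an assumption about how the package names the file.

```python
import fnmatch
import os
import shutil

PATTERN = "libdevice*.bc"  # assumed filename pattern, e.g. libdevice.10.bc


def find_libdevice(prefix):
    """Return sorted paths of all libdevice bitcode files under prefix."""
    matches = []
    for root, _dirs, files in os.walk(prefix):
        for name in files:
            if fnmatch.fnmatch(name, PATTERN):
                matches.append(os.path.join(root, name))
    return sorted(matches)


def has_expected_layout(cuda_dir):
    """True if cuda_dir/nvvm/libdevice exists and holds a libdevice file."""
    libdevice_dir = os.path.join(cuda_dir, "nvvm", "libdevice")
    if not os.path.isdir(libdevice_dir):
        return False
    return any(fnmatch.fnmatch(n, PATTERN) for n in os.listdir(libdevice_dir))


def stage_libdevice(bc_path, cuda_dir):
    """Copy bc_path into cuda_dir/nvvm/libdevice, creating it if needed."""
    target_dir = os.path.join(cuda_dir, "nvvm", "libdevice")
    os.makedirs(target_dir, exist_ok=True)
    return shutil.copy2(bc_path, target_dir)
```

Inside the activated environment, find_libdevice(os.environ["CONDA_PREFIX"]) would show where the file actually landed, and has_expected_layout on the same prefix would show whether the folder structure from the error message is present.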
I am interested in contributing a fix for this issue with a PR if that is the case.
I appreciate any help here.

Renu_Patel · September 25, 2023, 9:09am

Hi @limadev

Welcome to the TensorFlow Forum!

Please provide some more details on the issue, including which operating system you are using, as the instructions for installing TensorFlow differ slightly between operating systems. Please follow the step-by-step instructions for your OS on this official TF Install page to install TensorFlow.

Please verify that all the hardware and software requirements are satisfied, make sure that the CUDA and cuDNN versions you install are compatible with your specific TensorFlow version as listed in this tested build configuration, and set the paths for this software.

Please let us know if the issue still persists. Thank you.