Tensorflow not detecting GPU in Ubuntu 20.04 while using Conda

I’m an ML researcher trying to get TensorFlow to detect an RTX 3060 GPU for training models. I’m confident the problem is a version mismatch, but I cannot tell which component has the wrong version.

Specs:
OS: Ubuntu 20.04 LTS
GPU: RTX 3060
CUDA: version 12.2
Tensorflow: version 2.12.0 (latest version available in conda)

I tried to upgrade TF to version 2.15.0 using pip instead of conda, and this completely broke my TensorFlow install inside conda’s base env.

Running `nvidia-smi` displays my GPU and CUDA version correctly.

Running `python -c "import tensorflow as tf"` no longer works; it fails with errors about mismatched dependencies (namely libgcc, but there may be more).

Running `conda install cuda` as instructed by the CUDA docs does not solve the issue.

Running `conda install tensorflow[and-cuda]` as instructed by the TensorFlow docs does not solve the issue.

TensorFlow 2.15.0 is available on conda-forge, but even after installing it my TF install remains broken.

A complete reinstall of conda is an option but I’m avoiding it because of time constraints.

How can I fix the TF install AND have it detect GPUs in this scenario?

Welcome @Luiz_Costa to the TensorFlow community :slight_smile:

You can try the following steps (it should take just a few minutes):

  1. Create a fresh conda virtual environment and activate it,
  2. `pip install --upgrade pip`,
  3. `pip install tensorflow[and-cuda]`,
  4. Set environment variables:

Locate the directory of the conda environment by running this in the terminal:

echo $CONDA_PREFIX

Enter that directory and create these subdirectories and files:

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

# Store original LD_LIBRARY_PATH 
export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" 

# Get the CUDNN directory 
CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))

# Set LD_LIBRARY_PATH to include the cuDNN lib directories
# (strip the trailing colon from find's output to avoid an empty path entry)
export LD_LIBRARY_PATH="$(find "${CUDNN_DIR}"/*/lib/ -type d -printf "%p:" | sed 's/:$//')${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Get the ptxas directory  
PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))

# Set PATH to include the directory containing ptxas
# (strip the trailing colon from find's output to avoid an empty path entry)
export PATH="$(find "${PTXAS_DIR}"/*/bin/ -type d -printf "%p:" | sed 's/:$//')${PATH:+:${PATH}}"

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

# Restore original LD_LIBRARY_PATH
export LD_LIBRARY_PATH="${ORIGINAL_LD_LIBRARY_PATH}"

# Unset helper variables
unset ORIGINAL_LD_LIBRARY_PATH
unset CUDNN_DIR
unset PTXAS_DIR
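If you want to sanity-check what the `find … -printf "%p:"` expression in the activation script actually produces before wiring it into `LD_LIBRARY_PATH`, here is a self-contained sketch against a throwaway directory tree (the paths are illustrative, not the real nvidia package layout; `-printf` is GNU find):

```shell
#!/bin/sh
# Mock layout resembling the nvidia pip packages' per-library lib/ dirs
DEMO=$(mktemp -d)
mkdir -p "$DEMO/cudnn/lib" "$DEMO/cublas/lib"

# Join every matching lib directory into a colon-separated list,
# exactly as the activation script does for LD_LIBRARY_PATH
LIBS=$(find "$DEMO"/*/lib/ -type d -printf "%p:")
echo "$LIBS"   # e.g. …/cublas/lib/:…/cudnn/lib/:

rm -r "$DEMO"
```

Note the trailing colon on the joined list — that is why the activation script should trim it before appending the old `LD_LIBRARY_PATH`.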
  5. Verify the GPU setup:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

I hope it helps!

1 Like

Thank you for the quick reply. Unfortunately, this is a shared research machine in my institution and I don’t have enough disk space to fully execute step 3. I need more time to ask colleagues what can be removed to free up space. It might take a while.

Meanwhile, the TensorFlow version being pulled by pip in step 3 is 2.13.1, which gives me this warning:

WARNING: tensorflow 2.13.1 does not provide the extra 'and-cuda'

Is this relevant?

1 Like

The best approach is to completely remove the broken environment (it is not time consuming).

If you need to save a full list of the installed packages and their versions before removing the broken environment, so you can reinstall them (where applicable) in a new, fresh conda virtual environment, you can simply run:

conda list -e > requirements.txt
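The exported file can later seed the replacement environment, since `conda create --file` accepts this format (the environment name `tf` here is just an example):

```shell
# Recreate an environment from the exported package list
conda create -n tf --file requirements.txt
```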

To completely remove the broken environment (replace <virtual_environment_name> with the actual conda virtual environment name), just run:

conda remove -n <virtual_environment_name> --all

Then create a fresh conda virtual environment with Python 3.11, which is compatible and tested, by running:

conda create -n tf python=3.11
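Putting the pieces together, the full recovery sequence from my first answer looks roughly like this (a sketch; it assumes pip resolves a TensorFlow ≥ 2.14, since the `and-cuda` extra does not exist in earlier releases — which is exactly what your 2.13.1 warning was telling you):

```shell
# Fresh environment with a Python version TensorFlow 2.15 supports
conda create -n tf python=3.11
conda activate tf

# Quoting the extra avoids shell glob expansion of the brackets (e.g. in zsh)
pip install --upgrade pip
pip install 'tensorflow[and-cuda]'

# Confirm the GPU is visible
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```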

I hope it helps!

Thank you so much for your time @sotiris.gkouzias.

Unfortunately, time is something my team did not have, and we have decided to move to a cloud-based GPU solution instead of training the model locally. We did have to pay, but it resulted in faster training times, and we now have remote access to the GPU and can work from home on weekends.

Despite this choice speeding up our research, it leaves this post without a solution for future readers.

For this, my sincere apologies; it’s completely understandable if this post gets removed.
At least now I know how to fix the conda environment the next time I get to work with the Ubuntu machine!

1 Like

@Luiz_Costa :slight_smile: this discussion can be useful in many ways. Actually there’s nothing to apologise for! Problem-solving is a skill no matter if you are a researcher, a developer or … an accountant! I wish you the best.