Nvidia-smi does not show ML tasks

On an NVIDIA-equipped machine I can now run the training with TF 2.6, but nvidia-smi does not show anything. If code pushes kernels onto the GPU, it should show up with a PID in the process list (below).
How do I verify whether the training runs on the GPU or on the CPU? (A short verification sketch follows the nvidia-smi output below.)
As a comparison, when I run a simple vector-algebra kernel that I explicitly push to the GPU, nvidia-smi correctly shows the name of the executable (a.out):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 40%   31C    P2    52W / 215W |    298MiB /  7981MiB |     49%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       973      G   /usr/lib/xorg/Xorg                 29MiB |
|    0   N/A  N/A      1207      G   /usr/bin/gnome-shell                7MiB |
|    0   N/A  N/A      3085      C   ./a.out                           257MiB |
+-----------------------------------------------------------------------------+
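One way to answer the GPU-vs-CPU question from inside TensorFlow itself, independent of nvidia-smi, is to list the visible GPUs and turn on device placement logging before running an op. A minimal sketch for TF 2.x:

import tensorflow as tf

# An empty list here means TensorFlow cannot see any GPU and everything runs on the CPU.
print(tf.config.list_physical_devices('GPU'))

# Log the device every op is placed on; lines ending in .../device:GPU:0 confirm GPU execution.
tf.debugging.set_log_device_placement(True)

# Any small op will do; its .device attribute shows where it actually ran.
a = tf.random.uniform((1000, 1000))
b = tf.matmul(a, a)
print(b.device)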

Can you verify that you have visible GPUs:

https://www.tensorflow.org/api_docs/python/tf/config/get_visible_devices

OK, I see some problems there; it looks like a number of library path issues. Thanks for pointing me in that direction.

>>> tf.config.get_visible_devices()
2021-10-23 08:33:36.075703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-23 08:33:36.076294: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076383: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076443: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.110913: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111172: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111402: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111443: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

I did install CUDA, so I am not sure why it is not finding the libraries:
sudo apt update && sudo apt install cuda-10-1
sudo apt install libcudnn7

find /usr/local -name libcudart.so.11.0 (finding none)
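A quick way to check whether the libraries TensorFlow complains about are loadable in the current environment is to dlopen them directly. This is a small sketch using only the standard library; the sonames are taken from the dlerror messages above, and ctypes uses the same search path (ldconfig cache plus LD_LIBRARY_PATH) as TensorFlow's dlopen:

import ctypes

for name in ['libcudart.so.11.0', 'libcublas.so.11', 'libcudnn.so.8']:
    try:
        ctypes.CDLL(name)
        print(name, 'found')
    except OSError as err:
        print(name, 'NOT found:', err)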

If you can, it is easier to have all the dependencies ready with our official Docker images: Docker | TensorFlow

It looks like TensorFlow is expecting one thing and I am finding something different:
root@nonroot-MS-7B22:~/dev-learn/gpu/cuda/linux/cuda-by-example/p188# tree -f /usr/local | grep -i  libcudart
β”‚   β”‚   β”‚       β”œβ”€β”€ /usr/local/cuda-10.1/doc/man/man7/libcudart.7
β”‚   β”‚   β”‚       β”œβ”€β”€ /usr/local/cuda-10.1/doc/man/man7/libcudart.so.7
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so -> libcudart.so.10.1
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 -> libcudart.so.10.1.243
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart_static.a
β”‚           β”‚   β”‚   β”œβ”€β”€ /usr/local/lib/python3.6/dist-packages/torch/lib/libcudart-6d56b25a.so.11.0
β”‚           β”‚   β”œβ”€β”€ /usr/local/lib/python3.6/dist-packages/torchvision.libs/libcudart.05b13ab8.so.11.0

Docker is definitely an option, but I also want to track down the issues with the manual installation.

Why is it specifically looking for version 11 when other versions are available?

If you don’t want to use the official Docker image, you need to satisfy the specified requirements manually:
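To see exactly which CUDA and cuDNN versions your installed TensorFlow wheel was built against (and hence why it looks for the *.so.11 / *.so.8 libraries), you can query the wheel's build info. A small sketch, assuming TF 2.3 or newer where tf.sysconfig.get_build_info() is available:

import tensorflow as tf

info = tf.sysconfig.get_build_info()
# For the official TF 2.6 wheels this typically reports CUDA 11.2 and cuDNN 8.1.
print('CUDA build version :', info.get('cuda_version'))
print('cuDNN build version:', info.get('cudnn_version'))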

OK, now I installed CUDA 11.2 and cuDNN 8.2 respectively, and then got this:
sudo apt install cuda-11-2
apt install libcudnn8

>>> tf.config.get_visible_devices()
2021-10-23 09:06:29.654745: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2021-10-23 09:06:29.654790: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: nonroot-MS-7B22
2021-10-23 09:06:29.654800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: nonroot-MS-7B22
2021-10-23 09:06:29.654840: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 495.29.5
2021-10-23 09:06:29.654863: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.57.2
2021-10-23 09:06:29.654869: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.57.2 does not match DSO version 495.29.5 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

Please check the minimum required Nvidia driver and CUDA versions specified in the previous link.
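The mismatch in the log above (kernel module 470.57 vs. libcuda 495.29, with "forward compatibility was attempted on non supported HW") usually means a newer userspace libcuda was pulled in alongside the older kernel driver. A small diagnostic sketch, assuming a Linux host with the NVIDIA kernel module loaded and ldconfig available:

import subprocess

# Version of the NVIDIA kernel module that is actually loaded.
with open('/proc/driver/nvidia/version') as f:
    print(f.readline().strip())

# All userspace libcuda builds the dynamic linker can find; one newer than the
# kernel module (e.g. 495.x) triggers the CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE error above.
out = subprocess.run(['ldconfig', '-p'], capture_output=True, text=True).stdout
print('\n'.join(line.strip() for line in out.splitlines() if 'libcuda.so' in line))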


I reinstalled everything, including Ubuntu, because otherwise the older NVIDIA packages did not seem to be completely removed. It works now! Thanks.
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       970      G   /usr/lib/xorg/Xorg                100MiB |
|    0   N/A  N/A      1163      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      1430      G   …setup/gnome-initial-setup          2MiB |
|    0   N/A  N/A      4370      C   python3                          6941MiB |
+-----------------------------------------------------------------------------+
1563/1563 [==============================] - 5s 3ms/step - loss: 0.7987 - accuracy: