Nvidia-smi does not show ML tasks

On an NVIDIA-equipped machine I can now run the training with TF 2.6, but nvidia-smi does not show anything. If code pushes kernels onto the GPU, it should show up with a PID in the process list (below).
How do I verify whether the training runs on the GPU or on the CPU? (A short verification sketch follows the nvidia-smi output below.)
As a comparison, when I run a simple vector-algebra kernel that I explicitly push to the GPU, nvidia-smi correctly shows the name of the executable (a.out):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 40%   31C    P2    52W / 215W |    298MiB /  7981MiB |     49%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       973      G   /usr/lib/xorg/Xorg                 29MiB |
|    0   N/A  N/A      1207      G   /usr/bin/gnome-shell                7MiB |
|    0   N/A  N/A      3085      C   ./a.out                           257MiB |
+-----------------------------------------------------------------------------+
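One way to answer the GPU-vs-CPU question from inside TensorFlow itself, independent of nvidia-smi, is to list the visible GPUs and turn on device placement logging before running an op. A minimal sketch for TF 2.x:

import tensorflow as tf

# An empty list here means TensorFlow cannot see any GPU and everything runs on the CPU.
print(tf.config.list_physical_devices('GPU'))

# Log the device every op is placed on; lines ending in .../device:GPU:0 confirm GPU execution.
tf.debugging.set_log_device_placement(True)

# Any small op will do; its .device attribute shows where it actually ran.
a = tf.random.uniform((1000, 1000))
b = tf.matmul(a, a)
print(b.device)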

Can you verify that you have visible GPUs:

https://www.tensorflow.org/api_docs/python/tf/config/get_visible_devices

OK, I see some problems there; it looks like a number of library path issues. Thanks for pointing me in that direction.

>>> tf.config.get_visible_devices()
2021-10-23 08:33:36.075703: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-23 08:33:36.076294: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076383: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.076443: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.110913: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111172: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111402: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2021-10-23 08:33:36.111443: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

I did install CUDA, so I am not sure why it is not finding the libraries:
sudo apt update && sudo apt install cuda-10-1
sudo apt install libcudnn7

find /usr/local -name libcudart.so.11.0 (finding none)
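A quick way to check whether the libraries TensorFlow complains about are loadable in the current environment is to dlopen them directly. This is a small sketch using only the standard library; the sonames are taken from the dlerror messages above, and ctypes uses the same search path (ldconfig cache plus LD_LIBRARY_PATH) as TensorFlow's dlopen:

import ctypes

for name in ['libcudart.so.11.0', 'libcublas.so.11', 'libcudnn.so.8']:
    try:
        ctypes.CDLL(name)
        print(name, 'found')
    except OSError as err:
        print(name, 'NOT found:', err)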

If you can, it is easier to have all the dependencies ready with our official Docker images: Docker | TensorFlow

It looks like TensorFlow is expecting one thing and I am finding something different:
root@nonroot-MS-7B22:~/dev-learn/gpu/cuda/linux/cuda-by-example/p188# tree -f /usr/local | grep -i  libcudart
β”‚   β”‚   β”‚       β”œβ”€β”€ /usr/local/cuda-10.1/doc/man/man7/libcudart.7
β”‚   β”‚   β”‚       β”œβ”€β”€ /usr/local/cuda-10.1/doc/man/man7/libcudart.so.7
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so -> libcudart.so.10.1
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 -> libcudart.so.10.1.243
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
β”‚   β”‚           β”œβ”€β”€ /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart_static.a
β”‚           β”‚   β”‚   β”œβ”€β”€ /usr/local/lib/python3.6/dist-packages/torch/lib/libcudart-6d56b25a.so.11.0
β”‚           β”‚   β”œβ”€β”€ /usr/local/lib/python3.6/dist-packages/torchvision.libs/libcudart.05b13ab8.so.11.0

Docker is definitely an option, but I also want to track down the issues with the manual installation.

Why is it specifically looking for version 11 when other versions are available?

If you don’t want to use the official Docker image, you need to satisfy the specified requirements manually:
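To see exactly which CUDA and cuDNN versions your installed TensorFlow wheel was built against (and hence why it looks for the *.so.11 / *.so.8 libraries), you can query the wheel's build info. A small sketch, assuming TF 2.3 or newer where tf.sysconfig.get_build_info() is available:

import tensorflow as tf

info = tf.sysconfig.get_build_info()
# For the official TF 2.6 wheels this typically reports CUDA 11.2 and cuDNN 8.1.
print('CUDA build version :', info.get('cuda_version'))
print('cuDNN build version:', info.get('cudnn_version'))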

OK, now I installed CUDA 11.2 and cuDNN 8.2 respectively, and then got this:
sudo apt install cuda-11-2
apt install libcudnn8

>>> tf.config.get_visible_devices()
2021-10-23 09:06:29.654745: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2021-10-23 09:06:29.654790: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: nonroot-MS-7B22
2021-10-23 09:06:29.654800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: nonroot-MS-7B22
2021-10-23 09:06:29.654840: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 495.29.5
2021-10-23 09:06:29.654863: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.57.2
2021-10-23 09:06:29.654869: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.57.2 does not match DSO version 495.29.5 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

Please check the minimum required Nvidia driver and CUDA versions specified in the previous link.
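The mismatch in the log above (kernel module 470.57 vs. libcuda 495.29, with "forward compatibility was attempted on non supported HW") usually means a newer userspace libcuda was pulled in alongside the older kernel driver. A small diagnostic sketch, assuming a Linux host with the NVIDIA kernel module loaded and ldconfig available:

import subprocess

# Version of the NVIDIA kernel module that is actually loaded.
with open('/proc/driver/nvidia/version') as f:
    print(f.readline().strip())

# All userspace libcuda builds the dynamic linker can find; one newer than the
# kernel module (e.g. 495.x) triggers the CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE error above.
out = subprocess.run(['ldconfig', '-p'], capture_output=True, text=True).stdout
print('\n'.join(line.strip() for line in out.splitlines() if 'libcuda.so' in line))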


I reinstalled everything, including Ubuntu, because otherwise the older NVIDIA packages did not seem to be completely removed. It works now! Thanks.
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       970      G   /usr/lib/xorg/Xorg                100MiB |
|    0   N/A  N/A      1163      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      1430      G   …setup/gnome-initial-setup          2MiB |
|    0   N/A  N/A      4370      C   python3                          6941MiB |
+-----------------------------------------------------------------------------+
1563/1563 [==============================] - 5s 3ms/step - loss: 0.7987 - accuracy: