CUDA_ERROR_OUT_OF_MEMORY with Intel HD and 4 x GTX 1070

Hello! I ran into a problem setting up TensorFlow with four GTX 1070 GPUs. I've tried different OS setups (Ubuntu, Debian 10/11/12, Windows 10/11/Server 2022), tried WSL with Miniconda, tried Docker - nothing worked, I got the same error everywhere (Linux, Windows WSL, Docker).

To make a long story short: if you get the error "E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_OUT_OF_MEMORY: out of memory" when calling tf.config.list_physical_devices(), try DISABLING the integrated Intel graphics (in my case Intel HD Graphics 630) in Device Manager on Windows, then disable the NVIDIA cards and re-enable them one by one in a different order, but leave the Intel adapter disabled - this should help! Unfortunately, you have to repeat this after every reboot.
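Since this dance has to be repeated after every reboot, it can in principle be scripted. Below is a minimal sketch, assuming Windows 10 2004+ where pnputil supports /disable-device and /enable-device (it needs an elevated prompt); the device instance IDs are hypothetical placeholders - list the real ones with pnputil /enum-devices /class Display:

```python
import shutil
import subprocess

# Hypothetical instance IDs - replace them with the real ones from:
#   pnputil /enum-devices /class Display
INTEL_IGPU = r"PCI\VEN_8086&DEV_5912\3&11583659&0&10"    # placeholder (Intel HD 630)
NVIDIA_GPUS = [
    r"PCI\VEN_10DE&DEV_1B81\4&38AB2860&0&0008",          # placeholder (GTX 1070 #1)
    r"PCI\VEN_10DE&DEV_1B81\4&38AB2860&0&0010",          # placeholder (GTX 1070 #2)
]

def toggle(instance_id, enable):
    """Enable/disable a device via pnputil; no-op (False) where pnputil is missing."""
    if shutil.which("pnputil") is None:      # not on Windows, or pnputil too old
        return False
    action = "/enable-device" if enable else "/disable-device"
    subprocess.run(["pnputil", action, instance_id], check=True)
    return True

def rebind_gpus():
    toggle(INTEL_IGPU, enable=False)         # keep the integrated GPU off
    for gpu in NVIDIA_GPUS:                  # disable the discrete cards...
        toggle(gpu, enable=False)
    for gpu in reversed(NVIDIA_GPUS):        # ...and bring them back in a different order
        toggle(gpu, enable=True)

rebind_gpus()
```

A script like this could run at logon via Task Scheduler so the toggle happens automatically after each reboot.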

OS: Windows 11 (x64) 22H2 build 22621.2283 (updated September 2023) - WSL Version 2
Motherboard: Colorful Technology And Development Co.,LTD C.B250A-BTC PLUS
Chipset: Intel B250 (Kaby Lake)
CPU: Intel Core i5-7500 3.4 GHz, 4 cores, Kaby Lake-S, Socket H4 (LGA1151), virtualization enabled
RAM: DDR4 SODIMM 4GB 2400 MHz
GPU:
1 x Intel HD Graphics 630 (Kaby Lake-S GT2) [Intel] PCIe v2.0 x0 (5.0 GT/s)
4 x NVIDIA GTX 1070 8GB, driver 522.06, CUDA 11.8:

1 x NVIDIA GTX 1070 8GB PCIe v3.0 x16 (8.0 GT/s) @ x16 (2.5 GT/s)
3 x NVIDIA GTX 1070 8GB PCIe v3.0 x16 (8.0 GT/s) @ x1 (2.5 GT/s)

Steps:
-Install fresh Windows 11 22H2
-Set the Windows swap file to 40 GB
-Install CUDA Toolkit 11.8, reboot
-Allow access to GPU performance counters for all users in the Developer section of the NVIDIA Control Panel, reboot
-Check that nvidia-smi and nvcc work
-Install WSL 2 with Ubuntu, reboot
-Add a .wslconfig:
[wsl2]
memory=2GB
swap=40GB
-Install Docker Desktop
-Run: docker run -it --rm -p 8888:8888 --gpus all tensorflow/tensorflow:latest-gpu-jupyter
-Try:
import tensorflow as tf
tf.config.list_physical_devices()
-Get the errors

2023-09-21 17:26:00.060803: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

2023-09-21 17:26:11.186982: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
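A driver-level sanity check that doesn't go through TensorFlow at all can help narrow this down: when cuInit fails like this, it's worth confirming whether nvidia-smi still sees all four cards inside the container. A small sketch (it simply returns None where the NVIDIA tools aren't reachable):

```python
import shutil
import subprocess

def visible_nvidia_gpus():
    """Return the GPUs nvidia-smi reports, or None if the driver isn't reachable."""
    if shutil.which("nvidia-smi") is None:   # tools not installed / not on PATH
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:               # driver present but initialization failed
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

print(visible_nvidia_gpus())                 # expect four entries on a healthy setup
```

If nvidia-smi lists all four GTX 1070s while cuInit still fails, the problem sits between the driver and the CUDA runtime rather than in TensorFlow itself.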

Today I tried disabling the Intel HD and 3 of the 4 GTX 1070s in Device Manager, then enabled the GTX 1070s back, and got things working!

2023-09-21 18:29:33.655221: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-21 18:29:40.812272: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.812646: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.812926: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:03:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.813276: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.959611: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.959873: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.960084: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:03:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.960289: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.960490: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.960690: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:02:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.960888: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:03:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-09-21 18:29:40.961097: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'),
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'),
PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
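As an aside - this doesn't address the cuInit failure itself, but once the four cards are visible, enabling memory growth keeps TensorFlow from pre-allocating nearly all of each card's 8 GB at startup, which avoids a different class of CUDA_ERROR_OUT_OF_MEMORY later in a session. A sketch (guarded so it's a no-op where TensorFlow isn't installed):

```python
# Enable on-demand GPU memory allocation instead of TensorFlow's default
# near-total pre-allocation. Must run before any GPU has been initialized.
try:
    import tensorflow as tf
except ImportError:                # TensorFlow not installed in this environment
    tf = None

def enable_memory_growth():
    """Turn on memory growth for every visible GPU; returns the GPU count."""
    if tf is None:
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

print(enable_memory_growth())
```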

#gpu #tensorflow #install

I've tried rebooting with the Intel HD disabled - the OUT OF MEMORY error comes back. The working sequence is to boot Windows with all cards enabled, then disable the Intel HD, then disable 3 of the 4 GTX 1070s and re-enable those 3 one by one in a different order. Only after these manipulations does TensorFlow work and see my GPUs. This doesn't seem normal to me. Any idea what to do? I have to repeat this after every reboot.