TensorFlow running slower after CUDA installation

Hello everyone,

I recently struggled to install TensorFlow with CUDA so that I could run it locally. I finally managed to install it on Windows, but the strange thing is that running my algorithm on the GPU is now slower than running it on the CPU was before I installed CUDA. Right now the GPU is about twice as fast as the CPU, but before the CUDA installation the CPU was faster.

I’m using TensorFlow 2.5.0 with CUDA Toolkit 11.4. My computer is a Gigabyte Aero 15 KB with an RTX 2060.

Any suggestions?
Thank you.

Can you check whether the GPU is visible?

https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices

https://www.tensorflow.org/api_docs/python/tf/config/list_logical_devices


Sure, it detects the device. This is the output:

2021-09-13 17:28:28.989743: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-09-13 17:28:29.026491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-09-13 17:28:29.026838: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-09-13 17:28:29.034672: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-09-13 17:28:29.034878: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-09-13 17:28:29.040835: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-09-13 17:28:29.041969: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-09-13 17:28:29.045283: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-09-13 17:28:29.048467: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-09-13 17:28:29.049073: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-09-13 17:28:29.049289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-13 17:28:29.049802: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 17:28:29.051009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-09-13 17:28:29.051570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-13 17:28:29.555876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 17:28:29.556016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-09-13 17:28:29.556130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-09-13 17:28:29.556460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3961 MB memory) → physical GPU (device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)

The strange thing I mentioned is that CPU processing got slower after I installed CUDA. To avoid using the GPU, I set:

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

With that setting, processing is slower than it was before I installed CUDA.
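For reference, a minimal sketch of hiding the GPU this way. The detail that matters is ordering: the variable has to be set before TensorFlow initializes its GPU devices, so setting it before the import is the safe choice. The helper name visible_gpus is only for illustration and is not part of TensorFlow.

```python
import os

# Hide every CUDA device from this process. This has to happen before
# TensorFlow initializes its GPU devices, so set it before the import.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


def visible_gpus():
    """Return the GPUs TensorFlow can see (hypothetical helper)."""
    import tensorflow as tf  # imported only after the variable is set

    return tf.config.list_physical_devices("GPU")
```

Calling visible_gpus() should then return an empty list. If it still lists the RTX 2060, the variable was probably set too late in the script.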

Using the commands you suggested:

tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

tf.config.list_logical_devices()

[LogicalDevice(name='/device:CPU:0', device_type='CPU'), LogicalDevice(name='/device:GPU:0', device_type='GPU')]

OK, the GPU is visible.

Can you double check the device placement with:

https://www.tensorflow.org/api_docs/python/tf/debugging/set_log_device_placement

It would also be good to check whether you have the same issue with TF 2.6.0.


This is the result for v2.6:

import tensorflow as tf
print(tf.__version__)
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

2.6.0
2021-09-13 18:19:57.393346: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-13 18:19:57.393826: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
2021-09-13 18:19:57.394068: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

This is the result for v2.5:

tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

2021-09-13 18:06:54.092293: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-09-13 18:06:54.625367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

Can you set CUDA_VISIBLE_DEVICES="" with 2.6 and check whether it is slower with your example?
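A sketch of one way to make that comparison independent of the training script: time the same small computation pinned to each device. The helper names, tensor shapes, and repeat counts are illustrative assumptions and are not taken from the model in this thread.

```python
import time


def mean_seconds(fn, repeats=50, warmup=5):
    """Average wall-clock seconds per call, after untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_devices(batch=32, features=64):
    """Time one small matmul on the CPU and, if present, on the first GPU."""
    import tensorflow as tf  # lazy import: the timer itself needs no TensorFlow

    devices = ["/CPU:0"]
    if tf.config.list_physical_devices("GPU"):
        devices.append("/GPU:0")

    results = {}
    for device in devices:
        with tf.device(device):
            x = tf.random.normal((batch, features))
            w = tf.random.normal((features, features))
            # .numpy() copies the result back to the host, so the timing
            # includes any asynchronous GPU work and the transfer.
            results[device] = mean_seconds(lambda: tf.matmul(x, w).numpy())
    return results
```

Running print(compare_devices()) once with the GPU visible and once with CUDA_VISIBLE_DEVICES="" gives per-device numbers that do not depend on the rest of the training code.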

Just checked in v2.6

with CPU:

Epoch 2/25
352/352 [==============================] - 0s 1ms/step - loss: 0.3787 - accuracy: 0.8329 - val_loss: 0.4108 - val_accuracy: 0.8136

with GPU:

Epoch 2/25
352/352 [==============================] - 2s 5ms/step - loss: 0.3806 - accuracy: 0.8317 - val_loss: 0.4102 - val_accuracy: 0.8152

Well at least it’s faster with the CPU, not like in v2.5 :joy:

Any idea of what’s happening?

I’m using v2.5 because I’m preparing for the TensorFlow certification exam, and that is the version they require.

Can you run the same test with the tensorflow-cpu package from PyPI?
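Before swapping packages, it can help to see which TensorFlow distributions are installed in the active environment, since a CPU-only and a GPU-enabled build can end up side by side. Below is a sketch using only the standard library (importlib.metadata, Python 3.8 or newer). The function name and the list of package names checked are assumptions for illustration.

```python
from importlib import metadata


def installed_tf_packages(names=("tensorflow", "tensorflow-cpu", "tensorflow-gpu")):
    """Map each installed distribution in `names` to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this variant is not installed in the environment
    return found


print(installed_tf_packages())
```

An empty dictionary means none of the listed distributions is installed in the interpreter that ran the check, which is itself a useful hint that a different environment is active.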

You can also try the TF 2.5.0 wheels built specifically for AMD or Intel at:

I get the same result with the CPU-only build of TF 2.6:

Epoch 2/25
352/352 [==============================] - 0s 1ms/step - loss: 0.3810 - accuracy: 0.8307 - val_loss: 0.4221 - val_accuracy: 0.7988

And with the CPU-only build of TF 2.5, it is no longer slower:

Epoch 2/25
352/352 [==============================] - 1s 2ms/step - loss: 0.3788 - accuracy: 0.8329 - val_loss: 0.4107 - val_accuracy: 0.8124

Thank you very much!! :relaxed:
