I know there are a thousand posts about this sort of thing, and I've spent days trying to get this to work.
Somehow I managed to run the benchmark on the CIFAR-10 model about a week ago, but I have had no luck since: the GPU was detected, but the model could not be loaded into memory.
I originally had CUDA 12.0 installed, and that's what worked. I have since tried to downgrade to 11.2, unsuccessfully; now the GPU is not detected at all in TensorFlow 2.11.
I have tried everything I can think of, and everything I could find online.
I am at my wits' end after the hours poured into this.
Any help would be greatly appreciated.
Welcome to the TensorFlow Forum!
Could you please share details of your operating system and the steps you took to install TensorFlow?
The system is an i7-7700K with 24 GB RAM and an RTX 3090, running tensorflow-gpu 2.10 in Anaconda.
I have reverted to CUDA 12.0, and my GPU is now detected.
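(For anyone hitting the same detection problem: a quick, standard way to confirm whether TensorFlow sees the GPU is to list the physical devices. This is a minimal check, assuming TensorFlow is installed.)

```python
import tensorflow as tf

# An empty list here means TensorFlow did not detect any GPU.
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
```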
However, when attempting to run a training example on the CIFAR-10 model, I run out of memory (hard drive, not RAM; my RAM appears untouched).
The GPU is loaded with approximately 18 GB of the 24 GB available.
I have approximately 7 GB free on my disk; it gets fully consumed, and training cannot proceed.
Is this simply a case of not having enough free disk space, or is there something else I should check?
You can try limiting GPU memory usage. Currently it can be handled in two ways:

1. Turn on memory growth by calling [tf.config.experimental.set_memory_growth](https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth). Instead of reserving nearly all GPU memory up front, this allocates more memory only as the process demands it.
2. Set a hard limit on the total memory TensorFlow is allowed to allocate on the GPU.
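A minimal sketch of both options (assumes TensorFlow 2.x; the 8 GB cap is just an illustrative value, and configuration must happen before the GPU is first used):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: allocate GPU memory on demand rather than all at once.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (alternative to option 1): cap TensorFlow at a fixed amount
    # of GPU memory, e.g. 8192 MB on the first GPU.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])
```

Note that these calls raise a RuntimeError if the GPU has already been initialized, so place them at the very top of your script.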