CNN causes kernel to die

Umut_Koksoy · June 21, 2023, 4:01am

I have Windows 11 installed and am using TensorFlow 2.12 on WSL.
CUDA version 11.8 is installed.
I can train deep neural networks successfully on my GPU. However, when I try to train a CNN, the system crashes. For example, I can train on the MNIST dataset with the code below:

import tensorflow as tf
from tensorflow import keras

mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images / 255.0
test_images = test_images / 255.0

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

----------------------------CNN IMPLEMENTATION------------------
However, the kernel dies when I run the code below:

import tensorflow as tf
from tensorflow import keras

mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape(-1, 28, 28, 1)
test_images = test_images.reshape(-1, 28, 28, 1)
train_images = train_images / 255.0
test_images = test_images / 255.0

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

chunduriv · June 21, 2023, 6:58am

@Umut_Koksoy,

Welcome to the Tensorflow Forum!

Thank you for taking the time to report the issue. We will check and update you.

Umut_Koksoy · June 21, 2023, 4:47pm

I am now getting a new error message before the kernel dies. It may help to find the problem.

---------------------------------------.----------------------------------------------------------------

2023-06-21 20:31:38.325419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3893 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
2023-06-21 20:31:38.529059: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 188160000 exceeds 10% of free system memory.

Epoch 1/5

2023-06-21 20:31:38.672844: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 188160000 exceeds 10% of free system memory.

---------------------------------------.----------------------------------------------------------------
Training on the MNIST dataset should not consume much memory.
I am using TF 2.12, and I ran the code below before training:

devices = tf.config.list_physical_devices('GPU')
device = devices[0]
tf.config.experimental.set_memory_growth(device, True)
is_memory_growth_enabled = tf.config.experimental.get_memory_growth(device)
print(f'Memory growth for {device} is: {is_memory_growth_enabled}')

Output -> Memory growth for PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') is: True
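Editor's note: the 188160000-byte figure in the warning above can be reproduced with simple arithmetic. It matches the size of the 60,000 MNIST training images stored as 32-bit floats. The log tag is W (warning) and it comes from the CPU allocator, so it most likely describes one host-side copy of the dataset rather than the cause of the crash. A minimal sketch of that calculation (the helper name array_bytes is illustrative only):

```python
# MNIST training set: 60,000 grayscale images of 28x28 pixels, one channel.
NUM_IMAGES = 60_000
HEIGHT, WIDTH, CHANNELS = 28, 28, 1


def array_bytes(num_images, bytes_per_value):
    """Size in bytes of a dense image array with the given element width."""
    return num_images * HEIGHT * WIDTH * CHANNELS * bytes_per_value


uint8_bytes = array_bytes(NUM_IMAGES, 1)    # raw pixels as loaded
float32_bytes = array_bytes(NUM_IMAGES, 4)  # 32-bit float copy
float64_bytes = array_bytes(NUM_IMAGES, 8)  # result of dividing uint8 by 255.0 in NumPy

print("uint8  :", uint8_bytes)
print("float32:", float32_bytes)
print("float64:", float64_bytes)
```

The float32 size equals the number reported in the log. Note that dividing the uint8 array by 255.0 produces a float64 array in NumPy, which is twice as large again on the host side.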

chunduriv · June 22, 2023, 12:49pm

@Umut_Koksoy,

Could you please add the following lines at the beginning of the program and let us know the result?

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

Thank you!
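Editor's note: environment variable names are case-sensitive on Linux and WSL, and TensorFlow refers to this setting as TF_GPU_ALLOCATOR in upper case. It also has to be in place before TensorFlow initializes the GPU; setting it before the import is the simplest way to ensure that. The sketch below is one way to check the ordering, assuming a hypothetical helper named request_async_gpu_allocator:

```python
import os
import sys


def request_async_gpu_allocator():
    """Set TF_GPU_ALLOCATOR and report whether TensorFlow was still unimported.

    Returns True when the variable was set before any TensorFlow import,
    False when TensorFlow was already loaded (the setting may then be ignored).
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return not already_imported


set_in_time = request_async_gpu_allocator()
if not set_in_time:
    print("TensorFlow was already imported; restart the kernel and set the variable first.")
print("TF_GPU_ALLOCATOR =", os.environ["TF_GPU_ALLOCATOR"])
```

In a notebook, this belongs in the first cell after a kernel restart, before any cell that imports TensorFlow.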

chunduriv · June 23, 2023, 6:40am

@Umut_Koksoy,

While reproducing the issue, I also observed the same behaviour with TensorFlow 2.12.

Thank you!

Umut_Koksoy · June 23, 2023, 7:18am

More information about cuda and cudnn versions installed…

Umut_Koksoy · June 28, 2023, 3:00pm

Hi @chunduriv ,

I am not familiar with the TensorFlow forum. Does the TensorFlow team pick up issues reported on this platform and fix them, or do we need to report them somewhere else?

In this case, the code works properly on Google Colab. The TensorFlow and CUDA versions installed on my computer are the same as the versions on Google Colab.

I think something is missing from the TensorFlow installation guide for WSL. I don't think there is a problem with TF version 2.12 itself.
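Editor's note: when comparing a local install with Colab, it can help to print the same report in both places. The sketch below uses tf.sysconfig.get_build_info(), which reports the CUDA and cuDNN versions the TensorFlow wheel was built against (not necessarily the versions installed on the machine). The function name tensorflow_build_report is illustrative, and it returns None when TensorFlow is not installed.

```python
import importlib.util


def tensorflow_build_report():
    """Return TensorFlow version, build-time CUDA/cuDNN versions, and visible GPUs."""
    if importlib.util.find_spec("tensorflow") is None:
        return None  # TensorFlow is not installed in this environment

    import tensorflow as tf

    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # These keys are absent on CPU-only builds, hence .get().
        "cuda": build.get("cuda_version"),
        "cudnn": build.get("cudnn_version"),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


report = tensorflow_build_report()
print(report)
```

If the two reports match and only the WSL run crashes, that points toward the driver or library setup under WSL rather than the model code.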

chunduriv · June 30, 2023, 5:55am

@Umut_Koksoy,

Yes, we have issues with WSL2 only. There is no action required on your end. We reported it to the concerned team and are awaiting the fix.

Thanks!

Umut_Koksoy · July 24, 2023, 2:44pm

Hi @chunduriv

After the new release of tensorflow 2.13.0, I tried it on WSL 2. The problem is still not fixed.

Abhijit_Mustafi · August 28, 2023, 3:44am

I have exactly the same issue, and it is not resolved as of today. I would like to understand why jupyter lab --debug does not show any error reports when the kernel dies.
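Editor's note: a crash inside native code can end the kernel process before Jupyter has anything to report. One way to get more information is to run the same code outside the notebook in a child Python process and inspect its exit code and stderr. The sketch below is only an illustration; run_isolated is a hypothetical helper, and the child program is a placeholder for the code that calls model.fit().

```python
import subprocess
import sys


def run_isolated(source):
    """Run Python source in a child interpreter; return (exit code, stderr text)."""
    result = subprocess.run(
        # -X faulthandler asks Python to dump a traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", source],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Placeholder child program; replace it with the training code under test.
exit_code, stderr_text = run_isolated("import sys; print('child ran', file=sys.stderr)")
print("exit code:", exit_code)
print(stderr_text)
```

On Linux and WSL, a negative exit code means the child was terminated by a signal, for example -11 for a segmentation fault or -9 when the process was killed.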

Nikita_Krotenko · September 6, 2023, 8:51am

Hi @Umut_Koksoy

Did you manage to find any workaround to resolve this issue?

Umut_Koksoy · September 6, 2023, 9:15am

Hi @Nikita_Krotenko
I couldn't find any solution on WSL2, so I installed Linux.
So currently you cannot run any CNN with a TensorFlow version above 2.10 on a Windows machine.

Abhijit_Mustafi · September 9, 2023, 8:17am

It is very disconcerting that this issue is not getting enough traction on this forum. It is very limiting when you withdraw support for native Windows and leave users with a buggy implementation. I guess we need to wait this one out.

zbrusher · September 25, 2023, 4:07am

Same here, very disappointed. I have the same issue. I thought my laptop's GPU was damaged, so I got a desktop, and the kernel keeps dying there too. I installed it as noted on your page:

conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

While the install works, the kernel keeps dying when I load a model and run a simple predict. Is there any solution for this?

Kiran_Sai_Ramineni · January 24, 2024, 11:11am

Hi @Umut_Koksoy, @Abhijit_Mustafi, I trained a model with CNN layers in a Jupyter notebook using TensorFlow 2.15 with GPU on WSL2 and did not face any error.

Could you please try the latest version of TensorFlow and let us know whether the issue is resolved?
Thank you.

Umut_Koksoy · January 24, 2024, 6:27pm

Hi @Kiran_Sai_Ramineni, thank you. The problem is resolved with TensorFlow 2.15.

Abhijit_Mustafi · March 4, 2024, 5:32pm

Unfortunately, with TensorFlow 2.15 the notebook becomes completely unresponsive when I run the cell that fits a model containing a CNN. Dense models work just fine, and my GPU is detected as well. Still very frustrating.

Abhijit_Mustafi · March 5, 2024, 6:25pm

Hi @Kiran_Sai_Ramineni @Umut_Koksoy, I really need your help on this, and sorry for the long post. I am really stuck and have some deadlines approaching.
This is what is detected:

As you can see, I have a straightforward model, and the summary also runs:

But as soon as I run the fit command, the notebook does nothing and the console shows no error messages at all.

As I said previously, Dense models run just fine.

Please help me with this.

Kiran_Sai_Ramineni · March 7, 2024, 6:22am

Hi @Abhijit_Mustafi, I ran the CNN model with TensorFlow 2.15.0 with GPU in a Jupyter notebook and did not face any error.

Could you please create a new environment, install TensorFlow and CUDA using pip install tensorflow[and-cuda], and try running the CNN model again? Thank you.

Abhijit_Mustafi · March 8, 2024, 6:46am

Interesting observation. I created a fresh environment as per your advice. Simple CNN models now run (two layers with 32 filters each, followed by Flatten). But increasing the number of layers kills the kernel, and again there are no error messages on the console.