Cannot run on Nvidia GPU

FuJa0815 · January 24, 2022, 3:57am

Hello!

I’m pretty new to TensorFlow and I am trying to classify German words by their grammatical gender.
My problem is that TensorFlow crashes with an error that I can’t find online.

My Python code

import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras import layers

batch_size = 1

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "D:/artikelguesser/train",
    labels='inferred',
    class_names=["feminine","masculine","neuter"],
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=25565
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "D:/artikelguesser/train",
    labels='inferred',
    class_names=["feminine","masculine","neuter"],
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=25565
)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "D:/artikelguesser/test", batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

max_features = 32906
embedding_dim = 128
sequence_length = 20

vectorize_layer = TextVectorization(
    standardize=None,
    max_tokens=max_features,
    output_mode="int",
    split=None,
    output_sequence_length=sequence_length,
)

text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)
print("Vectorized!")

train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
print("Prefetched!")

inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

predictions = layers.Dense(3, activation="sigmoid", name="predictions")(x)

model = tf.keras.Model(inputs, predictions)

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
print("Model compiled!")

model.fit(train_ds, validation_data=val_ds, epochs=3)

print("DONE!")

Output

Found 41132 files belonging to 3 classes.
Using 32906 files for training.
2022-01-23 21:30:44.635916: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-23 21:30:45.176399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4639 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 980 Ti, pci bus id: 0000:03:00.0, compute capability: 5.2
Found 41132 files belonging to 3 classes.
Using 8226 files for validation.
Found 4569 files belonging to 3 classes.
Number of batches in raw_train_ds: 32906
Number of batches in raw_val_ds: 8226
Number of batches in raw_test_ds: 4569
Vectorized!
Prefetched!
Model compiled!
Epoch 1/3
2022-01-23 21:31:29.712176: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
2022-01-23 21:31:30.866313: F tensorflow/stream_executor/cuda/cuda_dnn.cc:570] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 1 feature_map_count: 128 spatial: 1 0  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
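
One possible reading of the failed check, offered as a sketch and not a confirmed diagnosis: the batch_descriptor reports `spatial: 1 0`, which means a tensor with a zero-sized dimension. With `sequence_length = 20` and two `Conv1D` layers using kernel size 7, stride 3 and `padding="valid"`, the usual output-length formula gives 5 steps after the first layer and 0 after the second. The helper name `conv1d_output_length` is hypothetical and only reproduces that arithmetic in plain Python (dilation 1 assumed).

```python
import math

def conv1d_output_length(input_length, kernel_size, stride, padding="valid"):
    """Number of time steps a 1-D convolution produces (dilation 1 assumed)."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return max(0, (input_length - kernel_size) // stride + 1)
    if padding == "same":
        # Zero padding keeps ceil(input_length / stride) steps.
        return math.ceil(input_length / stride)
    raise ValueError(f"unsupported padding: {padding!r}")

sequence_length = 20  # value from the posted script

after_first = conv1d_output_length(sequence_length, kernel_size=7, stride=3)
after_second = conv1d_output_length(after_first, kernel_size=7, stride=3)
print(after_first, after_second)  # prints 5 0
```

If this is indeed the cause, a longer `output_sequence_length`, a smaller kernel or stride, or `padding="same"` (which would give 7 and then 3 steps here) would keep the second layer's output non-empty.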

Version stuff

GPU: GTX 980 TI
GPU driver: 511.23
TF: 2.7.0
OS: Windows 10.0.19043
CUDA: 11.2
cuDNN: 8.1.0.77
Python: 3.9.6

I’ve noticed that there are two folders in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib: Win32 and x64. I’m running 64-bit Windows. Is this normal?

Bhack · January 24, 2022, 12:33pm

What is your TF version?

FuJa0815 · January 24, 2022, 1:37pm

See the original question.

FuJa0815 · February 1, 2022, 4:32pm

Am I the only one having this problem? Is there something I can try? I have no idea what that error is supposed to mean.

Bhack · February 1, 2022, 4:48pm

I think you could try sharing the cuDNN log, as mentioned in the GitHub issue "Crash when using tf.nn.local_response_normalization across multiple GPUs" (Issue #48057, tensorflow/tensorflow).
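
For reference, cuDNN's API logging is controlled through environment variables. The sketch below assumes the variable names from NVIDIA's cuDNN 8 documentation (`CUDNN_LOGINFO_DBG`, `CUDNN_LOGWARN_DBG`, `CUDNN_LOGERR_DBG`, `CUDNN_LOGDEST_DBG`) and a placeholder log file name, `cudnn_log.txt`. The variables need to be in place before cuDNN is loaded, so the safest spot is the top of the script, before `import tensorflow`.

```python
import os

# Assumed cuDNN logging variables; the file name is a placeholder.
cudnn_logging = {
    "CUDNN_LOGINFO_DBG": "1",   # API call trace (verbose)
    "CUDNN_LOGWARN_DBG": "1",   # warnings
    "CUDNN_LOGERR_DBG": "1",    # errors
    "CUDNN_LOGDEST_DBG": "cudnn_log.txt",
}
os.environ.update(cudnn_logging)

# import tensorflow as tf  # import only after the variables are set
```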

emrullah_polat · February 2, 2022, 2:09pm

Maybe you can check for the GPU in code.

from tensorflow.python.client import device_lib
import tensorflow as tf

def get():
    local_devices = device_lib.list_local_devices()
    for x in local_devices:
        if x.device_type == "GPU":
            print(x.name)
            
get()

print("You are using TensorFlow version", tf.__version__)
if len(tf.config.list_physical_devices('GPU')) > 0:
    print("You have a GPU enabled.")
else:
    print("Enable a GPU before running this notebook.")

FuJa0815 · February 2, 2022, 2:40pm

Enabling both error and warning logging produces an empty file; enabling debug logging produces the following log, posted on Pastebin.com under the title "I! CuDNN (v8100) function cudnnCreate() called:i! handle: location=host;".

FuJa0815 · February 2, 2022, 2:41pm

2022-02-02 15:41:20.487719: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-02 15:41:21.016548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 4639 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 980 Ti, pci bus id: 0000:03:00.0, compute capability: 5.2
/device:GPU:0
You are using TensorFlow version 2.8.0
You have a GPU enabled.

FuJa0815 · February 2, 2022, 6:29pm

UPDATE:

The error has nothing to do with CUDA.
I’ve added os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to force computation on my CPU, but the code still crashes on the model.fit line. It exits with error code -1073741819, which is apparently an access violation. I’ve already tried running the code as administrator.
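
The exit code is easier to recognize in hexadecimal. As a small sketch (the helper name `to_ntstatus_hex` is made up for illustration), reinterpreting the signed 32-bit value as unsigned gives 0xC0000005, the Windows status code for an access violation (`STATUS_ACCESS_VIOLATION`).

```python
def to_ntstatus_hex(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"

print(to_ntstatus_hex(-1073741819))  # prints 0xC0000005
```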

Edit:
I’ve stepped through the code with a debugger, and it crashes at the following call stack:

quick_execute, execute.py:54
call, function.py:499
_call_flat, function.py:1853
__call__, function.py:2956
_call, def_function.py:980
__call__, def_function.py:915
error_handler, traceback_utils.py:150
fit, training.py:1384
error_handler, traceback_utils.py:64
<module>, main.py:89

pywrap_tfe.TFE_Py_Execute is getting executed with the following parameters:

ctx._handle - <capsule object NULL at 0x0000014B9974CC60>
device_name - ''
op_name - '__inference_train_function_67184'
inputs - [<tf.Tensor: shape=(), dtype=resource, value=<Resource Tensor>>, <tf.Tensor: shape=(), dtype=variant, value=<[empty]>>, <tf.Tensor: shape=(), dtype=resource, value=<Resource Tensor>>, <tf.Tensor: shape=(), dtype=resource, value=<Resource Tensor>>, ...]
attrs - ('executor_type', '', 'config_proto', b'\n\x07\n\x03CPU\x10\x01\n\x07\n\x03GPU\x10\x002\x02J\x008\x01\x82\x01\x00')
num_outputs - 2

Bhack · February 2, 2022, 11:06pm

Just to check whether this is related only to the Windows environment, can you try running the same code in our prepared Docker container on the same machine: