Error occurred when finalizing GeneratorDataset iterator

Hi again :slight_smile:

Thanks to all of your help, I was able to build a Faster R-CNN model.
Everything works except the training step, where I have hit a wall.
I debugged the functions and found a suspicious part, but I can't figure out the root cause.

First, the versions are:

tensorflow-gpu==2.5.0
CUDA==11.2.0
cuDNN==8.1.0.77

The whole stack trace is:

2021-06-17 16:46:58.163220: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.565562: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-06-17 16:47:04.610494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.618787: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-06-17 16:47:04.634109: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:04.638019: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-06-17 16:47:04.647980: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-06-17 16:47:04.655237: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-06-17 16:47:04.670023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-06-17 16:47:04.679152: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-06-17 16:47:04.685911: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:04.690109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:04.693839: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-17 16:47:04.703798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2021-06-17 16:47:04.711944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-06-17 16:47:05.263458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-17 16:47:05.268143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-06-17 16:47:05.270931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-06-17 16:47:05.273788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3983 MB memory) → physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py:3703: UserWarning: Even though the tf.config.experimental_run_functions_eagerly option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use tf.data.experimental.enable.debug_mode().
warnings.warn(
WARNING:tensorflow:input_shape is undefined or non-square, or rows is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
2021-06-17 16:47:06.528692: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-06-17 16:47:07.040698: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-17 16:47:07.690211: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-06-17 16:47:08.195097: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
WARNING:tensorflow:From D:\dev\anaconda3\envs\dl_env\lib\site-packages\tensorflow\python\ops\array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The validate_indices argument has no effect. Indices are always validated on CPU and never validated on GPU.
2021-06-17 16:47:09.868833: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-06-17 16:47:09.872503: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-06-17 16:47:09.875622: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1611] Profiler found 1 GPUs
2021-06-17 16:47:09.886018: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library ‘cupti64_112.dll’; dlerror: cupti64_112.dll not found
2021-06-17 16:47:09.898400: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library ‘cupti.dll’; dlerror: cupti.dll not found
2021-06-17 16:47:09.902934: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1661] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.910278: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session tear down.
2021-06-17 16:47:09.914039: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1752] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
2021-06-17 16:47:09.953551: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/100
214/214 [==============================] - 60s 275ms/step - loss: 7.0047 - rpn_reg_loss: 0.0160 - rpn_cls_loss: 0.1079 - frcnn_reg_loss: 6.5493 - frcnn_cls_loss: 0.3315 - val_loss: 5.5020 - val_rpn_reg_loss: 0.0161 - val_rpn_cls_loss: 0.1088 - val_frcnn_reg_loss: 5.2288 - val_frcnn_cls_loss: 0.1484

And the error message is:

2021-06-17 16:48:09.849872: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

The suspicious code is:

@tf.function
def rpn_generator(dataset, anchors):
    while True:
        for data in dataset:
            image, gt_boxes, gt_labels = data
            bbox_deltas, bbox_labels = calculate_rpn_actual_outputs(anchors, gt_boxes, gt_labels)
            yield image, (bbox_deltas, bbox_labels)
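For comparison, here is a minimal sketch of the same endless-generator pattern in plain Python, without the @tf.function decorator. It only shows how the loop cycles over the dataset forever and what each yielded item looks like. The body of calculate_rpn_actual_outputs below is a hypothetical stand-in (the real function is not shown in this post), and the toy dataset and anchors are made-up values.

```python
from itertools import islice


def calculate_rpn_actual_outputs(anchors, gt_boxes, gt_labels):
    # Hypothetical stand-in for the real function, which is not shown here.
    # It returns one placeholder delta and one placeholder label per anchor.
    bbox_deltas = [0.0 for _ in anchors]
    bbox_labels = [-1 for _ in anchors]
    return bbox_deltas, bbox_labels


def rpn_generator(dataset, anchors):
    # Plain Python generator: loops over the dataset endlessly,
    # so the consumer decides how many items to draw.
    while True:
        for image, gt_boxes, gt_labels in dataset:
            bbox_deltas, bbox_labels = calculate_rpn_actual_outputs(
                anchors, gt_boxes, gt_labels
            )
            yield image, (bbox_deltas, bbox_labels)


# Toy data (assumed values, only for illustration).
dataset = [
    ("img0", [[0, 0, 1, 1]], [1]),
    ("img1", [[0, 0, 2, 2]], [2]),
]
anchors = [[0, 0, 1, 1], [0, 0, 2, 2], [0, 0, 3, 3]]

# Draw five items; the generator wraps around the two-element dataset.
batches = list(islice(rpn_generator(dataset, anchors), 5))
```

Because the generator never ends on its own, whoever consumes it has to say how many items make up one epoch.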

Lastly, I based my code on https://github.com/FurkanOM/tf-faster-rcnn
That code seems to have the same problem, because I still get the error when I run it.

I tried downgrading TensorFlow to 2.4.0, and the error still occurred.
The strange thing is that the epoch at which training stops is not consistent between runs.
For example, on the first run training stopped at epoch 2, and on the next run it stopped at epoch 13.

I thought the problem might be my GPU (GTX 1660 Ti) memory, but running the code takes only about 55% of GPU memory.

I found it.
I passed a wrong parameter to the tf.keras.callbacks.ReduceLROnPlateau callback in model.fit(), which is why training stopped when an epoch ended.
Thanks again to all :slight_smile:
But I still don't know why the referenced code caused the error… mysterious…
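
For anyone landing here with the same symptom, below is a minimal sketch of a ReduceLROnPlateau setup. The post does not say which argument was wrong, so every value here is an illustrative assumption and not the actual fix. The monitored name val_loss is taken from the metrics printed in the training log above.

```python
import tensorflow as tf

# Illustrative values only (assumptions, not the original configuration).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # must match a metric name that model.fit() logs
    factor=0.5,          # new_lr = lr * factor; must be below 1.0, else ValueError
    patience=5,          # epochs without improvement before reducing the LR
    min_lr=1e-6,         # lower bound on the learning rate
)

# The callback is then passed to training, for example:
# model.fit(train_data, epochs=100, callbacks=[reduce_lr])
```

Checking each argument against the callback's documented signature is a quick way to rule out this kind of mistake.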


Where is the code? Can you give more detail?

I have the same problem. The difference is that my code reports this error as soon as it runs. How can I solve this? Can you give more detail?