TFLite Model Maker object_detector.create hangs

First, some details on my software and hardware:
TF 2.5.0
TF model maker 0.3.2
GPU RTX 3080 16GB
32GB of DRAM
Ubuntu 18.04 LTS
Nvidia Driver Version: 471.68 CUDA Version: 11.4

I have been trying to train a custom model using the "efficientdet_lite1" spec, but training always hangs at a random point partway through the
model = object_detector.create() call
and not always in the same epoch:
e.g.
109/176 [=================>…] - ETA: 46s - det_loss: 0.4381 - cls_loss: 0.2731…

I tried to debug by adding
tf.get_logger().setLevel('DEBUG')

but nothing is printed when the hang happens.

Any hints are welcome.

I have also tried TF 2.6.0, nightly builds, etc.

I think I may have found a fix.
After reading

I tried setting
CUDA_LAUNCH_BLOCKING=1

With that set, my training completed without any issues! I hope this helps anyone having the same problem.
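
For anyone who wants to try the same thing, here is a minimal sketch of one way to set the variable from inside the training script. It assumes the variable has to be in the environment before TensorFlow initializes CUDA, so it is set before any TensorFlow import. The commented-out imports only stand in for whatever your own script already does. Note that CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so training may run slower.

```python
import os

# Set this before importing TensorFlow so that it is already present
# in the environment when the CUDA runtime is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# The rest of the training script follows unchanged, for example:
# import tensorflow as tf
# from tflite_model_maker import object_detector
# model = object_detector.create(...)  # same arguments as before

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Exporting the variable in the shell before launching the script should have the same effect.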
