Unable to train on a custom dataset with SSD MobileNet V1 + TensorFlow 1.15

Hi,
I want to train SSD MobileNet V1 on a custom dataset using tensorflow-gpu 1.15. Training fails right after the summary is recorded at step 0, with the error below.

Relying on driver to perform ptx compilation. This message will be only logged once.
2022-02-04 10:42:44.152817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Saving checkpoint to path train/model.ckpt
I0204 10:43:59.338148 140653890541312 supervisor.py:1117] Saving checkpoint to path train/model.ckpt
INFO:tensorflow:Recording summary at step 0.
I0204 10:44:34.073986 140653915719424 supervisor.py:1050] Recording summary at step 0.
INFO:tensorflow:Error reported to Coordinator: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
[[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean (defined at /home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[FeatureExtractor/MobilenetV1/Conv2d_9_depthwise/BatchNorm/gamma/read/_1521]]
(1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
[[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean (defined at /home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean’:
File “train.py”, line 186, in <module>
tf.app.run()
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py”, line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File “/home/mlai/.local/lib/python3.7/site-packages/absl/app.py”, line 303, in run
_run_main(main, args)
File “/home/mlai/.local/lib/python3.7/site-packages/absl/app.py”, line 251, in _run_main
sys.exit(main(argv))
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py”, line 324, in new_func
return func(*args, **kwargs)
File “train.py”, line 182, in main
graph_hook_fn=graph_rewriter_fn)
File “/home/mlai/.local/lib/python3.7/site-packages/object_detection/legacy/trainer.py”, line 353, in train
model_var.op.name, model_var))
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/summary/summary.py”, line 179, in histogram
tag=tag, values=values, name=scope)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_logging_ops.py”, line 329, in histogram_summary
“HistogramSummary”, tag=tag, values=values, name=name)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/mlai/.local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
[[{{node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean}}]]
[[FeatureExtractor/MobilenetV1/Conv2d_9_depthwise/BatchNorm/gamma/read/_1521]]
(1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean
[[{{node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise/BatchNorm/moving_mean}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

How can I solve this issue?

Hi @abhishek_kumar1

Welcome to the TensorFlow Forum!

Could you please share a minimal reproducible example (the pipeline config and the steps used to generate the TFRecords) so that we can replicate the error and understand the issue? Thank you.
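
In the meantime, one check that may be worth running: "Nan in summary histogram" at step 0 is often reported alongside malformed bounding-box annotations (or a learning rate that is too high), though that is only a guess for this case, not a diagnosis. Below is a minimal sketch, assuming the boxes are available as normalized (xmin, ymin, xmax, ymax) tuples before they are written to the TFRecord. The function name find_bad_boxes and the sample values are hypothetical and are not part of the Object Detection API.

```python
import math


def find_bad_boxes(boxes):
    """Return (index, reason) pairs for boxes that look malformed.

    Hypothetical helper: each box is assumed to be a tuple
    (xmin, ymin, xmax, ymax) in normalized [0, 1] coordinates.
    """
    problems = []
    for i, box in enumerate(boxes):
        xmin, ymin, xmax, ymax = box
        if any(math.isnan(v) or math.isinf(v) for v in box):
            problems.append((i, "non-finite coordinate"))
        elif min(box) < 0.0 or max(box) > 1.0:
            problems.append((i, "coordinate outside [0, 1]"))
        elif xmax <= xmin or ymax <= ymin:
            problems.append((i, "zero or negative width/height"))
    return problems


# Made-up sample annotations, for illustration only.
sample_boxes = [
    (0.10, 0.20, 0.50, 0.60),          # valid
    (0.40, 0.20, 0.30, 0.60),          # xmax < xmin
    (0.10, 0.20, 1.20, 0.60),          # xmax outside [0, 1]
    (float("nan"), 0.20, 0.50, 0.60),  # NaN coordinate
]

for index, reason in find_bad_boxes(sample_boxes):
    print("box %d: %s" % (index, reason))
```

If every box passes, the annotations are less likely to be the cause, and the learning rate and batch size in the pipeline config would be the next things to look at.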