Unable to train Mask R-CNN - checkpoint version conflict

John_Phillip · June 12, 2023, 2:22am

Describe the bug
While using TensorFlow Object Detection API, I’m experiencing an issue with a pre-trained Mask R-CNN Inception ResNet V2 1024x1024 model. When attempting to fine-tune this model for my custom task, I receive an error regarding missing variables even though the specified checkpoint seems to contain the appropriate parameters for this model.

To Reproduce
Steps to reproduce the behavior:

1. Download the pre-trained Mask R-CNN Inception ResNet V2 1024x1024 model from the TensorFlow Model Zoo.
2. Set up a custom training pipeline configuration, specifying the path to the downloaded checkpoint in the fine_tune_checkpoint field.
3. Run the model training script (model_main_tf2.py).
4. The error appears indicating some variables from the checkpoint are not found in the model.

Traceback (most recent call last):
  File "/content/models/research/object_detection/model_main_tf2.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/models/research/object_detection/model_main_tf2.py", line 105, in main
    model_lib_v2.train_loop(
  File "/usr/local/lib/python3.10/dist-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/object_detection/model_lib_v2.py", line 398, in load_fine_tune_checkpoint
    raise ValueError('Checkpoint version should be V2')
ValueError: Checkpoint version should be V2
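
The last frame of the traceback is load_fine_tune_checkpoint raising 'Checkpoint version should be V2', which points at the checkpoint version setting rather than at the variables themselves. A quick check, sketched below, is whether train_config in the pipeline config sets fine_tune_checkpoint_version to V2. This is only a rough sketch: the helper name find_checkpoint_version and the sample config are hypothetical, the path is a placeholder, and it assumes (not confirmed in this thread) that an unset field does not default to V2.

```python
import re


def find_checkpoint_version(config_text):
    """Return the value set for fine_tune_checkpoint_version, or None if unset."""
    # Strip '#' comments so a commented-out setting is not counted.
    # (Simplification: assumes no '#' inside quoted strings.)
    stripped = "\n".join(line.split("#", 1)[0] for line in config_text.splitlines())
    match = re.search(r"\bfine_tune_checkpoint_version\s*:\s*(\w+)", stripped)
    return match.group(1) if match else None


# Hypothetical train_config fragment; the path is a placeholder.
sample_config = """
train_config {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/ckpt-0"
  fine_tune_checkpoint_type: "detection"
}
"""

version = find_checkpoint_version(sample_config)
if version != "V2":
    print(f"fine_tune_checkpoint_version is {version!r}; "
          "consider adding 'fine_tune_checkpoint_version: V2' to train_config")
```

To check a real pipeline file, read its text and pass it to the helper. A plain text search is used here instead of the protobuf parser so the check runs without the Object Detection API installed.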

Expected behavior
I expect the model training to begin by loading weights from the specified pre-trained model. The error seems to suggest a mismatch between the model architecture defined in my pipeline and the architecture of the pre-trained model. Still, my pipeline configuration appears to be correctly set up for the Mask R-CNN Inception ResNet V2 1024x1024 model.

Desktop (please complete the following information):

OS: macOS 13.4 (22F66)
Browser: Safari
Version: 16.5 (18615.2.9.11.4)

N.B.: I am using Google Colab Pro.

Additional context

Upon inspecting the checkpoint file with inspect_checkpoint.py, it does appear to contain all the expected variables for a Mask R-CNN Inception ResNet V2 1024x1024 model. I also confirmed that the downloaded files include ckpt-0.index, ckpt-0.data-00000-of-00001, and checkpoint. Yet, the issue persists. Any guidance or solutions to this problem would be greatly appreciated.
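
For anyone who wants to repeat the checkpoint inspection from a notebook cell, here is a minimal sketch using tf.train.list_variables, which returns (name, shape) pairs for a checkpoint prefix. TF2 object-based checkpoints carry an entry named _CHECKPOINTABLE_OBJECT_GRAPH, so its presence is a reasonable sign that the files themselves are in the TF2 format. The function names summarize_variables and inspect_checkpoint are hypothetical, and TensorFlow is imported lazily so the summary helper can be tried without it.

```python
OBJECT_GRAPH_KEY = "_CHECKPOINTABLE_OBJECT_GRAPH"


def summarize_variables(variables):
    """Summarize a list of (name, shape) pairs read from a checkpoint."""
    names = [name for name, _shape in variables]
    return {
        "num_variables": len(names),
        # True when the checkpoint stores a TF2 object graph entry.
        "object_based": OBJECT_GRAPH_KEY in names,
    }


def inspect_checkpoint(checkpoint_prefix):
    """List variables under a checkpoint prefix (e.g. ending in 'ckpt-0')."""
    import tensorflow as tf  # lazy import; only needed for real checkpoints

    return summarize_variables(tf.train.list_variables(checkpoint_prefix))
```

Calling inspect_checkpoint with the prefix of the downloaded files (the path up to ckpt-0, without the .index or .data suffix) returns the variable count and whether the object graph entry is present. If object_based comes back true, the checkpoint files are probably fine and the configuration is the more likely place to look.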

Laxma_Reddy_Patlolla · June 12, 2023, 9:03pm

Hi @John_Phillip ,

I can see that you are using a research model for your object detection training. Those models contain some deprecated code, and TensorFlow does not officially support research models.

I recommend using the official models in tensorflow/models for your use case. Please refer to the tutorial on instance segmentation with Model Garden, which uses official models, and let us know if you run into any errors.

Thank You.