When to use 'detection' or 'full' when training object detection models?

Let's say I train a pretrained network like ResNet and set the fine_tune_checkpoint_type attribute to detection in the pipeline.config file. As far as I understand, this means that we take the pretrained weights of the model except for the classification and box prediction heads. Furthermore, it means that we can define our own set of labels, and the model we want to create/train will learn new classification and box prediction heads for those labels.

Now, let's say I train this network for 25,000 steps and want to continue training later on without the model forgetting anything. Should I change the fine_tune_checkpoint_type in the pipeline.config to full in order to continue training (and, of course, load the correct checkpoint file), or should I leave it set to detection?
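
For reference, the relevant part of my train_config in pipeline.config looks roughly like this (the paths and values are just placeholders for my setup):

  train_config {
    batch_size: 8
    num_steps: 25000
    fine_tune_checkpoint: "pretrained/resnet50_checkpoint/ckpt-0"  # placeholder path to the downloaded pretrained model
    fine_tune_checkpoint_type: "detection"
    # optimizer, data augmentation, etc. omitted
  }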

This is based on the information found in train.proto:

  //   1. "classification": Restores only the classification backbone part of
  //        the feature extractor. This option is typically used when you want
  //        to train a detection model starting from a pre-trained image
  //        classification model, e.g. a ResNet model pre-trained on ImageNet.
  //   2. "detection": Restores the entire feature extractor. The only parts
  //        of the full detection model that are not restored are the box and
  //        class prediction heads. This option is typically used when you want
  //        to use a pre-trained detection model and train on a new dataset or
  //        task which requires different box and class prediction heads.
  //   3. "full": Restores the entire detection model, including the
  //        feature extractor, its classification backbone, and the prediction
  //        heads. This option should only be used when the pre-training and
  //        fine-tuning tasks are the same. Otherwise, the model's parameters
  //        may have incompatible shapes, which will cause errors when
  //        attempting to restore the checkpoint.

So, classification restores only the classification backbone part of the feature extractor. This means that the model will start from scratch on many parts of the network.

detection restores the whole feature extractor, but the prediction heads (the “end result”) are discarded, which means we can add our own classes and learn these classification and box prediction heads from scratch.

full restores everything, including the class and box prediction weights. However, this is fine as long as we do not add or remove any classes/labels.

Is this correct?

Thanks!

Hi @TensorOverflow ,

Yes, your understanding is quite accurate.

In your scenario, where you have trained for 25,000 steps and want to continue training without the model forgetting anything, you should generally stick to the same fine_tune_checkpoint_type that you used during the initial training.

  • If you used “detection” initially, continue with “detection” for subsequent training.
  • If you used “full,” continue with “full.”

Switching to “full” might lead to compatibility issues if you have introduced changes to the classes or labels, because the restored parameters must match the current model's shapes. Therefore, it's usually safer to stay consistent with the type you used initially, allowing the model to build on the knowledge gained during the initial training while adapting to new tasks.
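
As a rough sketch (the paths and step counts are placeholders, assuming your first run wrote its checkpoints to a directory such as training/), continuing the run with the same setting would look something like this:

  train_config {
    batch_size: 8
    num_steps: 50000  # total step count raised so training continues past the first 25,000 steps
    fine_tune_checkpoint: "training/ckpt-26"  # placeholder: latest checkpoint written by the first run
    fine_tune_checkpoint_type: "detection"    # same value as in the first run
    # remaining fields unchanged
  }

Keep the label map and num_classes unchanged as well, so the restored variables line up with the current model.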

Thanks.