How to stop and resume object detector training (Object Detection Model Maker)

Ugur_Ercelik · August 5, 2021, 8:14am

I am working on object detection with autonomous datasets. I want to train my model with 10,000 training images, 2,000 test images, and 2,000 validation images, so I will use the TensorFlow Lite Model Maker object detector.

Project link : Object Detection with TensorFlow Lite Model Maker

However, with the large dataset and a batch size of 32, training for 50 epochs takes 2 days (Step 3). I can’t keep my computer on for two days. I am running the project in a Jupyter notebook.

How can I stop model training and resume it later (e.g. stop at the 10th epoch and continue one day later)?

lgusm · August 5, 2021, 10:02am

Hi Ugur,

I don’t think you can do that. Model Maker, as of today, doesn’t have a stop and resume option.
You have a couple of options:

Make sure you’re using a GPU for training. This makes a huge difference in execution time.
Run the same notebook in the cloud (e.g. GCP) on a higher-spec machine. This way you can keep the machine turned on during the process. The drawback is that you have to pay.

Object detection is a complex task, and it’s expected to take a long time to finish, even on top-spec hardware.

@Yuqi_Li any other suggestion?

Viktor_Nilsson · November 30, 2021, 12:20pm

Do you have any plans to introduce support for resuming training from a model previously trained/created using TFLite Model Maker?
I often have a situation where training data is acquired continuously from existing camera installations. It would be a great feature to be able to use a previously trained model as a baseline when continuing the training with new data.

For example, an option to pass the path to an existing checkpoint when calling tflite_model_maker.object_detector.create()?

Thanks!

Robert_Zak · December 23, 2021, 4:04am

Is there any update on the ability of Model Maker, as featured in the EfficientDet tutorial, to resume from a checkpoint? I notice that the current version of EfficientDetLiteXSpec() takes a “model_dir” argument. When it is set, object_detector.create() dutifully records checkpoints as it trains. Is there any other use for these checkpoints (other than resuming from a checkpoint)?

Viktor_Nilsson · January 17, 2022, 9:03am

I made a workaround that allows resuming from a checkpoint saved in model_dir by manually calling load_weights({checkpoint_path}) on the Keras model (tf.keras.Model.load_weights) before starting to train again.

The quickest way to try it is to install TFLite Model Maker from source with pip and add:
model.load_weights({checkpoint_path}) in the train() function in object_detector_spec.py, just before the call to model.fit().
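
For readers who want to see the shape of this workaround outside the Model Maker source, here is a minimal sketch. It is not the code from Model Maker or from any PR. The helper names latest_checkpoint_prefix and resume_fit are hypothetical, and the sketch assumes that checkpoints in model_dir are named "ckpt-<epoch>" with a matching ".index" file, which may not hold for your version. It only relies on the Keras-style model methods load_weights(filepath) and fit(..., epochs=..., initial_epoch=...).

```python
import os
import re


def latest_checkpoint_prefix(model_dir):
    """Return (prefix, epoch) of the newest checkpoint, or (None, 0) if none exists.

    ASSUMPTION: checkpoints are stored as 'ckpt-<epoch>.index' plus data files.
    """
    if not os.path.isdir(model_dir):
        return None, 0
    pattern = re.compile(r"^(ckpt-(\d+))\.index$")
    best_prefix, best_epoch = None, 0
    for name in os.listdir(model_dir):
        match = pattern.match(name)
        if match and int(match.group(2)) > best_epoch:
            best_prefix = os.path.join(model_dir, match.group(1))
            best_epoch = int(match.group(2))
    return best_prefix, best_epoch


def resume_fit(model, model_dir, train_data, total_epochs, **fit_kwargs):
    """Restore the newest checkpoint (if any) into a Keras-style model, then train.

    'model' is any object exposing load_weights() and fit() like tf.keras.Model.
    """
    prefix, done_epochs = latest_checkpoint_prefix(model_dir)
    if prefix is not None:
        # Load the saved weights before fit(), as described in the workaround.
        model.load_weights(prefix)
    # initial_epoch lets Keras continue counting from the restored epoch.
    return model.fit(train_data, epochs=total_epochs,
                     initial_epoch=done_epochs, **fit_kwargs)
```

With a real Keras model this would be called as resume_fit(model, "my_model_dir", train_ds, total_epochs=50), where "my_model_dir" and train_ds are placeholders. TensorFlow also provides tf.train.latest_checkpoint(), which reads the "checkpoint" bookkeeping file instead of parsing file names. Note that load_weights restores the weights saved in the checkpoint; whether the optimizer state is also restored depends on how the checkpoint was written.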

lgusm · January 17, 2022, 7:13pm

Hi Viktor,

Can you send a PR with this change? I think that other people might benefit from it!

Fredrik_T · January 18, 2022, 6:45am

This sounds like something that could quickly be implemented officially.
Great find!

Viktor_Nilsson · January 18, 2022, 9:56am

Hi,
Here is a PR where a checkpoint can be passed to object_detector.create() to resume training.

lgusm · January 18, 2022, 11:50am

Thanks, Viktor! Many people will be happy when this is merged, given the previous threads asking for the same thing!