I’m trying to implement Faster R-CNN in Keras and am having enormous difficulty doing so. I believe I am encountering some sort of numerical stability issue. When I try to train only the region proposal network (VGG16 with a couple of convolutional layers tacked on, producing two output maps: objectness scores and box regressions) on a small subset of images, training progress as measured by recall % is very slow and oscillates quite a bit.
When I enable clipnorm with values like 8, 4, and 1, behavior improves. I don’t see losses exploding or turning into NaNs in any case, and I’m at a loss as to how to debug the issue.
I suspect the issue must involve my loss functions, but they are relatively simple (apart from a lot of tensor reshaping). What would be a good way to debug this?
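For context, the kind of sanity check I can do is calling a loss eagerly on a fixed batch and checking it for NaN/Inf. The `rpn_cls_loss` below is a hypothetical stand-in with the same flavor of reshaping as mine, not my actual loss:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for an RPN classification loss that reshapes its
# inputs; replace with the real loss function to test it in isolation.
def rpn_cls_loss(y_true, y_pred):
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    return tf.keras.losses.binary_crossentropy(y_true, y_pred)

# A fixed fake batch: 14x14 feature map, 9 anchors per location.
y_true = tf.constant(np.random.randint(0, 2, (1, 14, 14, 9)).astype("float32"))
y_pred = tf.random.uniform((1, 14, 14, 9), minval=1e-4, maxval=1.0 - 1e-4)

loss = rpn_cls_loss(y_true, y_pred)
tf.debugging.check_numerics(loss, "loss contains NaN/Inf")  # raises if non-finite
print(float(loss))
```

Since TF 2.x executes eagerly by default, this runs outside any `model.fit()` loop, so intermediate shapes and values can be printed directly.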
I’ve thought of printing out the gradient norm per training sample, but I’m not sure how to obtain the gradients in the first place. Most of the examples I’ve found online no longer work with TF 2.x.
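For reference, this is roughly what I’m attempting. It’s a minimal sketch using `tf.GradientTape`, with a tiny Dense model and MSE loss standing in for my RPN and its losses:

```python
import tensorflow as tf

# Toy stand-ins for the real RPN model and loss.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()

# One fake training sample batch.
x = tf.random.normal((4, 8))
y = tf.random.normal((4, 1))

# Record the forward pass so gradients can be taken afterwards.
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))

grads = tape.gradient(loss, model.trainable_variables)

# Keras's clipnorm clips each variable's gradient by its own L2 norm,
# so per-variable norms are the relevant quantity to inspect, alongside
# the global norm across all variables.
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, float(tf.norm(grad)))
print("global norm:", float(tf.linalg.global_norm(grads)))
```

My understanding is that a custom loop like this (with an optimizer’s `apply_gradients` added) would replace `model.fit()` while I debug, letting me log these norms per sample.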