Difference in accuracy between TensorFlow and PyTorch models

I’ve implemented YOLACT from scratch in TensorFlow. You can find it here. However, I’m only getting 25.6 mAP, while the official implementation gets 28.2 mAP, a gap of 2.6 mAP. Here are some of the things I’ve found:

  1. L2 regularization was essential for training the model (without it I was getting 0 mAP).
  2. Including variances in the bounding-box encoding (0.1 for the center and 0.2 for the width) improved mAP by 2.
  3. Gradient clipping at 10 was important to prevent exploding gradients.
  4. Only SGD with 0.9 momentum made the model converge (Adam didn’t work).
  5. A piecewise constant learning rate schedule gave faster convergence.
  6. Dividing the online hard example mining (OHEM) classification loss by the total number of positive and negative samples worked better than dividing by the number of positive samples alone.
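
A minimal NumPy sketch of items 2 and 6, for readers who want to compare against their own code. It assumes SSD-style box encoding with boxes in (cx, cy, w, h) form, the 0.2 variance applied to both width and height, and a 3:1 negative-to-positive ratio for hard example mining. The function names are hypothetical and this is not the code from either implementation.

```python
import numpy as np

def encode_boxes(gt, priors, variances=(0.1, 0.2)):
    """Encode ground-truth boxes relative to priors (assumed SSD-style).

    Both arrays have shape (N, 4) in (cx, cy, w, h) form.
    """
    gt = np.asarray(gt, dtype=np.float64)
    priors = np.asarray(priors, dtype=np.float64)
    # Center offsets are scaled by the prior size and the center variance.
    d_xy = (gt[:, :2] - priors[:, :2]) / (variances[0] * priors[:, 2:])
    # Size ratios are log-encoded and scaled by the size variance.
    d_wh = np.log(gt[:, 2:] / priors[:, 2:]) / variances[1]
    return np.concatenate([d_xy, d_wh], axis=1)

def decode_boxes(deltas, priors, variances=(0.1, 0.2)):
    """Invert encode_boxes: turn predicted deltas back into boxes."""
    deltas = np.asarray(deltas, dtype=np.float64)
    priors = np.asarray(priors, dtype=np.float64)
    xy = priors[:, :2] + deltas[:, :2] * variances[0] * priors[:, 2:]
    wh = priors[:, 2:] * np.exp(deltas[:, 2:] * variances[1])
    return np.concatenate([xy, wh], axis=1)

def ohem_class_loss(per_anchor_loss, is_positive, neg_pos_ratio=3):
    """Hard-negative-mined classification loss.

    Keeps all positives plus the highest-loss negatives (ratio is an
    assumption), then divides by positives + selected negatives.
    """
    loss = np.asarray(per_anchor_loss, dtype=np.float64)
    pos = np.asarray(is_positive, dtype=bool)
    num_pos = int(pos.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~pos).sum()))
    # Sort negative losses in descending order and keep the hardest ones.
    hard_neg = np.sort(loss[~pos])[::-1][:num_neg]
    total = loss[pos].sum() + hard_neg.sum()
    return total / max(num_pos + num_neg, 1)
```

Dividing by `num_pos` alone instead would make the loss roughly `1 + neg_pos_ratio` times larger, so the two normalizations also change the effective weight of the classification term relative to the box and mask losses.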

With these changes, I’m still not able to match the official mAP. The only remaining difference I can think of is the backbone. PyTorch pre-trained models expect the image to be scaled to the range 0 to 1, so the input has to be normalized that way when using them. TensorFlow does not scale the image to the range 0 to 1; it converts the image from RGB to BGR and then zero-centers each color channel with respect to the ImageNet dataset, without scaling.
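
To make the preprocessing difference concrete, here is a small NumPy sketch of the two conventions. It assumes the commonly documented torchvision normalization (scale to 0 to 1, then subtract the per-channel ImageNet mean and divide by the standard deviation) and the Keras "caffe"-mode preprocessing (RGB to BGR, subtract the ImageNet mean, no scaling). The function names are hypothetical, and the backbone weights actually in use may expect something different.

```python
import numpy as np

# Per-channel ImageNet statistics (RGB order, for inputs scaled to 0..1).
IMAGENET_MEAN_RGB = np.array([0.485, 0.456, 0.406])
IMAGENET_STD_RGB = np.array([0.229, 0.224, 0.225])
# Per-channel ImageNet means (BGR order, for inputs in 0..255).
IMAGENET_MEAN_BGR_255 = np.array([103.939, 116.779, 123.68])

def preprocess_torch_style(img_rgb_uint8):
    """Scale to 0..1, then normalize each RGB channel by mean and std."""
    x = img_rgb_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN_RGB) / IMAGENET_STD_RGB

def preprocess_caffe_style(img_rgb_uint8):
    """Flip RGB to BGR, then zero-center each channel without scaling."""
    x = img_rgb_uint8.astype(np.float32)[..., ::-1]
    return x - IMAGENET_MEAN_BGR_255
```

The outputs differ in channel order and in scale (roughly -2 to 2 versus roughly -124 to 152), so pre-trained weights only behave as intended with the preprocessing they were trained with.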

Has anyone faced a similar problem while training a model in TensorFlow? If so, how did you overcome it?

I’ve also found that the official implementation unfreezes batch normalization, but when I try to do the same, my loss does not converge. How is that possible?