It is a common practice to use the same input image resolution while training and testing vision models. However, as investigated in Fixing the train-test resolution discrepancy (Touvron et al.), this practice leads to suboptimal performance. Data augmentation is an indispensable part of the training process of deep neural networks. For vision models, we typically use random resized crops during training and center crops during inference. This introduces a discrepancy in the object sizes seen during training and inference. As shown by Touvron et al., if we can fix this discrepancy, we can significantly boost model performance.
In the following example, we implement the FixRes techniques introduced by Touvron et al. to fix this discrepancy.