I was comparing the Grad-CAM outputs of a pure CNN and a hybrid model (CNN + Swin Transformer). After passing an intermediate CNN feature map into the Swin Transformer, the transformer blocks appear to refine the feature activations globally across the relevant object, unlike the CNN, which tends to operate locally.
(left: input; middle: CNN; right: CNN + Transformer hybrid)
Code example: TF: Hybrid EfficientNet Swin-Transformer : GradCAM (Kaggle notebook)
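For anyone who wants to reproduce the comparison, here is a minimal Grad-CAM sketch in TensorFlow/Keras. It assumes a generic `tf.keras.Model` and a known intermediate layer name; `layer_name` and the usage at the bottom are placeholders, not the actual code from the notebook. For a transformer stage that outputs tokens of shape (batch, num_tokens, dim), you would first reshape the tokens back to a 2D grid before the spatial steps below.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name, class_index=None):
    """Compute a Grad-CAM heatmap for a single preprocessed image.

    model       : a tf.keras.Model (pure CNN or hybrid backbone)
    image       : float32 array of shape (H, W, C)
    layer_name  : name of the feature layer to inspect, e.g. the last
                  CNN block or the output of a Swin stage (placeholder)
    class_index : class to explain; defaults to the top prediction
    """
    # Sub-model exposing both the intermediate feature map and the
    # final predictions in one forward pass.
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(layer_name).output, model.output],
    )

    with tf.GradientTape() as tape:
        features, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]

    # Gradient of the class score w.r.t. the feature map, then
    # global-average-pool it to get one weight per channel.
    grads = tape.gradient(score, features)
    weights = tf.reduce_mean(grads, axis=(1, 2))

    # Weighted sum of the channels, ReLU, normalize to [0, 1].
    cam = tf.reduce_sum(features[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam)
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()

# Hypothetical usage: run the same image through both models and
# compare the heatmaps side by side, as in the figure above.
# cam_cnn    = grad_cam(cnn_model, img, layer_name="top_conv")
# cam_hybrid = grad_cam(hybrid_model, img, layer_name="swin_stage_out")
```

To overlay a heatmap on the input, resize `cam` to the image resolution (e.g. with `tf.image.resize`) and alpha-blend it over the original image.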