Grad-CAM: CNN vs. hybrid CNN + Swin Transformer

I was comparing the Grad-CAM heatmaps of a pure CNN and a hybrid model (CNN + Swin Transformer). After an intermediate CNN feature map is passed to the Swin Transformer, the transformer blocks appear to refine the feature activations globally across the relevant object, whereas the CNN on its own tends to operate more locally.

(Left: input; middle: CNN; right: CNN + Transformer hybrid.)

Code example: TF: Hybrid EfficientNet Swin-Transformer : GradCAM | Kaggle
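
For anyone who does not want to open the notebook, here is a minimal sketch of the core Grad-CAM step in plain NumPy. It is not the notebook's code. The function name `grad_cam` and the `(H, W, K)` array layout are assumptions made for illustration; in TensorFlow the gradients of the class score with respect to the chosen layer's feature maps would typically be obtained with `tf.GradientTape` and then converted to arrays.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap for one image (illustrative helper).

    activations: feature maps of the chosen layer, shape (H, W, K).
    gradients:   d(class score) / d(activations), same shape.
    Returns a heatmap of shape (H, W), scaled to [0, 1].
    """
    # Channel importance: global-average-pool the gradients over H and W.
    weights = gradients.mean(axis=(0, 1))  # shape (K,)

    # Weighted sum of the feature maps along the channel axis.
    cam = np.tensordot(activations, weights, axes=([2], [0]))  # shape (H, W)

    # Keep only locations with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)

    # Scale to [0, 1]; leave an all-zero map untouched.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam


if __name__ == "__main__":
    # Toy 2x2 feature maps with two channels (made-up numbers).
    acts = np.stack(
        [np.array([[1.0, 2.0], [3.0, 4.0]]),
         np.array([[4.0, 3.0], [2.0, 1.0]])],
        axis=-1,
    )
    grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))], axis=-1)
    print(grad_cam(acts, grads))
```

The same function can be applied to the last CNN feature map and to the output of the transformer stage (reshaped back to a spatial grid) to produce the middle and right panels side by side.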


Nice. It could also be interesting to visualize:


P.S. See also:


Thanks for sharing this info.
The paper that was explained in the video is super interesting. (printout :bookmark_tabs:)

Ref. Do Vision Transformers See Like Convolutional Neural