Image classification with ConvMixer

What happens when we apply similar pure convolution blocks on patches of images? We can train a network with 0.8 million parameters for 10 epochs on CIFAR-10 and get ~83% top-1 test accuracy without having to use any fancy regularization. ConvMixer (the recently talked about architecture on Twitter):

There are a few visualizations of the internals of ConvMixer that might be useful for the community.

Learned patch embeddings:


Convolution kernel from the middle of the network showing varying locality spans: