Compact Convolutional Transformers

How does a commoner train a Transformer-based model on small and medium datasets like CIFAR-10 or ImageNet-1k and still attain competitive results? What if they don’t have the luxury of a modern GPU cluster or TPUs?

You use Compact Convolutional Transformers (CCTs). In this example, I walk you through the concept of CCTs and present their implementation in Keras, demonstrating their performance on CIFAR-10:

A traditional ViT model takes about 4 million parameters and 100 epochs to get to 78% on CIFAR-10. CCTs get there with 30 epochs and 0.4 million parameters :slight_smile:
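For intuition on where the savings come from: CCT replaces ViT’s patchify embedding with a small convolutional tokenizer (conv + pool), so the token grid shrinks before attention ever runs. A rough shape calculation — the settings below are illustrative, not the exact config in the example:

```python
def conv_tokenizer_tokens(image_size, n_blocks):
    """Each tokenizer block: 3x3 conv with stride 1 and 'same' padding
    (keeps H and W), then a stride-2 pool that halves the resolution."""
    size = image_size
    for _ in range(n_blocks):
        size //= 2                # only the pooling changes the grid size
    return size * size            # the flattened grid is the token sequence

conv_tokenizer_tokens(32, 2)      # 8 * 8 = 64 tokens for a 32x32 image
```

Fewer, richer tokens mean the Transformer encoder on top can be much smaller.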


Well done Sayak!!! you’re on fire!!

Thanks, Gus! Happy to also collaborate with the Hub team to train this on ImageNet-1k and democratize its use.

That would be great!

This is great, will check it out soon :slight_smile: Thanks for sharing! ^^

As usual, Sayak writes tutorials faster than I can read them :sweat_smile:.

This is not the implementation I would expect for stochastic depth. Isn’t this more like… dropout with a funny noise shape? I expected:

import tensorflow as tf
from tensorflow import keras

class Stochastic(keras.layers.Layer):
  def __init__(self, wrapped, drop_prob, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.wrapped = wrapped
    self.drop_prob = drop_prob

  def call(self, x):
    # skip the wrapped layer entirely with probability drop_prob
    if tf.random.uniform(shape=()) > self.drop_prob:
      x = self.wrapped(x)

    return x

Otherwise you don’t get the training speed improvements they talk about in the stochastic-depth paper, since you run the layer anyway. Also, the dropout-like implementation… this only works to kill the output before the add on a residual branch, right? Does the dropout 1/keep_prob scaling still make sense used like this?
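To make the “dropout with a funny noise shape” reading concrete, here is a plain-Python sketch (illustrative, not the tutorial’s code) of a per-sample mask applied to a residual branch — each example’s branch output is either kept and rescaled, or zeroed wholesale:

```python
import random

def stochastic_depth_mask(batch, drop_prob):
    """Per-sample mask -- like dropout with noise shape (batch, 1, ...):
    each example's whole residual-branch output is kept (rescaled) or zeroed."""
    keep_prob = 1.0 - drop_prob
    out = []
    for sample in batch:
        if random.random() < keep_prob:   # keep this example's branch
            out.append([v / keep_prob for v in sample])
        else:                             # drop the whole branch for it
            out.append([0.0 for _ in sample])
    return out

random.seed(0)
branch_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # per-example branch outputs
masked = stochastic_depth_mask(branch_out, drop_prob=0.5)
# the residual add would then be: x[i] + masked[i] for each example i
```

Because the mask broadcasts over everything but the batch axis, a sample never sees a partially dropped branch.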

You are right. This implementation of Stochastic Depth is only cutting the outputs at the residual block.

No, I am not scaling the dropped out blocks with the inverse dropout probabilities. They are simply not dropped during inference.

Is wrapped the block we are applying?

Sorry I jumped straight to the criticism, I am a big fan, keep up the good work.

No, I am not scaling the dropped out blocks with the inverse dropout probabilities. They are simply not dropped during inference.

Right, but it is being applied during training and it’s the difference between training and inference that I wonder about.

My intuition for why dropout uses that scaling factor is so that the mean value of the feature is the same before and after the dropout. But each example is independent, they don’t share statistics. In training the next layer sees layer(x)/keep_prob when the layer is kept, and 0 otherwise. No mixing. So the average across samples is preserved, but maybe the average value seen for any sample is not realistic.
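That across-samples intuition is easy to check numerically: with the 1/keep_prob rescaling, the mean value the next layer sees during training matches the inference-time value. A small plain-Python sketch (illustrative numbers):

```python
import random

random.seed(42)
keep_prob = 0.8
x = 1.0                      # the feature value on the kept path
n = 200_000
total = 0.0
for _ in range(n):
    kept = random.random() < keep_prob
    # what the next layer sees during training for this sample:
    total += (x / keep_prob) if kept else 0.0
mean_seen = total / n        # should be close to x, i.e. 1.0
```

So the mean across samples is preserved — but, as noted above, any individual sample sees either 0 or x/keep_prob, never x itself.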

And I see slightly better results (just 1 run) after dropping that factor:


Epoch 28/30
352/352 [==============================] - 13s 37ms/step - loss: 0.9469 - accuracy: 0.8020 - top-5-accuracy: 0.9899 - val_loss: 1.0130 - val_accuracy: 0.7766 - val_top-5-accuracy: 0.9870
Epoch 29/30
352/352 [==============================] - 13s 36ms/step - loss: 0.9326 - accuracy: 0.8079 - top-5-accuracy: 0.9901 - val_loss: 1.0455 - val_accuracy: 0.7674 - val_top-5-accuracy: 0.9844
Epoch 30/30
352/352 [==============================] - 13s 37ms/step - loss: 0.9296 - accuracy: 0.8097 - top-5-accuracy: 0.9902 - val_loss: 0.9982 - val_accuracy: 0.7822 - val_top-5-accuracy: 0.9838
313/313 [==============================] - 2s 8ms/step - loss: 1.0239 - accuracy: 0.7758 - top-5-accuracy: 0.9837
Test accuracy: 77.58%
Test top 5 accuracy: 98.37%


Epoch 28/30
352/352 [==============================] - 13s 37ms/step - loss: 0.9268 - accuracy: 0.8117 - top-5-accuracy: 0.9908 - val_loss: 0.9599 - val_accuracy: 0.8050 - val_top-5-accuracy: 0.9872
Epoch 29/30
352/352 [==============================] - 13s 37ms/step - loss: 0.9255 - accuracy: 0.8136 - top-5-accuracy: 0.9910 - val_loss: 0.9751 - val_accuracy: 0.7942 - val_top-5-accuracy: 0.9868
Epoch 30/30
352/352 [==============================] - 13s 37ms/step - loss: 0.9132 - accuracy: 0.8181 - top-5-accuracy: 0.9923 - val_loss: 0.9745 - val_accuracy: 0.7952 - val_top-5-accuracy: 0.9870
313/313 [==============================] - 3s 9ms/step - loss: 0.9976 - accuracy: 0.7867 - top-5-accuracy: 0.9855
Test accuracy: 78.67%
Test top 5 accuracy: 98.55%

Interesting. IIUC, here’s that factor in the code from the original paper, so maybe I’m wrong:

Is wrapped the block we are applying?

Yes, that’s what I meant here: this layer is like “maybe apply the wrapped layer”. (It should also check the training flag…)
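A framework-free sketch of that “maybe apply” wrapper with the training-flag check added (names here are illustrative, not from the example’s code):

```python
import random

class StochasticWrapper:
    """'Maybe apply the wrapped layer': skipped at random per example during
    training, always applied at inference."""
    def __init__(self, wrapped, drop_prob):
        self.wrapped = wrapped
        self.drop_prob = drop_prob

    def __call__(self, x, training=False):
        if not training:
            return self.wrapped(x)            # inference: always run the layer
        if random.random() < self.drop_prob:  # training: sometimes skip it
            return x                          # identity on the residual branch
        return self.wrapped(x)
```

With drop_prob of 0 or 1 the behaviour is deterministic, which makes the edge cases easy to sanity-check.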

Either way, fun stuff.

Thanks again Sayak!

I think this is the need of the hour. Please keep’em coming. Thanks very much for the resources. If you feel like you could add those to the example, feel free to submit a PR (maybe)? I’d be more than happy to give you co-author credits.

I think for single domain learning tasks like vanilla image classification batch stats are fine.

Which factor? Could you provide a short snippet?

Yes, absolutely.

But I guess the main thing is just that it turns out I just don’t really understand stochastic depth.

Your code and the original implementation are doing the same thing. And now I’m just trying to understand why.


    -- from the original (Lua Torch) implementation
    if self.train then
      if self.gate then -- only compute convolutional output when gate is open

And the Keras version:

        if training:
            keep_prob = 1 - self.drop_prob
            shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
            random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
            random_tensor = tf.floor(random_tensor)
            return (x / keep_prob) * random_tensor
        return x

The part I don’t understand is the keep_prob scaling.

      # Inference branch
      return x

      # Training branch
      return (x / keep_prob) * random_tensor

These are equivalent, but I don’t understand why this line is there.

I understand the argument for this scaling in dropout: “Show me half the pixels twice as bright during training and then all the pixels for inference.”

But I’m less comfortable with applying this logic to the entire example. “Skip the operation or do it twice as hard, and for inference do it with regular strength.”

But maybe I can understand it with the “the layers of a resnet are like a gradient vector field pushing the embedding towards the answer” interpretation. I guess if I’m taking fewer steps, each one could be larger.

Which factor? Could you provide a short snippet?

My little experiment was, in your code, to just replace this line:

return (x / keep_prob) * random_tensor

with:

return x * random_tensor

I’ll run it a few more times and see what happens.


I’ll run it a few more times and see what happens.

It looks like the difference between those two runs was not important. Validation accuracy seems to come out anywhere from 76-79%.


Right on.

Thanks so much, @markdaoust for the conversation and for looking into this.


Thanks for sharing the code!

Did you ever try to replicate CVT with the official version’s parameters?

When I tried to do so, I could not match the model size. To me it appeared that their version of Multi-Head Attention used less than half of the parameters of the TF version.

Any thoughts?

Yup. I know about this.

Check out the discussion on the PR:

If you use Hugging Face’s implementation of attention the number of parameters can be further reduced. Or even better – timm's implementation of attention.


Alihassanijr’s comments perfectly align with what I saw.

I hope the suggested workaround plays out well for you :slight_smile: