Image similarity with EfficientNet

I’m using the EfficientNet model and its 1280-dimensional output vector for image similarity. It’s working great, and in my testing it’s the best model I’ve found for the amount of data involved.
Sometimes it just does weird things, though.
Here are the images in question: (Imgur link)
The first image is the input, the second is the one that should be found, and the third is the one actually returned as closest.

I ran a compare script on these images; here are the results:
input against the image that should be found (img1 vs img2)

features shape: (1280,)
euclidean: 0.839167594909668
hamming: 1.0
chebyshev: 0.14557853
correlation: 0.36508870124816895
cosine: 0.35210108757019043
minkowski: 0.839167594909668

input against the image that is incorrectly returned as closest (img1 vs img3)

features shape: (1280,)
euclidean: 0.7945413589477539
hamming: 1.0
chebyshev: 0.11684865
correlation: 0.32784974575042725
cosine: 0.3156479597091675
minkowski: 0.7945413589477539

I don’t understand how img3 can be closer to the input than img2. It works most of the time; this is a weird outlier.
Any ideas how this can be solved?


Was it trained only on your own dataset?

No, I’m using pre-trained weights. I would train it, but I’m not sure what the correct approach would be or what the benefit is. The image database contains only one class, namely stamps, so any classification goes out the window.

You can still use your dataset for metric learning or embedding fine-tuning, with whichever backbone you want (EfficientNet included).
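The core idea of metric learning can be shown with a triplet margin loss; here is a NumPy sketch (the real thing would be a TensorFlow loss over batches, e.g. a triplet or contrastive loss, but the math is the same):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on embedding vectors (NumPy sketch).

    Pulls the anchor toward a positive (another image of the same stamp)
    and pushes it away from a negative (a different stamp) until the
    negative is at least `margin` farther away than the positive.
    """
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)    # zero once the gap is big enough
```

Minimizing this over many triplets reshapes the embedding space so that images of the same stamp cluster together, which is exactly what a nearest-neighbor lookup needs.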


Thank you very much for your suggestions!
I’ve stumbled upon metric learning before, but how to implement it in my case went over my head. I’ve figured it out now; it wasn’t too difficult once it clicked with the link you provided, and my tests show good results. The distances are much, much smaller now, sometimes nearly 0, which is a big win!

I’m still struggling to understand what’s really going on here. From my understanding, every layer has weights that can be tuned, and in fine-tuning you freeze most of the pre-trained weights. I’m not freezing any right now, and I wonder if that’s the best thing to do.
My intuition would be to freeze every layer except the last one I use, the avg_pool, but that layer and the ones just before it don’t have many weights. I fear I’d skew the weights too much with my limited dataset.
Any suggestions on this, or do you think it’s alright?

For reference, I build the model like this:

import tensorflow as tf

model = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet', pooling='avg')
embeddings = tf.nn.l2_normalize(model.output, axis=-1)
metric_model = EmbeddingModel(model.input, embeddings)
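On the freezing question: one common pattern is to freeze everything except the last few blocks. A small framework-agnostic sketch of that logic (it works on any sequence of Keras-style layer objects exposing a mutable `.trainable` attribute, e.g. `model.layers`; the cutoff of 20 is an arbitrary example, not a recommendation):

```python
def freeze_all_but_top(layers, n_trainable=20):
    """Freeze every layer except the last `n_trainable` (sketch).

    With limited data this keeps the generic low-level ImageNet features
    intact and only adapts the top of the network to the stamp domain.
    """
    for layer in layers:
        layer.trainable = False          # freeze everything first
    for layer in layers[-n_trainable:]:
        layer.trainable = True           # then unfreeze the top
```

One known Keras gotcha when unfreezing more of the network: BatchNormalization layers are usually best kept in inference mode during fine-tuning (for example by calling the base model with `training=False`), otherwise their pre-trained statistics get overwritten by your small dataset.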

If your dataset is really quite small, you could be in a few-shot domain.

In that case you can take a look at some tips in:

Well, it’s not really a small set. I’ll explain:

The goal is to take a camera shot of a real stamp and find the correct match in a database of 350k+ unique stamp images.
Most of the results are pretty accurate, but it could be better, which is why I’m looking to train the model further.
For a lot of stamps I have real camera photos, 10-20 or more, which I could compile, but:
With metric learning I’m running into the problem that a label is expected, and since I technically have over 350k labels I don’t know how to deal with that. In practice I think metric learning is the right solution, but I’m a little stuck right now.
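One thing worth noting about the label problem: for metric learning the labels never become a 350k-way softmax head; they only say which images show the same stamp, so each training batch just needs a handful of stamp IDs with a few images apiece (often called P×K sampling). A pure-Python sketch of such a batch sampler, assuming a list of `(image, stamp_id)` pairs:

```python
import random
from collections import defaultdict

def pk_batch(image_labels, p=8, k=4, seed=0):
    """Build one P*K metric-learning batch (sketch).

    Picks `p` stamp IDs that have at least `k` images, then `k` images
    per ID. The full set of 350k+ IDs is never materialized as a
    classification head; it only drives the grouping within a batch.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, label in image_labels:
        by_label[label].append(img)
    eligible = [l for l, imgs in by_label.items() if len(imgs) >= k]
    chosen = rng.sample(eligible, p)
    return [(img, l) for l in chosen for img in rng.sample(by_label[l], k)]
```

A triplet or contrastive loss then forms its positive/negative pairs entirely inside that batch, so memory scales with the batch size, not the number of stamps.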

If you need to retrieve the image from a real camera picture and you don’t have too many real camera images in your training dataset, you will probably need to care about augmentations.

E.g., see RandAugment in:
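To make the idea concrete: RandAugment applies a random subset of many image ops per sample; a stripped-down NumPy sketch with just two camera-style perturbations (brightness jitter and a small spatial shift) looks like this:

```python
import numpy as np

def augment(img, rng):
    """Cheap camera-style augmentations for stamp images (NumPy sketch,
    not the real RandAugment). `img` is an HxWxC uint8 array.
    """
    # random brightness shift, mimicking varying lighting
    out = img.astype(np.int16) + rng.integers(-30, 31)
    out = np.clip(out, 0, 255).astype(np.uint8)
    # small random translation, mimicking imperfect framing
    dy, dx = rng.integers(0, 5, size=2)
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    return out
```

Applying such perturbations to the clean catalog scans during training makes the embeddings less sensitive to the lighting and framing differences of real camera shots.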

How should I handle 350k+ classes? Eventually I would need at least 1 million, because that’s roughly the number of unique stamps that exist.
If I try with 1 million labels I run into OOM errors. This conceptual problem keeps me from doing any meaningful training.

For the augmentation I was only talking about this:

For a lot of stamps I have real camera photos, 10-20

If you really are going to have a large-scale classification problem, it is going to be quite similar to the solutions proposed for large-scale face recognition.

E.g., the Glint360K dataset has 360k identities, quite similar to your 350k+, but with other tricks you can also scale to 10M, 20M, 30M, or 100M. See:
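For intuition, the memory trick behind approaches like Partial FC (used for those face-recognition scales) is to keep the full class-center matrix but only use a sampled subset of negative centers at each step, so the softmax over a million classes is never materialized in full. A NumPy sketch of the sampling step (an illustration of the idea, not the actual Partial FC implementation):

```python
import numpy as np

def sampled_centers(centers, batch_labels, sample_frac=0.1, rng=None):
    """Select the class centers used for one training step (sketch of the
    Partial FC idea). Keeps every positive center appearing in the batch
    plus a random fraction of the remaining (negative) centers.
    """
    rng = rng or np.random.default_rng()
    n = centers.shape[0]
    positives = np.unique(batch_labels)
    negatives = np.setdiff1d(np.arange(n), positives)
    n_neg = int(sample_frac * n)
    sampled_neg = rng.choice(negatives, size=min(n_neg, negatives.size),
                             replace=False)
    idx = np.concatenate([positives, sampled_neg])
    return idx, centers[idx]
```

The loss for the step is then computed only against the returned subset, so memory per step scales with `sample_frac`, not with the total number of classes.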

I see that you are using ImageNet pre-trained weights of EfficientNet for feature extraction.

From what I see, either or both of the following issues may be contributing to the weird result:

  • The official Keras implementation of EfficientNet expects un-normalized inputs in the range 0-255. So if you are normalizing the input images before feeding them into the network, that may lead to issues.

Quote from Documentation:
EfficientNet models expect their inputs to be float tensors of pixels with values in the [0-255] range.
Source: Keras EfficientNet Documentation

  • The alternative issue (and the most likely one): since the network was pre-trained on ImageNet, which does not contain examples similar to your query and target images, it is possible that the feature vectors for these images are very similar, leading to errors in the distance calculations. The solution in this case would be to train/fine-tune your model on your own dataset to get more relevant feature vectors.
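The first issue is easy to check for. Since the Keras EfficientNet models rescale internally (per the documentation quoted above), a batch that was already divided by 255 effectively looks near-black to the model. A small hypothetical sanity-check helper:

```python
import numpy as np

def check_input_range(batch):
    """Heuristic check for pre-normalized inputs (sketch).

    Keras EfficientNet expects raw 0-255 pixel values; a batch whose
    maximum is <= 1.0 was almost certainly divided by 255 already.
    """
    if float(np.max(batch)) <= 1.0:
        return "looks normalized to [0, 1]; remove the /255 step"
    return "ok: raw pixel range"
```

Running this on a batch right before it enters the model is a quick way to rule the normalization issue in or out.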