Tutorials/materials for fast image retrieval tasks

Sayak_Paul · June 16, 2021, 10:26am

Hi all.

Looking for resources/materials that build on TensorFlow/Keras (preferably) for doing scalable image similarity searches.

Bhack · June 16, 2021, 11:05am

With Keras you can take a look at:

Investigating the Vision Transformer Model for Image Retrieval Tasks
DELF tutorial

More generally, I suggest this overview:

https://arxiv.org/abs/2101.11282

Sayak_Paul · June 16, 2021, 11:27am

Thanks @Bhack.

I am aware of the DELF model, but I don’t want to go that route. I would rather use good old embeddings (reduced in dimensionality with something like random projections) and then apply approximate nearest neighbor search.
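A minimal sketch of that idea, assuming plain Python and random toy vectors in place of real image embeddings. The function names (`random_projection_matrix`, `project`, `nearest`) and the sizes (64 dimensions reduced to 16) are illustrative assumptions, not taken from any library. The final lookup is exact brute force; at scale an approximate nearest neighbor index would take its place.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Gaussian random matrix scaled so distances are roughly preserved."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(vec, matrix):
    """Map a vector from in_dim down to out_dim dimensions."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, candidates):
    """Index of the candidate closest to the query (exact, brute force)."""
    return min(range(len(candidates)),
               key=lambda i: squared_distance(query, candidates[i]))


# Toy stand-ins for image embeddings (hypothetical data, 64-d).
rng = random.Random(42)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(50)]

# Reduce every embedding once, offline.
matrix = random_projection_matrix(in_dim=64, out_dim=16)
reduced = [project(e, matrix) for e in embeddings]

# Query with a slightly perturbed copy of item 7, projected the same way.
query = [x + rng.gauss(0.0, 0.01) for x in embeddings[7]]
match = nearest(project(query, matrix), reduced)
```

The same matrix has to be applied to both the stored embeddings and every query, so in practice it would be saved alongside the index.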

Thanks for the survey paper; I will look into it. A TFX example on this could be very helpful, given its practical relevance. Cc: @Robert_Crowe

Bhack · June 16, 2021, 12:00pm

Once you have created a more or less efficient embedding, e.g.:

Then, you can use almost the same pipeline as in text:

Sayak_Paul · June 16, 2021, 1:02pm

Thanks.

I am quite familiar with the Keras examples that you shared. In fact, I myself have one:

All of them demonstrate workflows, which is why I wanted to know about solutions that focus on scalability and depth. Thanks for the GCP one.

Bhack · June 16, 2021, 5:38pm

It is also that this domain is quite large and has evolved somewhat over time.

As you can see from the survey mentioned above and also in:

https://arxiv.org/abs/2012.00641

ShivamShrirao · June 17, 2021, 12:39am

Well, if TensorFlow/Keras isn’t a necessity, you can try the CLIP model by OpenAI. It is quite robust and generalizes well. You can precompute the embeddings for all your images and then compute the cosine similarity between them and your query image’s embedding. For a very large number of images, you can use approximate nearest neighbor search with Hnswlib:

or Faiss:

or Annoy:

CLIP is usually used by encoding text into the same embedding space as images, but it should also work for image-to-image search, since all the image embeddings lie in that same space.
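The precompute-then-compare step can be sketched as below. This is a dependency-free illustration, assuming tiny hand-written 3-d vectors in place of real CLIP embeddings. The helper names (`build_index`, `top_k`) are hypothetical, and an exhaustive scan like this is only practical for small galleries.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(embeddings):
    """Normalize once up front so a dot product equals cosine similarity."""
    return [l2_normalize(e) for e in embeddings]


def top_k(query, index, k=3):
    """Return the k most similar items as (item_index, cosine_similarity)."""
    q = l2_normalize(query)
    scores = [(sum(a * b for a, b in zip(q, item)), i)
              for i, item in enumerate(index)]
    scores.sort(reverse=True)
    return [(i, s) for s, i in scores[:k]]


# Hypothetical gallery embeddings; real ones would come from an image encoder.
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
index = build_index(gallery)

# The query's magnitude does not matter, only its direction.
results = top_k([2.0, 0.1, 0.0], index, k=2)
```

Normalizing the gallery once keeps the per-query cost to one dot product per stored image, which is also the form most ANN libraries expect for cosine or inner-product search.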

Sayak_Paul · June 17, 2021, 1:26am

Thanks for your suggestions. Using ANN has been on my mind from the beginning. After looking at a few other options, I also came across NGT, which seems to produce the best results.
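To illustrate what "approximate" buys you, here is a small sketch of random-hyperplane locality-sensitive hashing. Note that this is a different and much simpler technique than the graph-based indexes in Hnswlib and NGT, and it is not how any of the libraries in this thread are implemented. All names and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Random hyperplanes through the origin, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(sum(h * x for h, x in zip(plane, vec)) >= 0.0
                 for plane in hyperplanes)


def build_buckets(embeddings, hyperplanes):
    """Group item indices by signature so similar vectors tend to collide."""
    buckets = defaultdict(list)
    for i, e in enumerate(embeddings):
        buckets[signature(e, hyperplanes)].append(i)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Only items sharing the query's bucket are scored afterwards."""
    return buckets.get(signature(query, hyperplanes), [])


# Toy stand-ins for image embeddings (hypothetical data, 32-d).
rng = random.Random(1)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(200)]

planes = make_hyperplanes(dim=32, n_bits=6)
buckets = build_buckets(embeddings, planes)

# An exact copy of item 3 always hashes to item 3's bucket.
shortlist = candidates(list(embeddings[3]), buckets, planes)
```

Only the shortlist needs an exact similarity computation, which is where the speedup comes from. The trade-off is that a true neighbor can land in a different bucket and be missed, which is why real systems use several hash tables or a different index structure altogether.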

For pre-training within the image domain, I would prefer recent self-supervised methods like DINO because they are well formulated and well established. DINO is also particularly good at this task.

Just as an FYI, one of the Keras links above (shared by @Bhack) actually shows how to implement a minimal CLIP-like model. If you haven’t already, definitely check that out.