Instead of a single word, is it possible to generate an automatic text analysis of an image based on thousands of examples?

Hi everyone!

My name is Diego. I have some experience with Python, but I’m new to AI, machine learning, and TensorFlow. So far, I have only run some basic image classification examples.

Let me explain what I would like to achieve and see if you could give me some guidance on what to study so I can go further.

In the basic image classification examples I’ve seen, we load a dataset (e.g., flowers) with its labels, train the model, and then test it by loading a new image, which the model classifies with a certain level of accuracy (e.g., ‘roses’, ‘tulips’, ‘sunflowers’).

What I would like to know is:

Suppose I have thousands of images of flowers. For each image, I have an analysis of that particular flower, written by an expert, in text format, like: "this is a pink flower, it looks beautiful, it is small… etc etc " (each analysis is roughly a thousand words long).

The question I have is:

Would it be possible to train a model to generate an automatic analysis of a flower, based on the thousands of flower images and the corresponding analyses that I have?

My intention is to load an image of a new flower (I’m just using flowers as an example) and generate an automatic text analysis of that flower.

Would that be possible? If so, what would you suggest I study (tools, modules, libraries, etc.)? Sorry if this is too basic; the ML field is so vast that I feel a little lost sometimes.

Thanks in advance!


Hi @Diego_Souza, you can do this with image captioning. In this approach, features are extracted from the image and passed to the cross-attention layers of a Transformer decoder, which generates the text one word at a time. Inspecting the attention weights of the cross-attention layers shows which parts of the image the model was looking at as it generated each word. For an implementation, please refer to this tutorial. Thank you.
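
To make the idea a bit more concrete, here is a toy sketch in plain Python of the two pieces mentioned above: cross-attention over image features, and a loop that generates one word at a time. It is not the tutorial's code. Every name and number in it (`IMAGE_FEATURES`, `EMBEDDINGS`, `OUTPUT_VECTORS`, `generate_caption`) is made up for illustration, and nothing is trained. In a real model the image features would come from a pretrained image network and the word vectors would be learned by Transformer layers in TensorFlow.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(query, image_features):
    """Scaled dot-product attention of one text query over image regions.

    Returns the attended context vector and the attention weights
    (one weight per image region).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, feat) / scale for feat in image_features])
    dim = len(image_features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, image_features))
        for i in range(dim)
    ]
    return context, weights

def generate_caption(image_features, embeddings, output_vectors, max_len=5):
    """Greedy decoding: pick the best next word until <end> or max_len."""
    tokens = ["<start>"]
    attention_trace = []  # kept so the weights can be inspected afterwards
    for _ in range(max_len):
        query = embeddings[tokens[-1]]
        context, weights = cross_attention(query, image_features)
        attention_trace.append(weights)
        next_word = max(output_vectors, key=lambda w: dot(output_vectors[w], context))
        if next_word == "<end>":
            break
        tokens.append(next_word)
    return tokens[1:], attention_trace

# Hypothetical toy data: 3 image regions, each a 3-dimensional feature vector.
IMAGE_FEATURES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical hand-picked vectors standing in for learned parameters.
EMBEDDINGS = {"<start>": [2.0, 0.0, 0.0], "pink": [0.0, 2.0, 0.0], "flower": [0.0, 0.0, 2.0]}
OUTPUT_VECTORS = {"pink": [1.0, 0.0, 0.0], "flower": [0.0, 1.0, 0.0], "<end>": [0.0, 0.0, 1.0]}

if __name__ == "__main__":
    caption, trace = generate_caption(IMAGE_FEATURES, EMBEDDINGS, OUTPUT_VECTORS)
    print(" ".join(caption))
    for step, weights in enumerate(trace):
        print(step, [round(w, 2) for w in weights])
```

With these hand-picked vectors the loop produces "pink flower" and then stops at `<end>`. The per-step weights in `trace` are the toy equivalent of the attention weights you would inspect in a trained model to see which image regions influenced each word.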