I understand what the embedding process does: it turns an integer representing a unique word into an n-dimensional vector, and does this for each word in the vocabulary. However, I'm trying to understand how it actually clusters similar words via gradient descent. For similar words to end up clustered, the distance between two words in vector space should somehow contribute to the error of the classification model (which proxies the "similarity", or clustering, criterion). At this point, I don't see how that is the case.
Can someone help me along the way?
EDIT: Could you think of it as the network adjusting the embedding weights so that the resulting inputs to the subsequent dense layers better reflect the similarity between words of the same "class"? In other words, does the network train the embedding to make its input conform better to the classification task?
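To make my question concrete, here is a toy sketch of what I *think* happens (pure NumPy, my own construction, not from any library): a 5-word vocabulary, 3-dimensional embeddings, and a single logistic classifier on top. The embedding lookup is just "select a row of the matrix E", so during backprop only the looked-up row receives a gradient. Words 0 and 1 share one label, words 2 and 3 the other.

```python
import numpy as np

# Toy setup: 5-word vocab, 3-dim embeddings, binary label per word.
rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
E = rng.normal(scale=0.1, size=(vocab_size, dim))  # embedding matrix
w = rng.normal(scale=0.1, size=dim)                # classifier weights
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (word index, label): words 0 and 1 share a class, 2 and 3 the other.
data = [(0, 1), (1, 1), (2, 0), (3, 0)]

for _ in range(500):
    for word, label in data:
        x = E[word].copy()          # "lookup" = select row `word` of E
        p = sigmoid(x @ w + b)      # forward pass through logistic layer
        grad = p - label            # dLoss/dlogit for cross-entropy loss
        # Backprop: the gradient reaches the classifier AND the one
        # embedding row that was looked up -- nothing else in E moves.
        E[word] -= lr * grad * w
        w -= lr * grad * x
        b -= lr * grad

# Same-class words were pushed in the same direction along w,
# opposite classes in opposite directions.
d_same = np.linalg.norm(E[0] - E[1])
d_diff = np.linalg.norm(E[0] - E[2])
print(d_same, d_diff)
```

If my intuition is right, `d_same` should come out smaller than `d_diff`: the loss never penalizes inter-word distance directly, but same-class rows receive gradients in the same direction, so clustering falls out as a side effect. Is that the correct picture?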