I understand what the embedding process does: it turns an integer representing a unique word into an n-dimensional vector, and does this for each word in the vocabulary. However, I'm trying to understand how it actually clusters similar words via gradient descent. For similar words to end up clustered, the distance between two words in vector space should somehow contribute to the error of the classification model (which proxies the "similarity", or clustering, criterion). At this point, I don't see how that is the case.
Can someone help me along the way?
EDIT: Could you think of it as the network adjusting the embedding weights so that the resulting inputs to the subsequent dense layers better reflect the similarity between words of the same "class"? In other words, does the network train the embedding to make its input conform better to the classification task?
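To make my question concrete, here is a toy sketch of what I *think* happens (pure NumPy, my own construction, not from any library): a 5-word vocabulary, 3-dimensional embeddings, and a single logistic classifier on top. The embedding lookup is just "select a row of the matrix E", so during backprop only the looked-up row receives a gradient. Words 0 and 1 share one label, words 2 and 3 the other.

```python
import numpy as np

# Toy setup: 5-word vocab, 3-dim embeddings, binary label per word.
rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
E = rng.normal(scale=0.1, size=(vocab_size, dim))  # embedding matrix
w = rng.normal(scale=0.1, size=dim)                # classifier weights
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (word index, label): words 0 and 1 share a class, 2 and 3 the other.
data = [(0, 1), (1, 1), (2, 0), (3, 0)]

for _ in range(500):
    for word, label in data:
        x = E[word].copy()          # "lookup" = select row `word` of E
        p = sigmoid(x @ w + b)      # forward pass through logistic layer
        grad = p - label            # dLoss/dlogit for cross-entropy loss
        # Backprop: the gradient reaches the classifier AND the one
        # embedding row that was looked up -- nothing else in E moves.
        E[word] -= lr * grad * w
        w -= lr * grad * x
        b -= lr * grad

# Same-class words were pushed in the same direction along w,
# opposite classes in opposite directions.
d_same = np.linalg.norm(E[0] - E[1])
d_diff = np.linalg.norm(E[0] - E[2])
print(d_same, d_diff)
```

If my intuition is right, `d_same` should come out smaller than `d_diff`: the loss never penalizes inter-word distance directly, but same-class rows receive gradients in the same direction, so clustering falls out as a side effect. Is that the correct picture?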