Word Embeddings Not Accurate

aiman_shivani · July 16, 2021, 5:08am

I am trying to build my own word2vec model using the code provided here:
Link: word2vec | TensorFlow Core

I have also tried increasing the amount of training data, and I am able to achieve good model accuracy. But when I plot the word vectors in the Embedding Projector, the distances between words (the word similarity) are really bad. Even when I apply the cosine distance formula to very similar words, the result is bad.
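For reference, the cosine check described above can be sketched as follows. This is a minimal, dependency-free illustration, not the tutorial's code; the function names and the toy vectors are made up for the example, and in practice the vectors would be rows of the trained embedding matrix.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy vectors, invented for illustration only.
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.35]
banana = [-0.2, 0.9, -0.4]

print("king vs queen :", cosine_similarity(king, queen))
print("king vs banana:", cosine_similarity(king, banana))
```

With well-trained embeddings, related words should score noticeably higher than unrelated ones. If every pair scores about the same, the vectors themselves are probably at fault rather than the distance formula.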

Whereas if the same data is used to train embeddings from scratch (not pre-trained) with the Gensim library, the distance and similarity results are far better, in the Embedding Projector as well.

Can someone please help me with this? I want to use only the Word2Vec code provided by TensorFlow, but I am not able to get good results for word distance and word similarity.

Sayak_Paul · July 17, 2021, 1:56am

Could there be a problem in how you are serializing the embedding vectors and the associated words?
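One way to rule out a serialization problem is to write the vectors and words in lockstep and then read them back. The sketch below assumes the vocabulary is a list of strings whose index matches the row index of the weight matrix, and that index 0 is a padding token that should be skipped. The helper names `write_projector_files` and `read_projector_files` are hypothetical, and the toy data stands in for the real vocabulary and embedding weights.

```python
import io


def write_projector_files(vocab, weights, out_v, out_m, skip_padding=True):
    """Write one vector row and one word row per token, in the same order."""
    if len(vocab) != len(weights):
        raise ValueError("vocab and weights must have the same number of rows")
    for index, word in enumerate(vocab):
        if skip_padding and index == 0:
            continue  # assumption: index 0 is the padding token
        out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
        out_m.write(word + "\n")


def read_projector_files(in_v, in_m):
    """Read both files back and pair each word with its vector."""
    vectors = [[float(x) for x in line.split("\t")]
               for line in in_v.read().splitlines()]
    words = in_m.read().splitlines()
    if len(vectors) != len(words):
        raise ValueError("vector file and metadata file have different lengths")
    return dict(zip(words, vectors))


# Toy data, invented for illustration only.
vocab = ["", "[UNK]", "king", "queen"]
weights = [[0.0, 0.0], [0.1, 0.2], [0.9, 0.1], [0.8, 0.2]]

out_v, out_m = io.StringIO(), io.StringIO()
write_projector_files(vocab, weights, out_v, out_m)
out_v.seek(0)
out_m.seek(0)

table = read_projector_files(out_v, out_m)
# Each word must map back to the row it had in the weight matrix.
for index, word in enumerate(vocab):
    if index == 0:
        continue
    assert table[word] == weights[index], word
print("round trip OK for", len(table), "words")
```

If the padding row is skipped in one file but not the other, every word ends up shifted by one row. That alone would make similarities look random even with a well-trained model.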

Also, can you confirm that there is no difference between the hyperparameters you are using in TensorFlow and Gensim?
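To make that comparison concrete, it can help to pass every Gensim hyperparameter explicitly instead of relying on its defaults. In Gensim 4.x the defaults are `vector_size=100`, `window=5`, `negative=5`, `epochs=5` and `sg=0` (CBOW), so a one-line call will generally not match a skip-gram model trained with other settings. The sketch below is only an illustration: `TF_SETTINGS`, its keys and its values are placeholders to be replaced with the settings from your own TensorFlow script, and the keyword names assume Gensim 4.x.

```python
# Placeholder values: replace with the settings from your own TensorFlow script.
TF_SETTINGS = {
    "embedding_dim": 128,
    "window_size": 2,
    "num_negative_samples": 4,
    "epochs": 20,
}


def gensim_kwargs_from(tf_settings):
    """Map the hypothetical TF_SETTINGS keys to Gensim 4.x Word2Vec keyword arguments."""
    return {
        "vector_size": tf_settings["embedding_dim"],
        "window": tf_settings["window_size"],
        "negative": tf_settings["num_negative_samples"],
        "epochs": tf_settings["epochs"],
        "sg": 1,  # skip-gram instead of Gensim's default CBOW
        "hs": 0,  # negative sampling rather than hierarchical softmax
    }


def train_gensim(sentences, tf_settings):
    """Train a Gensim model with explicitly matched settings (requires gensim 4.x)."""
    from gensim.models import Word2Vec  # imported lazily so the mapping works without gensim
    return Word2Vec(sentences=sentences, **gensim_kwargs_from(tf_settings))


print(gensim_kwargs_from(TF_SETTINGS))
```

Other differences can remain even after this, for example Gensim's default `min_count=5` vocabulary cutoff and its frequent-word subsampling (`sample`), so matching these keyword arguments narrows the gap without guaranteeing identical training.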

aiman_shivani · July 17, 2021, 6:56am

I am sure about the serialization of the vectors and the associated words. As for the hyperparameters, I have tried my best to use the same ones, but the Gensim model is like a black box: with just one line of code I get the entire word vector array. There may well be some differences in the code or the preprocessing, but the TensorFlow model is not giving any good results at all.

aiman_shivani · July 17, 2021, 8:33am

Also, an issue has been raised before on the TensorFlow GitHub repository, but it doesn’t seem to have been resolved.
Issue raised: The word vector obtained by the word2vec tutorial is very bad · Issue #50645 · tensorflow/tensorflow · GitHub

Khushboo_Gupta · July 17, 2021, 5:23pm

Hi @aiman_shivani. Sorry, that was posted by mistake. It was not related to the tutorial code.

Sayak_Paul · July 19, 2021, 3:13am

/cc @markdaoust: maybe Mark can shed some additional light.