Strategy for high-throughput skipgrams?

I’m trying to test out hash embeddings (https://arxiv.org/abs/1709.03933) on skipgrams, and while I’m searching over parameters it seems like my data pipeline is the biggest bottleneck.

I’ve tried computing skipgrams on the fly, which is quite slow. I’ve also tried writing the positive pairs to TFRecords and computing the NCE negative samples on the fly, but that is very storage-intensive and requires a lot of preprocessing.
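For reference, the on-the-fly version looks roughly like this (a minimal sketch; the toy corpus, `VOCAB_SIZE` and `WINDOW_SIZE` are placeholders, and the skipgram extraction happens in a Python generator, which is where most of the time goes):

```python
import tensorflow as tf

VOCAB_SIZE = 100_000   # placeholder
WINDOW_SIZE = 4        # placeholder

# Toy stand-in for the real corpus: an iterable of integer token-id sequences.
corpus = [[12, 7, 345, 2, 99], [5, 88, 12, 4031]]

def skipgram_generator():
    for seq in corpus:
        # Positive (target, context) pairs only; NCE negatives are sampled later.
        pairs, _ = tf.keras.preprocessing.sequence.skipgrams(
            seq, vocabulary_size=VOCAB_SIZE,
            window_size=WINDOW_SIZE, negative_samples=0.0)
        for target, context in pairs:
            yield target, context

dataset = tf.data.Dataset.from_generator(
    skipgram_generator,
    output_signature=(tf.TensorSpec(shape=(), dtype=tf.int64),
                      tf.TensorSpec(shape=(), dtype=tf.int64)),
).batch(1024)
```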

What’s the current state of the art for producing skipgrams fast enough to keep up with embedding training?


Hi @Patrick_McCarthy

Welcome to the TensorFlow Forum!

Please have a look at the Word2vec tutorial, which generates skipgrams for word embeddings from large datasets, and use the tf.data.Dataset API to improve the input-pipeline performance. Thank you.
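As a rough sketch of that pattern (the names and toy data below are placeholders, not code from the tutorial): keep only the positive skipgram pairs materialized, sample the NCE negatives inside the graph with `tf.random.log_uniform_candidate_sampler`, and let `tf.data` handle shuffling, batching and prefetching in parallel.

```python
import tensorflow as tf

VOCAB_SIZE = 100_000   # placeholder vocabulary size
NUM_NEG = 5            # negatives per positive pair (placeholder)
BATCH_SIZE = 1024
AUTOTUNE = tf.data.AUTOTUNE

def add_negatives(target, context):
    # Sample NUM_NEG negative context ids per positive pair inside the graph,
    # so negatives never have to be written to disk.
    true_classes = tf.reshape(tf.cast(context, tf.int64), (1, 1))
    negatives, _, _ = tf.random.log_uniform_candidate_sampler(
        true_classes=true_classes, num_true=1, num_sampled=NUM_NEG,
        unique=True, range_max=VOCAB_SIZE)
    contexts = tf.concat([tf.expand_dims(tf.cast(context, tf.int64), 0),
                          negatives], axis=0)
    labels = tf.concat([tf.ones(1), tf.zeros(NUM_NEG)], axis=0)
    return (target, contexts), labels

# Toy placeholder data; in practice these are the precomputed positive pairs
# (e.g. read back from TFRecords or produced once per corpus pass).
targets = tf.constant([12, 7, 345], dtype=tf.int64)
contexts = tf.constant([7, 345, 2], dtype=tf.int64)

dataset = (tf.data.Dataset.from_tensor_slices((targets, contexts))
           .shuffle(10_000)
           .map(add_negatives, num_parallel_calls=AUTOTUNE)
           .batch(BATCH_SIZE)
           .prefetch(AUTOTUNE))
```

Because the negatives are drawn inside the `map` function, they are re-sampled every epoch and nothing extra needs to be stored, while `num_parallel_calls=tf.data.AUTOTUNE` and `prefetch` help keep the input pipeline from starving the accelerator.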