Strategy for high-throughput skipgrams?

I’m trying to test out hash embeddings on skipgrams, and while I’m searching for parameters it seems like my data pipeline is the biggest bottleneck.

I’ve tried computing skipgrams on the fly, which is quite slow. I’ve also tried writing positive pairs to TFRecords and computing the NCE negative samples on the fly, but that approach is very storage-intensive and requires a lot of preprocessing.
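For concreteness, here’s a simplified sketch of the kind of on-the-fly generation I mean: sliding-window positive pairs plus word2vec-style negatives drawn from the unigram distribution raised to 0.75 (NumPy only, illustrative; function and parameter names are mine, not from any library):

```python
import numpy as np

def skipgram_batches(ids, window=2, num_neg=5, batch_size=4096, seed=0):
    """Yield (center, context, negatives) batches from a 1-D array of token ids.

    Negatives are sampled from the unigram distribution raised to 0.75,
    the usual word2vec heuristic.
    """
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids, dtype=np.int64)
    vocab = int(ids.max()) + 1

    # Noise distribution for negative sampling: count(w) ** 0.75, normalized.
    counts = np.bincount(ids, minlength=vocab).astype(np.float64)
    probs = counts ** 0.75
    probs /= probs.sum()

    centers, contexts = [], []
    for i, c in enumerate(ids):
        # Collect all context tokens within the window around position i.
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                centers.append(c)
                contexts.append(ids[j])
        if len(centers) >= batch_size:
            neg = rng.choice(vocab, size=(len(centers), num_neg), p=probs)
            yield np.asarray(centers), np.asarray(contexts), neg
            centers, contexts = [], []

    if centers:  # flush the final partial batch
        neg = rng.choice(vocab, size=(len(centers), num_neg), p=probs)
        yield np.asarray(centers), np.asarray(contexts), neg
```

The Python-level inner loop is exactly where this gets slow at scale, which is what pushed me toward precomputing pairs in the first place.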

What’s the current state of the art for producing skipgrams fast enough to keep up with embedding training?
