Improving Dataflow Pipelines for Text Data Processing

Sayak_Paul · March 3, 2022, 10:23am

This is something we (@anon1529149 and I) worked on at Carted: improving text data processing at scale with Apache Beam and Cloud Dataflow.

Blog post:

Code:

We use some tools from the TensorFlow ecosystem, such as a BERT model from TensorFlow Hub and TFRecords for serializing the preprocessed data. I hope this is useful to the community: with these techniques, we were able to reduce the total wall-clock time from more than 3 days to under 3 hours.

We further optimized the BERT model we used in the blog post with ONNX (since we run on CPUs), and the whole pipeline now takes around 1 hr 45 mins.

lgusm · March 3, 2022, 11:48am

This is super cool!!! Congrats!

Question: why does the last step make the model better? What changed in the model? Does it replace ops with ones optimised for CPU?

Sayak_Paul · March 3, 2022, 12:04pm

Do you mean the ONNX conversion step? If so, it is because ONNX performs layer fusion, replaces layers that produce constant values with the precomputed constants, etc. This simplifies the model graph, which in turn reduces the latency.
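
To make the "layers producing constant values" part concrete, here is a toy sketch of constant folding in plain Python. It is only an illustration of the idea and not how ONNX or any runtime implements it; the `Const`, `Input`, and `Op` classes and the `fold_constants` and `count_ops` helpers are hypothetical names made up for this example.

```python
import operator
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    """A node whose value is known before any input arrives."""
    value: float


@dataclass(frozen=True)
class Input:
    """A node whose value is only known at inference time."""
    name: str


@dataclass(frozen=True)
class Op:
    """An operation node ("add" or "mul") applied to child nodes."""
    kind: str
    args: Tuple["Node", ...]


Node = Union[Const, Input, Op]

_OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(node: Node) -> Node:
    """Replace every subgraph that depends only on constants with one Const."""
    if not isinstance(node, Op):
        return node
    folded = tuple(fold_constants(arg) for arg in node.args)
    if all(isinstance(arg, Const) for arg in folded):
        result = folded[0].value
        for arg in folded[1:]:
            result = _OPS[node.kind](result, arg.value)
        return Const(result)
    return Op(node.kind, folded)


def count_ops(node: Node) -> int:
    """Count the operation nodes that would still run at inference time."""
    if not isinstance(node, Op):
        return 0
    return 1 + sum(count_ops(arg) for arg in node.args)


# Toy graph for: x * (2 + 3) + (4 * 0.5)
graph = Op("add", (
    Op("mul", (Input("x"), Op("add", (Const(2.0), Const(3.0))))),
    Op("mul", (Const(4.0), Const(0.5))),
))
optimized = fold_constants(graph)

print(count_ops(graph), "ops before folding")
print(count_ops(optimized), "ops after folding")
```

The two constant-only subgraphs are computed once ahead of time, so fewer ops have to run per input. Layer fusion is a separate optimization that merges adjacent ops into one and is not shown here.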

lgusm · March 3, 2022, 2:23pm

Yes, it was the ONNX conversion step, thanks!