[help_request] Working with large datasets

Malina_Klein · July 6, 2022, 11:45am

Hello all,
I am fairly new to this subject and I have a rather general question.
I am using a very large dataset (“wmt19_translate/de-en”, more than 38 million examples) for a project. It is a German-English translation dataset.
I have already downloaded it and can now load it via tfds.load().
My question is how best to keep working with the loaded dataset without every computation taking forever.
For example, I need to determine the length of each element, and I want to search for certain words.
Saving the elements in a list and then working with that list takes a long time.
Should I work with dataframes or datasets, or is there a better way to read and process data of this size?
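
To make the two operations concrete, here is a minimal sketch of what I am trying to compute. It is plain Python running on a tiny made-up sample, not on the real dataset, and it is not necessarily the right approach at this scale. The names `SAMPLE`, `lengths` and `containing` are hypothetical, and counting length in whitespace-separated words is an assumption (characters would work the same way). In the real case the pairs would come from iterating over the loaded dataset rather than from a list.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (German sentence, English sentence)

# Made-up sample standing in for the real dataset (assumption).
SAMPLE = [
    ("Das ist ein Haus.", "This is a house."),
    ("Ich mag Katzen.", "I like cats."),
    ("Das Haus ist groß.", "The house is big."),
]


def lengths(pairs: Iterable[Pair]) -> Iterator[Tuple[int, int]]:
    """Yield (German word count, English word count), one pair at a time."""
    for de, en in pairs:
        yield len(de.split()), len(en.split())


def containing(pairs: Iterable[Pair], word: str) -> Iterator[Pair]:
    """Yield only the pairs whose English side contains `word` (case-insensitive)."""
    needle = word.lower()
    for de, en in pairs:
        # Strip simple punctuation so "house." matches "house".
        tokens = (t.strip(".,!?;:").lower() for t in en.split())
        if needle in tokens:
            yield de, en


# Materialised here only because the sample is tiny; on the full data
# the generators would be consumed element by element instead.
length_list = list(lengths(SAMPLE))
house_pairs = list(containing(SAMPLE, "house"))
print(length_list)
print(house_pairs)
```

Both functions are generators, so they handle one element at a time and never hold all 38 million pairs in memory, which is what the list-based version ends up doing.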

Thank you!

Malina