How to serialize/deserialize a Dataset object while capturing its state

I am looking for a way to serialize/deserialize a Dataset object that captures its state without computing the pipeline.

A straightforward way to serialize a Dataset would be to call the save method and then deserialize with load, but saving like this is not exactly serializing. Calling save forces a compute, so any map/filter/etc. methods in the pipeline are executed. I'd like to be able to store the state of the Dataset to disk so that another process can load it later and get a state identical to the one at the time of serialization.

Maybe iterator checkpointing is the approach I should try?

I found this GitHub issue on the topic, but it has no solution to the original problem.

Thanks for any help.


The more I read about this, the more I think my original goal is not practical. The original pipeline may involve Python functions used in transformation functions (map, filter, etc.) that will not necessarily be available in another process reading the serialized Dataset.
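To illustrate the concern about Python functions, here is a small sketch using only the standard library's pickle module. It is an analogy, not the Dataset API. It assumes the transformation functions would have to be serialized by something pickle-like. `math.sqrt` stands in for any importable, module-level function, and the lambda stands in for a typical inline map/filter callback.

```python
import math
import pickle

# An importable function is pickled by reference: the payload stores only the
# module and function name, not the function's code.
payload = pickle.dumps(math.sqrt)
restored = pickle.loads(payload)  # works only because `math` is importable here

# A lambda has no importable name, so pickle refuses to serialize it.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(b"sqrt" in payload, restored is math.sqrt, lambda_pickled)
```

So even when serialization succeeds, the loading process still needs the same function definitions importable under the same names, which is exactly the availability problem described above.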

I think the best options are simply to use save and load, or iterator checkpointing. Neither really achieves my original goal, but I don't think that goal is realistic.

I’d be interested to hear from others with their thoughts on this.