Developer workflow for tensorflow/tfx

Hello,

I’m trying to improve my workflow for supervised learning in the tf ecosystem, and was wondering if anyone has encountered/solved this problem.

I'm able to get a fully functional pipeline to execute with LocalDagRunner; however, because that runner is designed for toy examples (from a memory perspective), I hit limits I wouldn't otherwise hit given how Keras/Beam are designed. So I'm trying to move to other DagRunners, with the idea of doing some amount of local development and then switching to Vertex AI when I need big scale.

With that in mind, I know Kubeflow just came out with a major release (v2). Running a TFX pipeline requires persistent storage that's accessible across the components. How is this problem solved in Kubeflow? I tried setting up MinIO but ran into some issues.

Looking forward to any advice anyone might have. Thank you!

Pritam

Maybe @Robert_Crowe can share some tips here.

The problem is the need for persistent storage that’s accessible across the components?

I don’t know how Kubeflow might solve that, but it seems fairly basic so I’m not sure that they would see it as a problem. I’d check the Kubeflow docs to see if there is any mention of that.

@pritamdodeja
Check this out
tfx/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_local.py at 07f5abbff0d1ebbc8c837043d0284ec63f7b7939 · tensorflow/tfx (github.com)
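For anyone skimming, the core of that example boils down to this pattern (a minimal sketch; `create_pipeline` stands in for the example's own pipeline-construction function, and the names are illustrative):

```python
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Compile the TFX pipeline into a Kubeflow Pipelines workflow file,
# which you then upload to (or run on) your Kubeflow cluster.
runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig()
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
    create_pipeline(
        pipeline_name='chicago_taxi_pipeline_kubeflow_local',
        pipeline_root='/mnt/tfx/pipeline_root',  # any shared filesystem works
    )
)
```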

Thank you for this @Aditya_Soni! I’ll try and get this to work and report back here.

Hi @pritamdodeja

Transitioning from LocalDagRunner to a more scalable runner is a common step when moving from development to production in the TFX ecosystem. You're on the right track considering other DagRunners for scalability.

Regarding Kubeflow v2, it indeed offers enhanced capabilities and integration with TFX. For the persistent storage that’s accessible across TFX components in Kubeflow:

  1. Persistent Volume Claims (PVCs): Kubeflow Pipelines with TFX often use Kubernetes PVCs to provide a shared storage space for all pipeline components. This ensures data consistency and accessibility across different stages of the pipeline (see the first sketch after this list).

  2. MinIO: It’s a popular choice for local development as it provides an S3-compatible storage backend. If you’re facing issues with MinIO, could you specify the problems? It might be related to configuration or access permissions.

  3. Cloud Storage: When switching to Vertex AI or other cloud platforms, you can leverage their native storage solutions, like Google Cloud Storage (GCS) for Vertex AI. This provides a seamless and scalable storage backend for TFX pipelines (second sketch below).

  4. Kubeflow Metadata Store: The metadata store in Kubeflow can be backed by MySQL or SQLite, ensuring persistent storage of metadata across runs (third sketch below).
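For item 1, a minimal sketch of mounting a shared PVC into every component pod, assuming a claim named `tfx-pvc` already exists in the cluster (all names here are illustrative, and `create_pipeline` is a placeholder for your own pipeline-construction function):

```python
from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Mount an existing PersistentVolumeClaim into every pod the pipeline runs,
# so all components see the same filesystem under /mnt/tfx.
config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=(
        kubeflow_dag_runner.get_default_pipeline_operator_funcs()
        + [onprem.mount_pvc('tfx-pvc', 'tfx-volume', '/mnt/tfx')]
    ),
)

# With the claim mounted, the pipeline root can live on shared storage.
kubeflow_dag_runner.KubeflowDagRunner(config=config).run(
    create_pipeline(pipeline_root='/mnt/tfx/pipeline_root')
)
```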
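For item 3, a sketch of compiling the same pipeline for Vertex AI Pipelines with a GCS pipeline root (bucket, project, and region are illustrative):

```python
from tfx import v1 as tfx

# Compile the pipeline into the Kubeflow Pipelines v2 (IR) format.
tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename='pipeline.json',
).run(create_pipeline(pipeline_root='gs://my-bucket/tfx_root'))

# Submit the compiled definition to Vertex AI Pipelines.
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')
aiplatform.PipelineJob(
    display_name='my-tfx-pipeline',
    template_path='pipeline.json',
).run()
```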
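And for item 4, a sketch of pointing the metadata store at MySQL instead of the default SQLite file (host and credentials are illustrative):

```python
from tfx.orchestration import metadata, pipeline

# Connection config for a MySQL-backed ML Metadata store.
metadata_config = metadata.mysql_metadata_connection_config(
    host='mysql.kubeflow.svc.cluster.local',
    port=3306,
    database='tfx_metadata',
    username='tfx',
    password='tfx-password',
)

my_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root='/mnt/tfx/pipeline_root',
    components=my_components,  # placeholder: your list of TFX components
    metadata_connection_config=metadata_config,
)
```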

For a smoother transition:

  • Ensure that your TFX components are designed to read and write from the specified storage backend.
  • Test the pipeline’s connectivity and permissions to the storage backend in the initial stages to avoid potential issues later on (see the smoke-test sketch below).
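As a concrete version of that second bullet, here is a quick smoke test (a sketch; the root path is illustrative) using tf.io.gfile, which speaks the same filesystem schemes TFX uses internally:

```python
import tensorflow as tf

pipeline_root = 's3://mlpipeline/tfx_root'  # or gs://..., or a mounted path
probe = pipeline_root + '/.connectivity_probe'

# Write, read back, and delete a tiny file at the pipeline root to verify
# connectivity and permissions before launching a full pipeline run.
with tf.io.gfile.GFile(probe, 'w') as f:
    f.write('ok')
assert tf.io.gfile.exists(probe)
with tf.io.gfile.GFile(probe, 'r') as f:
    assert f.read() == 'ok'
tf.io.gfile.remove(probe)
print('storage backend is reachable and writable')
```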

I hope this provides some clarity. If you can share more specific issues you faced with MinIO or any other part of the setup, the community might be able to provide more targeted advice.

Good luck!

@Elgridge Thank you very much for this; I sincerely appreciate the detail and thought you have put in. I'm traveling for work at the moment, and I will provide more detailed context once I am better positioned. Thank you!

Regards,

Pritam

@Elgridge
I am facing an issue while working with MinIO.

My CSV file is stored in a MinIO location, in a folder named "data":

minio://mlpipeline/data

I am providing the above string as the data root to CsvExampleGen.

While compiling the pipeline with KubeflowDagRunner, I get the error:

File system 'minio' not implemented
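That error means no minio:// filesystem is registered with TensorFlow/TFX. MinIO is S3-compatible, though, so the usual workaround is to address the bucket with the s3:// scheme and point TensorFlow's S3 filesystem at the MinIO endpoint via environment variables. A sketch, using the common Kubeflow defaults for the in-cluster MinIO service (verify the endpoint and credentials for your install, and note the variables must be set in the environment where the components actually execute):

```python
import os

# MinIO speaks the S3 protocol; these variables point TensorFlow's S3
# filesystem at the in-cluster MinIO service. The values shown are the
# common Kubeflow defaults -- verify yours.
os.environ['AWS_ACCESS_KEY_ID'] = 'minio'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'minio123'
os.environ['S3_ENDPOINT'] = 'minio-service.kubeflow:9000'
os.environ['S3_USE_HTTPS'] = '0'   # plain HTTP inside the cluster
os.environ['S3_VERIFY_SSL'] = '0'

# On TF >= 2.6 the S3 filesystem lives in tensorflow-io, so this import
# may be needed to register the s3:// scheme.
import tensorflow_io  # noqa: F401

from tfx.components import CsvExampleGen

# Use s3:// instead of minio:// for the data root.
example_gen = CsvExampleGen(input_base='s3://mlpipeline/data')
```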