CsvExampleGen and FileBasedExampleGen taking too long to process data

When I try to ingest data into TFX with the code below, it takes too long: 15 files totaling 277 MB take 13 minutes on a virtual machine with a decent CPU and 16 GB of RAM (no GPU).

Link to data used: https://github.com/petrobras/3W/tree/main/dataset/7

import pathlib

from tfx.v1.components import CsvExampleGen
from tfx.v1.dsl import Pipeline
from tfx.v1.orchestration import LocalDagRunner
from tfx.v1.orchestration import metadata

test_input_dir = pathlib.Path('data/3w_repo/dataset/7/')
example_gen = CsvExampleGen(input_base=str(test_input_dir))

components = [
    example_gen,
    # statistics_gen,
    # schema_gen,
]

pipeline = Pipeline(
    pipeline_name="pipeline_name",
    pipeline_root="data/pipeline_test_root",
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        "data/metadata_path"),
    components=components,
)

LocalDagRunner().run(pipeline)

I also tried FileBasedExampleGen on the same data (converted to .parquet) and got similarly long processing times just for ingesting the data.
My total dataset is 4.8 GB, so at this rate, just loading the data with TFX would take about 3.5 hours.

Am I making errors in the code to ingest this data?
Are there ways to make this speedier?

The slowness you’re seeing with CsvExampleGen and FileBasedExampleGen can stem from several factors. Your ingestion code looks correct, so the bottleneck is more likely in how TFX processes the data, the layout of your data files, or the configuration of your environment. In particular, LocalDagRunner executes components with Apache Beam’s DirectRunner, which is built for testing and runs with a single worker by default, so it rarely saturates a multi-core machine. Here are some suggestions to speed up the data ingestion:

  1. Check Data Format and Structure

    • Ensure that your CSV or Parquet files are well-structured and optimized for reading. For example, large numbers of small files can significantly slow down processing compared to fewer, larger files.
    • If you’re using CSV files, consider converting them once into TFRecord files of serialized tf.train.Example protos and ingesting those with ImportExampleGen, which avoids re-parsing CSV text on every pipeline run; see the sketch below.
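
A minimal conversion sketch, assuming the CSV columns are all numeric and each file fits in memory with pandas (the paths and file naming here are placeholders):

import pathlib

import pandas as pd
import tensorflow as tf

csv_dir = pathlib.Path("data/3w_repo/dataset/7/")   # source CSV files
out_dir = pathlib.Path("data/3w_tfrecord/")         # destination for TFRecord files
out_dir.mkdir(parents=True, exist_ok=True)

def row_to_example(row):
    # Store every column as a single float feature; adapt for non-numeric columns.
    feats = {
        name: tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))
        for name, value in row.items()
    }
    return tf.train.Example(features=tf.train.Features(feature=feats))

for csv_path in sorted(csv_dir.glob("*.csv")):
    df = pd.read_csv(csv_path).fillna(0.0)
    with tf.io.TFRecordWriter(str(out_dir / f"{csv_path.stem}.tfrecord")) as writer:
        for _, row in df.iterrows():
            writer.write(row_to_example(row).SerializeToString())

The resulting directory can then be passed to ImportExampleGen(input_base=str(out_dir)), available as tfx.v1.components.ImportExampleGen, in place of CsvExampleGen.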

  2. Optimize File-Based ExampleGen

    • For FileBasedExampleGen with Parquet files, make sure the files themselves are written efficiently (sensible row-group sizes, a fast compression codec such as Snappy) to speed up reading.
    • Use input_config to control which files end up in which split; for Parquet input, FileBasedExampleGen also has to be paired with the Parquet executor that ships with TFX, as sketched below.
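
A sketch of that wiring, following the pattern shown in the TFX guide for custom executors (the directory path and split pattern are placeholders, and exact import paths can vary between TFX versions):

from tfx.components import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec
from tfx.proto import example_gen_pb2

# One split that picks up every Parquet file under input_base.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='single_split', pattern='*.parquet'),
])

example_gen = FileBasedExampleGen(
    input_base='data/3w_parquet/',
    input_config=input_config,
    custom_executor_spec=executor_spec.BeamExecutorSpec(parquet_executor.Executor),
)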

  3. Data Pipeline Parallelization

    • With LocalDagRunner, TFX components run on Apache Beam’s DirectRunner, which uses a single worker by default. Raising the worker count and switching to multi-process execution is usually the single biggest win here; see the Beam pipeline arguments sketch under point 9 for how to configure this.

  4. Utilize Efficient Storage

    • Ensure that the storage where your data resides (e.g., SSDs vs. HDDs) and the way your virtual machine accesses this storage are optimized for high read/write speeds.

  5. Profile and Optimize Code

    • Use profiling tools to identify where the bottlenecks occur. For a locally run Beam pipeline, wrapping the pipeline run in a standard Python profiler is often the most direct way to see whether the time goes into CSV parsing, TFRecord writing, or something else; a sketch follows.
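
A quick profiling sketch, assuming the pipeline object from the question is already defined in the same script:

import cProfile
import pstats

# Profile the whole pipeline run and print the 30 most expensive call sites.
profiler = cProfile.Profile()
profiler.enable()
LocalDagRunner().run(pipeline)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)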

  6. Hardware Utilization

    • While you mentioned that no GPU is involved and you have a decent CPU, it’s worth checking that your setup actually uses all available cores: with the default DirectRunner settings most of the work runs in a single process, and the Beam worker options in point 9 are the main lever for changing that.

  7. Version Checks

    • Ensure you’re using the latest versions of TFX and its dependencies, as performance improvements are continually made.
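
For reference, you can print the installed versions from Python like this:

import apache_beam as beam
import tensorflow as tf
import tfx

print("TFX:", tfx.__version__)
print("TensorFlow:", tf.__version__)
print("Apache Beam:", beam.__version__)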

  8. Consider Data Sharding

    • If your dataset is very large, consider sharding it into smaller files that Beam can read in parallel, thereby reducing the load on any single process; a sketch follows.
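
A re-sharding sketch with pandas, assuming a hypothetical single large CSV (the chunk size and paths are arbitrary placeholders):

import pathlib

import pandas as pd

src = pathlib.Path("data/big_file.csv")   # hypothetical large CSV
dst = pathlib.Path("data/sharded/")
dst.mkdir(parents=True, exist_ok=True)

# Write one shard per 100,000 rows so Beam can read the shards in parallel.
for i, chunk in enumerate(pd.read_csv(src, chunksize=100_000)):
    chunk.to_csv(dst / f"{src.stem}-{i:05d}.csv", index=False)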

  9. Experiment with Beam Pipeline Arguments

    • TFX uses Apache Beam under the hood for data processing. You can pass custom Beam pipeline arguments via the pipeline’s beam_pipeline_args to tune the DirectRunner, for example enabling multi-process execution and letting Beam use all available cores, as sketched below.
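
A sketch of the relevant DirectRunner options, reusing the imports, components list, and Pipeline definition from the question (the actual speed-up depends on your TFX and Beam versions):

# Assumes the imports and `components` list from the code in the question.
pipeline = Pipeline(
    pipeline_name="pipeline_name",
    pipeline_root="data/pipeline_test_root",
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        "data/metadata_path"),
    components=components,
    beam_pipeline_args=[
        # Run bundles in separate processes instead of a single thread.
        "--direct_running_mode=multi_processing",
        # 0 tells the DirectRunner to derive the worker count from the CPU count.
        "--direct_num_workers=0",
    ],
)

LocalDagRunner().run(pipeline)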

  10. Review TFX Documentation and Community Insights

    • The TFX community and documentation might have specific recommendations for dealing with large datasets and optimizing pipeline performance.

If you’ve tried these optimizations and still face performance issues, it might be helpful to share specific details about your data and setup on platforms like GitHub issues or Stack Overflow, where the community and TFX developers can provide more targeted advice.