I am trying to ingest data into TFX with the code below, but it takes too long: 15 files totalling 277 MB take 13 minutes on a virtual machine with a decent CPU and 16 GB of RAM (no GPU).
Link to data used: https://github.com/petrobras/3W/tree/main/dataset/7
from tfx.v1.components import CsvExampleGen
from tfx.v1.dsl import Pipeline
from tfx.v1.orchestration import metadata
from tfx.v1.orchestration import LocalDagRunner
import pathlib

test_input_dir = pathlib.Path('data/3w_repo/dataset/7/')

example_gen = CsvExampleGen(input_base=str(test_input_dir))

components = [
    example_gen,
    # statistics_gen,
    # schema_gen,
]

pipeline = Pipeline(
    pipeline_name="pipeline_name",
    pipeline_root="data/pipeline_test_root",
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        "data/metadata_path"
    ),
    components=components,
)

LocalDagRunner().run(pipeline)
I also attempted to use FileBasedExampleGen on the same data (converted to Parquet) and got similarly slow processing times just for ingesting the data.
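For reference, the Parquet attempt was essentially the same pipeline with only the ExampleGen component swapped out, roughly like this (a sketch of what I ran; the Parquet input path here is illustrative):

from tfx.components import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.dsl.components.base import executor_spec

# Replaces CsvExampleGen in the components list above; everything else unchanged.
parquet_example_gen = FileBasedExampleGen(
    input_base='data/3w_repo/dataset_parquet/7/',  # illustrative path to the converted files
    custom_executor_spec=executor_spec.BeamExecutorSpec(parquet_executor.Executor),
)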
My total dataset is 4.8 GB, so at this rate just loading the data with TFX would take about 3.5 hours.
Am I making a mistake in the ingestion code?
Are there ways to speed this up?