StatisticsGen from examples saved in GCS?

Hello, I’ve run my TFX pipeline on Vertex AI, and the output from the ExampleGen component has been saved to a location in GCS. Is there a way to load those examples into an InteractiveContext within a notebook so I can view the Facets output?

For example:

statistics_gen = StatisticsGen(examples="? examples processed and saved in GCS")
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])

Or, is there a way of loading the statistics that have already been emitted by the pipeline?
It looks like the stats were saved to these locations:

StatisticsGen/statistics/40/Split-train
StatisticsGen/statistics/40/Split-eval

Could I use the metadata DB for this purpose?

So many questions 🙂 Thanks!

The ExampleGen component outputs dataset artifacts, which are TFRecord files. You can import a previously generated artifact by using the ImporterNode (named Importer in newer TFX releases). Here’s an example of importing a previously generated schema, but a dataset would be similar.
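On the second question, about statistics the pipeline has already written: one option that may work is to read the files directly with TensorFlow Data Validation instead of re-running StatisticsGen. The following is only a minimal sketch. The helper names split_stats_path and show_split_stats are hypothetical, the Split-<name> layout is taken from the paths quoted in the question, and the file name FeatureStats.pb is an assumption that differs between TFX versions (older releases may write a TFRecord file, in which case tfdv.load_statistics would be the call to try). Reading from gs:// paths also assumes the notebook has GCS credentials.

```python
import posixpath


def split_stats_path(statistics_uri, split, filename="FeatureStats.pb"):
    """Build the path to one split's statistics file.

    The Split-<name> folder layout follows the paths quoted in the question.
    The default file name is an assumption; check what your TFX version wrote.
    """
    return posixpath.join(statistics_uri, f"Split-{split}", filename)


def show_split_stats(statistics_uri, lhs="train", rhs="eval"):
    """Render the Facets overview for two splits side by side."""
    # Imported lazily so the path helper works without TFDV installed.
    import tensorflow_data_validation as tfdv

    lhs_stats = tfdv.load_stats_binary(split_stats_path(statistics_uri, lhs))
    rhs_stats = tfdv.load_stats_binary(split_stats_path(statistics_uri, rhs))
    tfdv.visualize_statistics(
        lhs_statistics=lhs_stats,
        rhs_statistics=rhs_stats,
        lhs_name=lhs,
        rhs_name=rhs,
    )
```

In a notebook you would pass the artifact directory that ends in StatisticsGen/statistics/40 (with whatever bucket and pipeline root prefix applies) to show_split_stats.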

Hi Robert and thanks for pointing me in the right direction!
This is the code I used to load the generated TFRecords (these were generated when I ran my pipeline locally) and run the StatisticsGen component over them with an InteractiveContext:

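import os

from tfx.components import StatisticsGen
from tfx.dsl.components.common.importer import Importer
from tfx.types import standard_artifacts

# `context` is assumed to be an existing InteractiveContext.
# Directory holding the train and eval TFRecord files,
# for example /your-pipeline/BigQueryExampleGen/examples/NN
source = os.path.join('/path to eval and train TFRecord files')
importer = Importer(
    source_uri=source,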
    artifact_type=standard_artifacts.Examples,
    properties={
        'span': 0,
        'split_names': '["train", "eval"]',
        'version': 0
    }
)
context.run(importer)
print(f'importer.outputs: {importer.outputs}')

statistics_gen = StatisticsGen(examples=importer.outputs['result'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
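One detail in the snippet above is easy to miss: split_names is passed as a JSON-encoded string, not as a Python list. A small sketch of building that properties dict with json.dumps, so the quoting does not have to be typed by hand. The helper name examples_properties is hypothetical, and the keys simply mirror the ones used in the post.

```python
import json


def examples_properties(split_names, span=0, version=0):
    """Build the properties dict for an imported Examples artifact.

    split_names is stored as a JSON-encoded string, so the list of
    split names is serialized with json.dumps rather than passed as-is.
    """
    return {
        "span": span,
        "split_names": json.dumps(list(split_names)),
        "version": version,
    }


# The result can be passed as the `properties` argument of Importer.
print(examples_properties(["train", "eval"]))
```

If the imported directory has different splits, only the list passed to the helper needs to change.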