TFX - anomaly component

Lilya_YAHIAOUI · June 19, 2023, 2:16am

I recently started learning TFX. I don't understand why I get anomalies when I generate statistics and a schema from the train set and then run TFDV validate_statistics on the validation set against that schema.
However, when I generate TFRecords for train, val, and test at the same time using example_gen_pb2 and CsvExampleGen, then generate statistics with StatisticsGen, create the schema from the StatisticsGen outputs, and finally run ExampleValidator, no anomalies are shown.

I would like to understand why generating statistics and a schema from train, val, and test, and then running ExampleValidator, doesn't produce any anomalies.

chunduriv · June 20, 2023, 9:13am

@Lilya_YAHIAOUI,

Welcome to the TensorFlow Forum!

Can you share the anomaly that was detected by TFDV but not by ExampleValidator? The ExampleValidator component internally uses TensorFlow Data Validation to validate the statistics of some splits of the input examples against a schema.

The ExampleValidator component only identifies anomalies in training and serving data, whereas TFDV can additionally detect training-serving skew and data drift. Also, make sure you use the same schema for anomaly detection in both TFDV and the ExampleValidator component.

Thank you!

Lilya_YAHIAOUI · June 20, 2023, 5:41pm

@chunduriv
I'm learning TFDV. While exploring different ways of validating data, I didn't get the same results.
ExampleValidator didn't detect a missing column in the test set, or values that are absent from the train set but present in the validation and test sets.
Here's how I used ExampleValidator:

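import os

from tfx.components import CsvExampleGen, ExampleValidator, SchemaGen, StatisticsGen
from tfx.proto import example_gen_pb2
from tfx.utils.dsl_utils import external_input  # available in older TFX releases

# context (an InteractiveContext), base_dir and data_dir are assumed to be
# defined earlier in the notebook.

# One named split per file pattern.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern='train*'),
    example_gen_pb2.Input.Split(name='val', pattern='val*'),
    example_gen_pb2.Input.Split(name='test', pattern='test*'),
])

examples = external_input(os.path.join(base_dir, data_dir))
examples_gen = CsvExampleGen(input=examples, input_config=input_config)
context.run(examples_gen)
# Statistics are generated for the splits produced by CsvExampleGen.
statistics_gen = StatisticsGen(examples=examples_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
# The schema is inferred from the StatisticsGen output.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])
# Validate the statistics against the inferred schema.
example_validator = ExampleValidator(statistics=statistics_gen.outputs['statistics'], schema=schema_gen.outputs['schema'])
context.run(example_validator)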

In another notebook, I tried TFDV:
import tensorflow_data_validation as tfdv

# Infer the schema from the training statistics only.
train_stats = tfdv.generate_statistics_from_csv(data_location='./data/train/titanic_train.csv')
tfdv.visualize_statistics(train_stats)
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema)

# Validate the validation statistics against the train-only schema.
val_stats = tfdv.generate_statistics_from_csv(data_location='./data/val/titanic_val.csv')
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')
anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
tfdv.display_anomalies(anomalies)

chunduriv · June 22, 2023, 11:23am

@Lilya_YAHIAOUI,

The issue seems to be related to how the schema is saved in the TFX pipeline versus TFDV. To check, can you save the schema generated with TFDV as shown here: TensorFlow Data Validation | TFX, load this saved schema in the TFX SchemaGen component as shown here: The SchemaGen TFX Pipeline Component | TensorFlow, and share your observations? This will help us understand the issue better.
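
To illustrate why the choice of schema matters, here is a toy sketch in plain Python. It is not how TFDV or TFX is implemented, and the rows, column names, and the helper functions infer_schema and validate are all hypothetical. It only mimics the idea that a schema inferred from the train split alone flags new values and missing columns in other splits, while a schema inferred from all splits has already absorbed them.

```python
# Toy model of schema inference and validation (NOT TFDV's implementation).

def infer_schema(*splits):
    """Record, per column, the values seen and whether every row has it."""
    rows = [row for split in splits for row in split]
    schema = {}
    for row in rows:
        for column, value in row.items():
            spec = schema.setdefault(column, {"values": set(), "required": True})
            spec["values"].add(value)
    # A column is only required if it is present in every row used for inference.
    for column, spec in schema.items():
        spec["required"] = all(column in row for row in rows)
    return schema


def validate(rows, schema):
    """Return a sorted list of anomaly descriptions for one split."""
    anomalies = set()
    for row in rows:
        for column, spec in schema.items():
            if column not in row:
                if spec["required"]:
                    anomalies.add(f"missing column: {column}")
            elif row[column] not in spec["values"]:
                anomalies.add(f"unexpected value in {column}: {row[column]}")
    return sorted(anomalies)


# Hypothetical data, loosely shaped like the Titanic columns.
train = [{"Sex": "male", "Embarked": "S"}, {"Sex": "female", "Embarked": "C"}]
val = [{"Sex": "male", "Embarked": "Q"}]  # "Q" never appears in train
test = [{"Sex": "female"}]                # "Embarked" column is missing

schema_train_only = infer_schema(train)
schema_all_splits = infer_schema(train, val, test)

val_vs_train_schema = validate(val, schema_train_only)
test_vs_train_schema = validate(test, schema_train_only)
val_vs_all_schema = validate(val, schema_all_splits)
test_vs_all_schema = validate(test, schema_all_splits)

print(val_vs_train_schema, test_vs_train_schema)
print(val_vs_all_schema, test_vs_all_schema)
```

In this toy model only the train-only schema reports anomalies. If the same effect is at play in the pipeline, it may be worth checking which splits SchemaGen used to infer the schema. SchemaGen accepts an exclude_splits argument, and something like exclude_splits=['val', 'test'] might bring it closer to the train-only TFDV notebook, though that is an untested assumption for this dataset.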

Thank you!