How can we use the schema from the SchemaGen component in the Trainer component?

In the Trainer component, the examples I am passing come from the ExampleGen component.
I am also passing the schema from the SchemaGen component.

Now, how can I use this schema in my input function (defined in the module file)?

I need to access my SchemaGen output in the same way the Transform output is accessed in this example: tfx/tfx/examples/chicago_taxi_pipeline/taxi_utils_native_keras.py at master · tensorflow/tfx (github.com)

I also want to pass this schema to the input function so that it can be used there (example: tfx/docs/tutorials/tfx/penguin_simple.ipynb at master · tensorflow/tfx (github.com))

A custom feature spec is used in Simple TFX Pipeline Tutorial using Penguin dataset  |  TensorFlow. How can I use the schema from the SchemaGen component there instead?

It’s unusual to use the examples from ExampleGen directly in the Trainer component, but if you do, the schema from SchemaGen will match them. Usually you would send the transformed examples from the Transform component, and the schema from Transform, to Trainer.

@Robert_Crowe
In the following TFX example on GitHub (tfx/docs/tutorials/tfx/penguin_simple.ipynb at master · tensorflow/tfx (github.com)):

The Transform component is not used; example_gen.outputs['examples'] is passed directly to the Trainer component as examples.
Also, no schema is generated there. Instead, a feature spec is written by hand because there is only a fairly small number of features.

What should we do if we have a large number of features? In my case I am using a dataset with 100+ columns.

Is it possible to use the schema (generated by SchemaGen) in the Trainer component?
(schema is an argument of the Trainer component: tfx.v1.components.Trainer | TFX | TensorFlow)

If yes, how?

Is it possible to use the schema (generated from schema gen ) into the trainer component ?

Yes, that’s exactly what SchemaGen is for. The reason that example does it that way was that we were trying to keep the pipeline as small as possible. It’s not the normal way of doing things.

@Robert_Crowe
Can you provide any reference or documentation for this (using the schema generated by SchemaGen in the Trainer component)?

In the provided example from the TensorFlow Extended (TFX) GitHub repository, it’s true that the Transform component hasn’t been used, and the example_gen.outputs['examples'] has been directly passed to the Trainer component as examples. This approach can be suitable when you have a relatively small number of features and don’t need to generate a schema. However, when dealing with a dataset containing 100+ columns or a large number of features, it’s often beneficial to use a schema.

Using a schema generated from SchemaGen in the Trainer component is indeed possible in TFX. A schema provides valuable information about the expected data types and properties of your features, which can be helpful for validation and preprocessing.

Here’s how you can use the schema generated from SchemaGen in the Trainer component:

  1. Generate the Schema: First, make sure you have a SchemaGen component in your TFX pipeline. This component takes the statistics produced by StatisticsGen and infers a schema from them.
  2. Pass the Schema to Trainer: When you create the Trainer component in your pipeline definition, pass the SchemaGen output through the schema argument (schema=schema_gen.outputs['schema']). Inside the module file, the location of the schema is then available to run_fn as fn_args.schema_path.

Here’s a simplified example of how to configure your Trainer component to use the schema:

from tfx import v1 as tfx

# Assumes example_gen and schema_gen are components defined earlier
# in the same pipeline.
trainer = tfx.components.Trainer(
    module_file=module_file,  # path to the module file that defines run_fn
    examples=example_gen.outputs['examples'],
    # Pass the SchemaGen output channel, not a path to a schema file.
    schema=schema_gen.outputs['schema'],
    # Other arguments (train_args, eval_args, ...)
)

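Passing schema_gen.outputs['schema'] as the schema argument makes the schema available to your training code: inside run_fn in the module file, its location is exposed as fn_args.schema_path. The Trainer component does not validate or preprocess the data with the schema on its own; your module file decides how to use it, for example to parse the input examples.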

Using the schema in the Trainer component is particularly beneficial when dealing with a large number of features, because feature names and types come from the schema instead of a hand-maintained feature spec.

