TensorFlow Decision Forests with TFX (model serving and evaluation)

Right. Hopefully someone from the TensorFlow team can address this issue :crossed_fingers:

Mason/Ed, apologies for the inconvenience. Indeed these things should be better integrated, or easier to integrate.

We're adding this issue (integration into TFX's Evaluator and Trainer on Vertex AI in particular) high on our TODO list. We'll keep this post updated and provide an ETA once we figure out what is needed.


Hi @Ed_Park and @anon43767231 ,

Sorry for the delay. It turns out I had some incorrect assumptions about Ed's pipeline in my original response; @anon43767231 is correct that one has to use a custom container that includes the TFDF library. The solution to the Evaluator issue is indeed Dataflow + a custom container.

@Ed_Park, thanks for your log; now I understand the problem. You have the line "Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner." in your log, which suggests that you may not have set beam_pipeline_args in [1]. To use a custom container you have to run your Beam job on Dataflow and add at least the following Beam args to beam_pipeline_args:

"--runner=dataflow",
"--experiments=use_runner_v2",
f"--sdk_container_image={custom_image}",
"--sdk_location=container",

Please refer to [2] for the other required options and [3] for details about using custom containers in Dataflow; a minimal pipeline sketch follows the links below.

[1] tfx.v1.dsl.Pipeline  |  TFX  |  TensorFlow
[2] Cloud Dataflow Runner
[3] Use custom containers in Dataflow  |  Google Cloud
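For illustration, here is a minimal sketch of how those args can be wired into the pipeline definition from [1] (the image URI, pipeline name, root, and component list below are placeholders, not taken from Ed's actual code):

from tfx import v1 as tfx

custom_image = "gcr.io/my-project/tfx-with-tfdf:latest"  # placeholder image URI

beam_pipeline_args = [
    "--runner=dataflow",
    "--experiments=use_runner_v2",
    f"--sdk_container_image={custom_image}",
    "--sdk_location=container",
    # plus the usual Dataflow options: --project, --region, --temp_location, ...
]

pipeline = tfx.dsl.Pipeline(
    pipeline_name="my-tfdf-pipeline",              # placeholder
    pipeline_root="gs://my-bucket/pipeline-root",  # placeholder
    components=components,                         # your list of TFX components
    beam_pipeline_args=beam_pipeline_args,
)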


We were able to get the Evaluator component working in Vertex AI with TFDF without resorting to a custom component. Before, we were trying to set the container used by Vertex AI for each individual component in the pipeline, but it turns out we can set a custom container as the default for all of the pipeline components through the orchestrator configuration. The KubeflowV2DagRunner accepts a configuration object with which we can set the default image.
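For reference, this is roughly what that looks like (a sketch assuming the tfx.v1.orchestration.experimental API; the image URI and output filename are placeholders):

from tfx import v1 as tfx

runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(
        # Custom image (stock TFX image + TFDF) used as the default
        # for every component in the pipeline.
        default_image="gcr.io/my-project/tfx-with-tfdf:1.6.1",
    ),
    output_filename="pipeline.json",
)
runner.run(pipeline)  # compiles the pipeline spec, ready to submit to Vertex AI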

So all we had to do was create a Dockerfile that extends the public TFX image gcr.io/tfx-oss-public/tfx:1.6.1 and installs TFDF. Then you can push the new image to GCR and use it as the default image for the pipeline. You may still need TFX 1.6.0 or 1.6.1 to avoid creating a custom component; we didn't test this on 1.5.0.
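A sketch of that Dockerfile (the base tag and whether you pin a tensorflow-decision-forests version should match your TFX/TensorFlow versions):

FROM gcr.io/tfx-oss-public/tfx:1.6.1
# Add TensorFlow Decision Forests on top of the stock TFX image.
RUN pip install --no-cache-dir tensorflow-decision-forests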


That’s great to hear, thanks for the update!

Thanks for reporting back @anon43767231, much appreciated!

Hi @aliao thanks for the correction to use the Dataflow runner.

So, the short story is that I recently forked GoogleCloudPlatform/mlops-with-vertex-ai (an end-to-end example of MLOps on Google Cloud using TensorFlow, TFX, and Vertex AI, and the best canonical TFX template I've found so far) and have been modifying parts of it to work with TFDF. I tried your suggestion of using a custom container, but my pipeline stalls for an hour on the very first step (generating the training examples).

The custom image I used is the one generated in the notebook mlops-with-vertex-ai/04-pipeline-deployment.ipynb, in the section 'Build the ML container image'. I modified the Dockerfile to use FROM gcr.io/tfx-oss-public/tfx:1.6.1 and changed requirements.txt so that it includes tensorflow-decision-forests.

Now, if I run the pipeline with the standard BEAM_DATAFLOW_PIPELINE_ARGS (found in tfx_pipelines/config.py):

BEAM_DATAFLOW_PIPELINE_ARGS = [
    f"--project={PROJECT}",
    f"--temp_location={os.path.join(GCS_LOCATION, 'temp')}",
    f"--region={REGION}",
    f"--runner={BEAM_RUNNER}",
]

then the pipeline runs all the components until it gets to the Evaluator, where it fails with this error:
FileNotFoundError: Op type not registered 'SimpleMLLoadModelFromPathWithHandle' in binary running on beamapp-root-0310210431-2-03101304-5ohk-harness-fvnz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

If interested, the logs of the Evaluator component being run are here:

Do you know why I’m running into this error where SimpleMLLoadModelFromPathWithHandle can’t be found?

Hi Ed,

Based on the log you are definitely running a Dataflow job; you can follow my original post to check the import message if you are interested. The last remaining piece is to get the Dataflow job to use your custom image. You can set component-level Beam args to get around the stalling issue: Evaluator(...).with_id(...).with_beam_pipeline_args([...]).
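In other words, something along these lines (a sketch; the image URI and the component's inputs are placeholders):

evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
).with_id("Evaluator").with_beam_pipeline_args([
    "--runner=dataflow",
    "--experiments=use_runner_v2",
    "--sdk_location=container",
    "--sdk_container_image=gcr.io/my-project/tfx-with-tfdf:latest",
    # plus --project, --region, --temp_location as for any Dataflow job
])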

Best,
Alister

Hi @aliao, thanks for the reminder about your original post.

I must be doing something wrong when I specify my own container using Evaluator(...).with_id(...).with_beam_pipeline_args([...]) as you suggested - the job starts but then it eventually quits with this error message:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.

When I compare this job's logs with a pipeline run where I don't specify my own container, the logs are entirely different. For example, among the first logged statements:
Without custom container:
Waiting for training program to start.
I0311 19:18:22.613774 140093240186688 kubeflow_v2_run_executor.py:87] Executor tfx.components.evaluator.executor.Executor do: inputs: {'model': [Artifact(artifact: id: 1992243452189954280
(followed by more TFX-related log statements)

With custom container:
Autoscaling is enabled for job 2022-03-11_09_15_28-18094538296848106040. The number of workers will be between 1 and 1000.
(and no TFX-related log statements at all)

(let me know if it would be helpful to see all the logs for each scenario)

There must be something that’s missing from my custom image - does this ring a bell with anyone?

Does your custom container ever start running? An easy test is to add echo statements in the startup command.

Hi @Robert_Crowe, I don't believe it ever does, as I don't see any TFX logs like I do when the default image gets used. This may be a dumb question, but where would the startup commands for my custom image get placed? The image I'm using is built from the Dockerfile in the repo I've cloned: mlops-with-vertex-ai/Dockerfile at main · GoogleCloudPlatform/mlops-with-vertex-ai · GitHub.
I’m fairly certain I’m not doing this correctly so any suggestions would be appreciated!

Hi @Ed_Park - When you create a container-based component, you specify a command which is run inside the container when it starts. See the “command” in this example:
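For illustration (not the linked example), a minimal sketch of a container-based component, assuming the tfx.dsl.component.experimental.container_component API; the name, image, and command below are placeholders:

from tfx.dsl.component.experimental import container_component, placeholders

# Placeholder component: its command just echoes a message, which is also a
# handy way to confirm that the container actually starts.
say_hello = container_component.create_container_component(
    name="SayHello",
    image="gcr.io/my-project/my-custom-image:latest",  # placeholder image
    parameters={"name": str},
    command=[
        "sh", "-c",
        'echo "Hello, $0 - the container started"',
        placeholders.InputValuePlaceholder("name"),
    ],
)

# Used in a pipeline like any other component:
hello = say_hello(name="world")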

Thanks @Robert_Crowe, it’s close but that’s not quite what I need.

In case someone needs concrete directions on how to create a custom container image for a TFX component to use (like the Evaluator), I'm sharing what I needed to do:

  1. Create a Dockerfile for the custom image that includes all required dependencies (like TFDF):
FROM python:3.7-slim

RUN apt-get update -q \
  && apt-get install --no-install-recommends -qy \
  gcc g++

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.37.0 tensorflow-decision-forests tensorflow-model-analysis tensorflow-data-validation

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.37.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

  2. Build the image using Cloud Build:
gcloud builds submit . --tag gcr.io/a_location/ml-dataflow-image:latest

(reference: Use custom containers in Dataflow  |  Cloud Dataflow  |  Google Cloud)

  3. Configure the TFX component to use this image:
evaluator = Evaluator(
        examples=train_example_gen.outputs["examples"],
        example_splits=["test"],
        model=trainer.outputs["model"],
        baseline_model=baseline_model_resolver.outputs["model"],
        eval_config=eval_config,
        schema=schema_importer.outputs["result"],
    ).with_id("ModelEvaluator").with_beam_pipeline_args(
        config.BEAM_DATAFLOW_PIPELINE_ARGS
        + [
            "--sdk_location=container",
            "--experiments=use_runner_v2",
            f"--sdk_container_image={config.CUSTOM_DATAFLOW_IMAGE_URI}",
        ]
    )

where CUSTOM_DATAFLOW_IMAGE_URI is the gcr.io URI you defined in step 2.

Now when your TFX pipeline is run, Beam will use the container image you specified to run the component.

Unfortunately, I’m still running into a problem I’m not sure how to solve. When Beam runs my Evaluator I get this error:

File "/opt/conda/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1638, in <lambda>
wrapper = lambda x: [fn(x)]
File "/opt/conda/lib/python3.7/site-packages/tensorflow_model_analysis/extractors/features_extractor.py", line 124, in extract_features
# instance dict format.
NameError: name '_DropUnsupportedColumnsAndFetchRawDataColumn' is not defined [while running 'ExtractEvaluateAndWriteResults/ExtractAndEvaluate/ExtractFeatures/ExtractFeatures-ptransform-171']

However, I do see a function in the TFMA library called ‘_drop_unsupported_columns_and_fetch_raw_data_column’ defined here:

Does anyone know how I can get Beam to find the function properly?

I can’t tell from this how you’re referencing _drop_unsupported_columns_and_fetch_raw_data_column. Could you post more of your code? Does it work correctly with the local runner?

Could you post more of your code?

I’ve forked the MLOps with Vertex AI project and only made minor changes to it; the core pipeline code is essentially the same (apart from adding the code to use a custom container image): mlops-with-vertex-ai/training_pipeline.py at main · GoogleCloudPlatform/mlops-with-vertex-ai · GitHub

Does it work correctly with the local runner?

Yes, it does.