Kernel crashed using tfx and tensorflow-macos on intel mac

I got a "Kernel crashed while executing code" error while trying out the penguin_trainer example using
tensorflow-macos, tensorflow-metal and tfx on an Intel Mac:

tensorflow-macos==2.9.2
tensorflow-metal==0.6.0
tfx==1.10.0

Does tfx work with tensorflow-macos and tensorflow-metal? Any hints are much appreciated.

Thank you for the report!
I have no experience with tensorflow-macos and tensorflow-metal, but could you share more context about the failure? For example, the exact step where the failure happens or the related failure trace would help to reproduce the issue. (I think it might not be related, but the OS version and Python version would also be helpful!)

@jiyongjung0 Thanks for looking into this, and hopefully the following info would be helpful.
My OS version: macOS Monterey 12.6.1
Python version: 3.9.15

This happens at the step:

    tfx.orchestration.LocalDagRunner().run(
        _create_pipeline(
            pipeline_name=PIPELINE_NAME,
            pipeline_root=PIPELINE_ROOT,
            data_root=DATA_ROOT,
            module_file=_trainer_module_file,
            serving_model_dir=SERVING_MODEL_DIR,
            metadata_path=METADATA_PATH))

It seems to me that model.compile() wasn’t successful, and no model.summary() output was visible in the trace.

I only got a bus error; please see the trace below.
I am new to tfx and am currently testing it on an Intel-based Mac with a GPU.

execution_options {
  caching_options {
  }
}
, pipeline_info=id: "penguin-simple"
, pipeline_run_id='2022-11-16T23:59:29.131649')
DEBUG:absl:Starting GenericExecutor execution.
DEBUG:absl:Inputs for GenericExecutor are: {"examples": [{"artifact": {"id": "3", "type_id": "15", "uri": "pipelines/penguin-simple/CsvExampleGen/examples/5", "properties": {"split_names": {"string_value": "[\"train\", \"eval\"]"}}, "custom_properties": {"input_fingerprint": {"string_value": "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1668639568,sum_checksum:1668639568"}, "tfx_version": {"string_value": "1.10.0"}, "span": {"int_value": "0"}, "name": {"string_value": "penguin-simple:2022-11-16T23:59:29.131649:CsvExampleGen:5:examples:0"}, "file_format": {"string_value": "tfrecords_gzip"}, "payload_format": {"string_value": "FORMAT_TF_EXAMPLE"}}, "state": "LIVE", "name": "penguin-simple:2022-11-16T23:59:29.131649:CsvExampleGen:5:examples:0", "create_time_since_epoch": "1668639570137", "last_update_time_since_epoch": "1668639570137"}, "artifact_type": {"id": "15", "name": "Examples", "properties": {"version": "INT", "span": "INT", "split_names": "STRING"}, "base_type": "DATASET"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "Examples"}]}
DEBUG:absl:Outputs for GenericExecutor are: {"model": [{"artifact": {"uri": "pipelines/penguin-simple/Trainer/model/6", "custom_properties": {"name": {"string_value": "penguin-simple:2022-11-16T23:59:29.131649:Trainer:6:model:0"}}, "name": "penguin-simple:2022-11-16T23:59:29.131649:Trainer:6:model:0"}, "artifact_type": {"name": "Model", "base_type": "MODEL"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "Model"}], "model_run": [{"artifact": {"uri": "pipelines/penguin-simple/Trainer/model_run/6", "custom_properties": {"name": {"string_value": "penguin-simple:2022-11-16T23:59:29.131649:Trainer:6:model_run:0"}}, "name": "penguin-simple:2022-11-16T23:59:29.131649:Trainer:6:model_run:0"}, "artifact_type": {"name": "ModelRun"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "ModelRun"}]}
DEBUG:absl:Execution properties for GenericExecutor are: {"module_path": "penguin_trainer@pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl", "eval_args": "{\n  \"num_steps\": 5\n}", "train_args": "{\n  \"num_steps\": 100\n}", "custom_config": "null"}
INFO:absl:Train on the 'train' split when train_args.splits is not set.
INFO:absl:Evaluate on the 'eval' split when eval_args.splits is not set.
INFO:absl:udf_utils.get_fn {'module_path': 'penguin_trainer@pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl', 'eval_args': '{\n  "num_steps": 5\n}', 'train_args': '{\n  "num_steps": 100\n}', 'custom_config': 'null'} 'run_fn'
INFO:absl:Installing 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl' to a temporary directory.
INFO:absl:Executing: ['/Users/yingding/VENV/tfx3.9/bin/python3', '-m', 'pip', 'install', '--target', '/var/folders/lf/c7xnmwnx17330xrvry7lq7j80000gp/T/tmp_nk5gsxg', 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl']
Processing ./pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl
Installing collected packages: tfx-user-code-Trainer
Successfully installed tfx-user-code-Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba
INFO:absl:Successfully installed 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+408b8b81d7a07b404b37eb24f0fa68fa625592aac326d0c18f4a1b27c4c68eba-py3-none-any.whl'.
INFO:absl:Training model.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
2022-11-16 23:59:32.140959: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-11-16 23:59:32.141002: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
zsh: bus error  /Users/yingding/VENV/tfx3.9/bin/python3 
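
A bus error kills the interpreter without a Python traceback, which is why the log above simply stops. One way to learn which Python call was running at that moment is the standard-library `faulthandler` module, which dumps the Python stack when the process receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL. The sketch below is only an assumption about how one might wire it in before starting the pipeline; the helper name `enable_crash_traceback` and the log file name are hypothetical, and the dump shows Python frames only, not the native TensorFlow or Metal frames.

```python
import faulthandler


def enable_crash_traceback(log_path="faulthandler.log"):
    """Write the Python stack of all threads to log_path on a fatal signal.

    Hypothetical helper for illustration. The returned file object must stay
    open for as long as faulthandler is enabled, because faulthandler writes
    to its file descriptor when the signal arrives.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage sketch: call this once, before LocalDagRunner().run(...).
# crash_log = enable_crash_traceback("trainer_crash.log")
```

Writing to a file rather than stderr can help in a notebook, where the kernel's stderr is often lost when the kernel crashes.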

Thank you for sharing. I don’t have a suitable device to reproduce the error, but it seems to fail during model training, as you mentioned. TFX doesn’t have any control over the model training code itself, so I’m afraid that I cannot give much help from the TFX side.

@jiyongjung0 Thanks for your help. Currently, tfx doesn’t seem to work with tensorflow-macos and tensorflow-metal on Apple silicon Macs. I would be enthusiastic to try out tfx with tensorflow-macos using the GPU on an Intel Mac.

Since I am using the default tfx intro penguin example, I doubt it is an issue with the training code itself. Can you point me to some info on how tfx invokes trainer.py to trigger the training? Something might have gone wrong with tensorflow-macos.

If the tfx team still wants to get to the bottom of it, I would be glad to run some tests on my device. It would probably be better to spend the energy on making tfx work with Apple silicon and tensorflow-macos. Please let me know if there is anything I can help with.

@jiyongjung0 I think I know what happens in my case on an Intel-based MacBook using tfx.
This time I tested only tensorflow-macos, without tensorflow-metal, to use just the CPU.

Working

python3 -m pip install "tfx==1.10.0" "tensorflow-macos>=2.9.0"

and I found the following dependency packages installed by tfx==1.10.0:

tensorflow                      2.9.3
tensorflow-data-validation      1.10.0
tensorflow-estimator            2.9.0
tensorflow-hub                  0.12.0
tensorflow-io-gcs-filesystem    0.27.0
tensorflow-macos                2.9.2
tensorflow-metadata             1.10.0
tensorflow-model-analysis       0.41.1
tensorflow-serving-api          2.9.2
tensorflow-transform            1.10.1

Basically, tensorflow==2.9.3 is required by tfx as the training backend. So I end up with two working tensorflow binaries on an Intel-based Mac: tensorflow and tensorflow-macos.

When I run the tfx penguin example with tensorflow==2.9.3 and not tensorflow-macos, the training goes through.

Please correct me if I am wrong: tfx depends on tensorflow and cannot work with tensorflow-macos with the GPU activated through tensorflow-metal, hence the bus error when I install tensorflow-macos with tensorflow-metal.

Not Working

python3 -m pip install "tfx==1.10.0" "tensorflow-macos>=2.9.0" "tensorflow-metal>=0.5.0"

packages installed are:

tensorflow                      2.9.3
tensorflow-data-validation      1.10.0
tensorflow-estimator            2.9.0
tensorflow-hub                  0.12.0
tensorflow-io-gcs-filesystem    0.27.0
tensorflow-macos                2.9.2
tensorflow-metadata             1.10.0
tensorflow-metal                0.6.0
tensorflow-model-analysis       0.41.1
tensorflow-serving-api          2.9.2
tensorflow-transform            1.10.1

Would it be possible to make tfx work with the tensorflow-macos binary without the tensorflow binary? I think the issue is that tfx doesn’t support tensorflow-macos.
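
Both package lists above contain tensorflow and tensorflow-macos side by side. A quick way to check for that situation in any environment is to query the installed distributions with the standard-library `importlib.metadata`. This is only a rough sketch: the function names are made up for illustration, and treating the pair as "overlapping" rests on the assumption that both distributions install files into the same top-level `tensorflow` package.

```python
from importlib import metadata

# Distribution names taken from the package lists in this thread.
TF_DISTRIBUTIONS = ("tensorflow", "tensorflow-macos", "tensorflow-metal")


def installed_tf_distributions(names=TF_DISTRIBUTIONS):
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return found


def has_overlapping_tf(found):
    """True if both tensorflow and tensorflow-macos are installed together."""
    return "tensorflow" in found and "tensorflow-macos" in found


if __name__ == "__main__":
    found = installed_tf_distributions()
    print(found)
    if has_overlapping_tf(found):
        print("tensorflow and tensorflow-macos are both installed")
```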

Thank you for sharing your findings! It seems like ‘tensorflow-macos’ overwrites the content of ‘tensorflow’, which might cause a problem.

You can manually work around it by installing TFX first (with tensorflow), then removing tensorflow and re-installing tensorflow-macos.

tensorflow is required to use TFX, and there is no way to specify this kind of optional dependency on ‘tensorflow-macos’. Please let me know if you are aware of any way to support it.
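
The manual workaround described above (install TFX, remove tensorflow, re-install tensorflow-macos) could be scripted roughly as follows. This is an untested sketch, not a confirmed fix: the helper names are hypothetical, the pinned versions are simply the ones reported earlier in the thread, and using `--force-reinstall --no-deps` for the last step is an assumption about how to restore the files removed by the uninstall without touching other packages. Passing the arguments as a list also means no shell is involved, so specifiers containing `>` need no quoting.

```python
import subprocess
import sys


def workaround_commands(tfx_spec="tfx==1.10.0",
                        tf_macos_spec="tensorflow-macos==2.9.2"):
    """Build the three pip invocations for the workaround (illustrative only)."""
    pip = [sys.executable, "-m", "pip"]
    return [
        # 1. Install TFX, which pulls in the regular tensorflow distribution.
        pip + ["install", tfx_spec],
        # 2. Remove the regular tensorflow distribution.
        pip + ["uninstall", "--yes", "tensorflow"],
        # 3. Re-install tensorflow-macos so its files are restored (assumption:
        #    --no-deps keeps the dependencies resolved in step 1 untouched).
        pip + ["install", "--force-reinstall", "--no-deps", tf_macos_spec],
    ]


def apply_workaround(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in workaround_commands():
        print(" ".join(cmd))  # print only; call apply_workaround() to execute
```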

With the new tfx==1.13.0, I am actually able to get TFX working on an Intel Mac with the GPU.

To my surprise, both of the following setups allow GPU-accelerated training.

The tensorflow binary and tensorflow-metal

tensorflow==2.12.0
tensorflow-metal==0.8.0
tfx==1.13.0
pyarrow==6.0.0

tensorflow-macos binary with tensorflow-metal

tensorflow-macos==2.12.0
tensorflow==2.12.0
tensorflow-metal==0.8.0
tfx==1.13.0
pyarrow==6.0.0