TF-DF training segmentation fault

Hi,
I’m trying to run TF-DF on SageMaker and keep getting a segmentation fault. I tried different versions and got the same error each time.
tensorflow_decision_forests==1.2.0

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1051] 4096 examples used for training and 4096 examples used for validation

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1195] Resume the GBT training from tree #246

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[INFO abstract_model.cc:1248] Engine “GradientBoostedTreesQuickScorerExtended” built

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] *** Process received signal ***

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] Signal: Segmentation fault (11)

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] Signal code: Address not mapped (1)

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] Failing at address: 0x55a5ce05a478

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 0] [1,mpirank:0,algo-1]:/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fa0ca620420]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 1]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so(+0x762f20)[0x7fa028321f20]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 2]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so(_ZN26yggdrasil_decision_forests7serving15decision_forest7PredictINS1_59GradientBoostedTreesBinaryClassificationQuickScorerExtendedEEEvRKT_RKNS4_10ExampleSetEiPSt6vectorIfSaIfEE+0x30)[0x7fa028322130]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 3]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so(_ZN26yggdrasil_decision_forests5model22gradient_boosted_trees8internal18ComputePredictionsEPKNS1_25GradientBoostedTreesModelEPKNS_7serving10FastEngineERKSt6vectorIPNS0_13decision_tree12DecisionTreeESaISD_EERKNS2_24AllTrainingConfigurationERKNS_7dataset15VerticalDatasetEPSA_IfSaIfEE+0x14a)[0x7fa0281f73ea]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 4]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so(_ZNK26yggdrasil_decision_forests5model22gradient_boosted_trees27GradientBoostedTreesLearner15TrainWithStatusERKNS_7dataset15VerticalDatasetEN4absl12lts_202111028optionalISt17reference_wrapperIS5_EEE+0x278b)[0x7fa02820a0bb]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 5]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so(_ZN27tensorflow_decision_forests3ops20SimpleMLModelTrainer7ComputeEPN10tensorflow15OpKernelContextE+0x855)[0x7fa0281595a5]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 6]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7fa096f0f69b]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [1,mpirank:0,algo-1]:[ 7]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow17KernelAndDeviceOp3RunEPNS_19ScopedStepContainerERKNS_15EagerKernelArgsEPSt6vectorIN4absl12lts_202111027variantIJNS_6TensorENS_11TensorShapeEEEESaISC_EEPNS_19CancellationManagerERKNS8_8optionalINS_19EagerFunctionParamsEEERKNSI_INS_17ManagedStackTraceEEEPNS_24CoordinationServiceAgentE+0x9c7)[0x7fa0a514ce47]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [ 8]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18EagerKernelExecuteEPNS_12EagerContextERKN4absl12lts_2021110213InlinedVectorIPNS_12TensorHandleELm4ESaIS6_EEERKNS3_8optionalINS_19EagerFunctionParamsEEERKSt10unique_ptrINS_15KernelAndDeviceENS_4core15RefCountDeleterEEPNS_14GraphCollectorEPNS_19CancellationManagerENS3_4SpanIS6_EERKNSB_INS_17ManagedStackTraceEEE+0x289)[0x7fa09dc17149]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [1,mpirank:0,algo-1]:[ 9]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow11ExecuteNode3RunEv+0x1c9)[0x7fa09dc18509]

2023-05-29T13:56:30.882+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [1,mpirank:0,algo-1]:[10]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow13EagerExecutor11SyncExecuteEPNS_9EagerNodeE+0x410)[0x7fa0a5778ba0]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [11]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x59e55c6)[0x7fa09dc125c6]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [12]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi+0x254)[0x7fa09dc12c34]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [13]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow14EagerOperation7ExecuteEN4absl12lts_202111024SpanIPNS_20AbstractTensorHandleEEEPi+0x200)[0x7fa09d958e50]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [14]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow21CustomDeviceOpHandler7ExecuteEPNS_27ImmediateExecutionOperationEPPNS_30ImmediateExecutionTensorHandleEPi+0x5da)[0x7fa0a5158d7a]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [1,mpirank:0,algo-1]:[15]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_Execute+0x66)[0x7fa09d0a95b6]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [16]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z24TFE_Py_FastPathExecute_CP7_object+0x25a7)[0x7fa09cd55427]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [1,mpirank:0,algo-1]:[17] /usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tfe.so(+0x69ea7)[0x7fa06b248ea7]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [18] [1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/tensorflow/python/_pywrap_tfe.so(+0x9529f)[0x7fa06b27429f]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [19]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(+0x227513)[0x55a5bfd18513]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [20] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyObject_MakeTpCall+0x8c)[0x55a5bfb64e2c]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [21] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyEval_EvalFrameDefault+0x7fb8)[0x55a5bfb55cd8]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [22] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(+0x1276aa)[0x55a5bfc186aa]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [23]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyFunction_Vectorcall+0x97)[0x55a5bfb65e77]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [24] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(PyVectorcall_Call+0xc6)[0x55a5bfb658a6]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [25]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyEval_EvalFrameDefault+0x20b4)[0x55a5bfb4fdd4]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [26] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(+0x1276aa)[0x55a5bfc186aa]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [27] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyFunction_Vectorcall+0x97)[0x55a5bfb65e77]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [28] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(_PyEval_EvalFrameDefault+0x603a)[0x55a5bfb53d5a]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] [29] [1,mpirank:0,algo-1]:/usr/local/bin/python3.9(+0x1276aa)[0x55a5bfc186aa]

2023-05-29T13:56:30.883+03:00 [1,mpirank:0,algo-1]:[algo-1:00091] *** End of error message ***
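
A note for readers of the trace above: the frames name mangled C++ symbols, which are easier to read once extracted and demangled. The sketch below is one way to do that and is not part of the original post. The regular expression assumes the `library(symbol+0xoffset)[0xaddress]` frame layout seen in this log, `extract_frames` and `demangle` are hypothetical helper names, and demangling assumes the binutils `c++filt` tool is on the PATH (if it is missing, symbols are returned unchanged).

```python
import re
import shutil
import subprocess

# Assumed frame layout, taken from the log above:
#   /path/to/lib.so(mangled_symbol+0xoffset)[0xaddress]
FRAME_RE = re.compile(
    r"(?P<lib>/[^\s()]+)\((?P<sym>[^()+]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]"
)


def extract_frames(log_text):
    """Return (library file name, symbol) pairs; symbol is '' for stripped frames."""
    return [
        (match["lib"].rsplit("/", 1)[-1], match["sym"])
        for match in FRAME_RE.finditer(log_text)
    ]


def demangle(symbol):
    """Demangle a C++ symbol with c++filt if available, else return it unchanged."""
    tool = shutil.which("c++filt")
    if not symbol or tool is None:
        return symbol
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or symbol
```

Applied to this log, the frames nearest the crash all come from `training.so`, and their symbols mention `QuickScorerExtended`, `ComputePredictions`, and `TrainWithStatus`.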

Hi Ahad, welcome to the TF Forum!

Can you share a code snippet of what you are trying, so we can take a look and understand what’s wrong?
Are you using a public dataset? If so, which one?

Thanks,

Gus

Hi, building on what Gus said, any configuration details or flags would be useful for debugging this. In particular, it looks like you’re resuming the training of an existing GBT. Can you tell us whether this is intentional?
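
As a rough illustration of the kind of configuration details that help here, the sketch below collects package versions and checks for an MPI launcher, which the `mpirank` prefix in the log suggests is in use. `debug_report` is a hypothetical helper, the package list is an assumption about what is relevant, and the `OMPI_COMM_WORLD_SIZE` check assumes the job is launched by Open MPI.

```python
import os
import platform
from importlib import metadata

# Assumed list of packages worth reporting for this kind of crash.
PACKAGES = ("tensorflow", "tensorflow_decision_forests", "horovod")


def debug_report(packages=PACKAGES, env=None):
    """Collect version and launcher details to paste into a bug report."""
    env = os.environ if env is None else env
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    # Open MPI exports this variable to every process it launches.
    report["mpi_world_size"] = env.get("OMPI_COMM_WORLD_SIZE", "not set")
    return report
```

Printing the dictionary returned by `debug_report()` from inside the training container gives a compact summary to post alongside the log.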


Hi Gus and Richard,

Thanks for your quick reply.

I didn’t add a code snippet because it’s an extensive code repo and I hadn’t prepared a simpler example yet (that was my next step). The dataset is private.
I managed to solve it: the code was running with Horovod, and removing Horovod solved my problem.
Resuming an existing GBT wasn’t intentional; as I recall, I got a different error related to that and removed it. Reading the documentation, I saw that there is a fallback in case no checkpoint exists:

“…If temp_directory does not contain any model checkpoint, start the training from the start.” However, my run didn’t behave accordingly.
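
For reference, a minimal sketch of how the resume behaviour can be made explicit. It assumes the `try_resume_training` and `temp_directory` constructor arguments of `tfdf.keras.GradientBoostedTreesModel`; `gbt_training_kwargs` and `build_model` are hypothetical helpers, and using a fresh empty directory for non-resuming runs is a precaution chosen here, not something the documentation requires.

```python
import tempfile
from pathlib import Path


def gbt_training_kwargs(resume, checkpoint_dir=None):
    """Build keyword arguments that make the resume behaviour explicit."""
    if resume:
        if checkpoint_dir is None:
            raise ValueError("resuming needs a stable checkpoint_dir")
        Path(checkpoint_dir).mkdir(parents=True, exist_ok=True)
        return {
            "try_resume_training": True,
            "temp_directory": str(checkpoint_dir),
        }
    # Fresh run: a new, empty directory, so no old snapshot can be picked up.
    return {
        "try_resume_training": False,
        "temp_directory": tempfile.mkdtemp(prefix="tfdf_"),
    }


def build_model(**kwargs):
    """Create the model; TF-DF is imported lazily so the helper above stays stdlib-only."""
    import tensorflow_decision_forests as tfdf

    return tfdf.keras.GradientBoostedTreesModel(**kwargs)
```

A call such as `build_model(**gbt_training_kwargs(resume=False))` starts from scratch, while passing `resume=True` together with a persistent directory opts in to resuming.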

Thanks


I’m glad it’s fixed now!
Thanks for letting us know!