What is the best approach to debug in TensorFlow when working with the C++ code base?

Crefeda_Rodrigues · May 6, 2021, 8:25am

In the past, I’ve had trouble debugging TensorFlow with gdb when the problem was somewhere in the C++ code base. The issues included debug builds (using -O0) being too large and running out of space, long recompile times, etc. Does anyone have recommendations for debugging TensorFlow?

Bhack · May 10, 2021, 11:55pm

I think some of these problems are well known. For a recent example, you can follow this issue:

github.com/tensorflow/tensorflow

cannot build TensorFLow with --config=dbg

opened 07:49PM - 05 May 21 UTC

closed 11:27PM - 22 Jun 21 UTC

bas-aarts

stat:awaiting tensorflower type:build/install subtype: ubuntu/linux subtype:bazel

when building opensource TensorFlow with
bazel build --config=dbg --config=…cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
(for SM 7.0 only)
The build dies at link time with:

`ERROR: /home/baarts/tensorflow-GH/tensorflow/python/BUILD:3373:24: Linking of rule '//tensorflow/python:_pywrap_tensorflow_internal.so' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc @bazel-out/k8-dbg/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(AnnotationRemarks.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(BDCE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(CallSiteSplitting.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(ConstantHoisting.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(ConstraintElimination.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(CorrelatedValuePropagation.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DCE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DeadStoreElimination.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(DivRemPairs.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(EarlyCSE.pic.o):(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
bazel-out/k8-dbg/bin/external/llvm-project/llvm/libScalar.a(FlattenCFGPass.pic.o):(.debug_aranges+0x6): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status`

Adding -mcmodel=large makes no difference, as the overflow is in a debug section.
I tried -gdwarf64 which is not supported by gcc

some platform info:
```
root@7fe23091cb5b:/opt/tensorflow# gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@7fe23091cb5b:/opt/tensorflow# uname -a
Linux 7fe23091cb5b 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```

At the end of the ticket there is a draft proposal, so if you have something technical to share about your experience, please leave a comment there.

Crefeda_Rodrigues · May 18, 2021, 2:27pm

Thanks, that does contain some useful info.

mihaimaruseac · May 19, 2021, 3:15pm

I found that the best way to debug is printf-debugging, without checking out the head of the repository again (because that would result in long compile times again).
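
To illustrate what printf-debugging can look like, here is a minimal sketch in plain C++. The `DBG` macro, the `DebugLine` helper, and the `ScaleAndClamp` function are all hypothetical names invented for this example; none of them come from TensorFlow. The idea is only to print the source location, the expression text, and its value to stderr.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper (not a TensorFlow API): builds "[file:line] name = value".
template <typename T>
std::string DebugLine(const char* file, int line, const char* name,
                      const T& value) {
  std::ostringstream out;
  out << "[" << file << ":" << line << "] " << name << " = " << value;
  return out.str();
}

// Prints the expression text and its current value to stderr.
// Works for any type that supports operator<< on std::ostream.
#define DBG(expr)                                                      \
  std::fprintf(stderr, "%s\n",                                         \
               DebugLine(__FILE__, __LINE__, #expr, (expr)).c_str())

// Hypothetical function under investigation.
int ScaleAndClamp(int x, int factor, int limit) {
  int scaled = x * factor;
  DBG(scaled);  // temporary trace; remove once the bug is found
  return scaled > limit ? limit : scaled;
}
```

Because a trace like this touches a single source file, an incremental rebuild should only need to recompile that file and relink, which is presumably why it stays cheap as long as the checkout is not moved.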

If possible, building with ASan (AddressSanitizer) also helps. The OSS-Fuzz Docker container allows that.
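
As a rough illustration of the kind of bug ASan catches, here is a small standalone C++ sketch. `SumFirst` and its `inclusive_bug` parameter are hypothetical and unrelated to TensorFlow's code. Compiled with GCC or Clang using `-fsanitize=address -g`, the buggy path would typically be reported as a heap-buffer-overflow with a stack trace, whereas a normal build may silently return a wrong value.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sums the first n elements of v.
// With inclusive_bug = true the loop runs one iteration too many, so when
// n == v.size() it reads one int past the end of the heap allocation.
int SumFirst(const std::vector<int>& v, std::size_t n,
             bool inclusive_bug = false) {
  const int* data = v.data();
  const std::size_t end = inclusive_bug ? n + 1 : n;
  int total = 0;
  for (std::size_t i = 0; i < end; ++i) {
    total += data[i];  // out-of-bounds read when i == v.size()
  }
  return total;
}

// Correct usage:  SumFirst(v, v.size())        reads only valid elements.
// Buggy usage:    SumFirst(v, v.size(), true)  is undefined behavior; an
// ASan-instrumented build is expected to abort here and print a report.
```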