[Pluggable Device] Help with a segfault on a minimal example

Hello,

I am trying to create a dummy pluggable device following the community tutorial.
I have implemented a minimal stream executor (it works by calling malloc) and kernels for:

  • Conv2D
  • AssignVariableOp
  • ReadVariableOp

I am now trying to run a simple test script:

import tensorflow as tf

input_shape = (4, 28, 28, 3)
x = tf.random.uniform(input_shape)
y = tf.keras.layers.Conv2D(12, 3, use_bias=False)(x)

I am getting a segfault. This is what the backtrace looks like:

(gdb) bt
#0  memcmp () at ../sysdeps/aarch64/memcmp.S:53
#1  0x0000ffff95e05050 in tensorflow::internal::ValidateDevice(tensorflow::OpKernelContext*, tensorflow::ResourceHandle const&) ()
   from /home/ubuntu/python3-venv/tensorflow/lib/libtensorflow_framework.so.2
#2  0x0000ffff95e08a64 in tensorflow::DeleteResource(tensorflow::OpKernelContext*, tensorflow::ResourceHandle const&) ()
   from /home/ubuntu/python3-venv/tensorflow/lib/libtensorflow_framework.so.2
#3  0x0000ffff9cc0f640 in tensorflow::DestroyResourceOp::Compute(tensorflow::OpKernelContext*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x0000ffff9a4b0768 in tensorflow::PluggableDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x0000ffffa0bb818c in tensorflow::KernelAndDeviceOp::Run(tensorflow::ScopedStepContainer*, tensorflow::EagerKernelArgs const&, std::vector<absl::lts_20210324::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::allocator<absl::lts_20210324::variant<tensorflow::Tensor, tensorflow::TensorShape> > >*, tensorflow::CancellationManager*, absl::lts_20210324::optional<tensorflow::EagerFunctionParams> const&, absl::lts_20210324::optional<tensorflow::ManagedStackTrace> const&, tensorflow::CoordinationServiceAgent*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x0000ffff9b497be0 in tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::lts_20210324::InlinedVector<tensorflow::TensorHandle*, 4ul, std::allocator<tensorflow::TensorHandle*> > const&, absl::lts_20210324::optional<tensorflow::EagerFunctionParams> const&, std::unique_ptr<tensorflow::KernelAndDevice, tensorflow::core::RefCountDeleter> const&, tensorflow::GraphCollector*, tensorflow::CancellationManager*, absl::lts_20210324::Span<tensorflow::TensorHandle*>, absl::lts_20210324::optional<tensorflow::ManagedStackTrace> const&) () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x0000ffff9b498d44 in tensorflow::ExecuteNode::Run() () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x0000ffffa0ff391c in tensorflow::EagerExecutor::SyncExecute(tensorflow::EagerNode*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x0000ffff9b4952c0 in tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x0000ffff9b49598c in tensorflow::EagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x0000ffff9b1eb49c in tensorflow::EagerOperation::Execute(absl::lts_20210324::Span<tensorflow::AbstractTensorHandle*>, int*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x0000ffffa0bc44d8 in tensorflow::CustomDeviceOpHandler::Execute(tensorflow::ImmediateExecutionOperation*, tensorflow::ImmediateExecutionTensorHandle**, int*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x0000ffff9af919bc in TFE_Execute () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x0000ffff9aba1cb4 in TFE_Py_FastPathExecute_C(_object*) () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x0000ffff93362214 in pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module_&)::{lambda(pybind11::args)#52}, pybind11::object, pybind11::args, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init__pywrap_tfe(pybind11::module_&)::{lambda(pybind11::args)#52}&&, pybind11::object (*)(pybind11::args), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so
#16 0x0000ffff933960bc in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so

For more context: I have debug prints throughout the code to see what’s going on, and the segfault seems to occur after a call to the stream executor’s deallocate function has completed.
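Alongside the gdb backtrace, it can help to know which Python line was executing when the crash happened. The sketch below uses the standard-library faulthandler module to write the Python traceback of all threads to a file when a fatal signal such as SIGSEGV arrives. The helper name enable_crash_tracebacks and the log file name are made up for illustration; only faulthandler.enable is the real API.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracebacks(log_path):
    """Dump the Python traceback of all threads to log_path on a fatal signal."""
    # The file must stay open for the life of the process, because
    # faulthandler writes to its file descriptor from the signal handler.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path works.
crash_log = enable_crash_tracebacks(
    os.path.join(tempfile.gettempdir(), "tf_plugin_crash.log")
)

# The Conv2D test script would follow here, after the handler is installed.
```

If the Python traceback ends inside an object destructor rather than on the Conv2D call itself, that would be consistent with the DestroyResourceOp frame in the backtrace, although this is only a guess without the full code.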

I looked at the source of tensorflow::internal::ValidateDevice, and the memcmp compares:

  • ctx->device()->attributes().name() (OpKernelContext* ctx)
  • p.device() (ResourceHandle& p)

So I assume one of these points to invalid memory. I don’t know which one or where it comes from, and I am struggling to find what in my code is causing it.
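To make the reasoning explicit, here is a simplified Python model of the check described above. It is not the TensorFlow implementation: the class, the function, and the device strings are stand-ins, and the only thing taken from the post is that the device name of the kernel context is compared with the device stored in the resource handle.

```python
from dataclasses import dataclass


@dataclass
class ResourceHandle:
    """Stand-in for tensorflow::ResourceHandle (only the fields used here)."""
    device: str
    name: str


def validate_device(ctx_device_name: str, handle: ResourceHandle) -> None:
    """Model of the check: the two device strings must be equal."""
    if ctx_device_name != handle.device:
        raise ValueError(
            f"Trying to access resource {handle.name!r} located on device "
            f"{handle.device!r} from device {ctx_device_name!r}"
        )


# Hypothetical device name, for illustration only.
device = "/job:localhost/replica:0/task:0/device:MY_DEVICE:0"
handle = ResourceHandle(device=device, name="conv2d/kernel")
validate_device(device, handle)  # equal strings: the check passes silently
```

In this model a plain mismatch produces an ordinary error, not a crash. A segfault inside the comparison therefore suggests that one of the two strings lives in memory that was never initialized or has already been freed, for example a handle whose backing buffer was released or overwritten.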

I am not posting the whole code here because, despite being minimal, it is fairly long, but I can post more of it on request.

Thank you in advance for your help.

PS: I particularly struggle to understand how the AssignVariableOp and ReadVariableOp kernels should be implemented, and I believe the problem comes from there.

I solved the issue: the bug was in my own code and was not related to the TensorFlow code.