How does AssignVarOp work?

Hello,

I am experimenting with implementing a PluggableDevice and I am struggling to understand how the AssignVariableOp kernel should work.

With AssignVariableOp, the OpKernelContext has two inputs but expects no output.
The first input is the variable, the second input is the value.
To me this is analogous to a C++ function with a signature like AssignVarOp(TF_Tensor &variable, TF_Tensor value) (this is not meant to represent the actual C++ implementation in TensorFlow, just an analogy in which the first input is a reference that gets modified using the value carried by the second input).

The value input is a well-formed tensor, with well-defined dimensions and data; the variable input, however, doesn't seem well defined (no dimensions, etc.).
There are no functions in the C API to modify a tensor object, nor to (re)set an input tensor.
It is therefore unclear to me how to assign the value to the variable just by accessing those two tensors.
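
For reference, here is roughly what the kernel's compute function sees through the C API (a sketch against tensorflow/c/kernels.h; the function name is made up for illustration):

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

void MyAssignVarCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();

  // Input 0: the variable, a DT_RESOURCE tensor holding a handle.
  TF_Tensor* variable = nullptr;
  TF_GetInput(ctx, 0, &variable, status);

  // Input 1: the value to assign, a well-formed dense tensor.
  TF_Tensor* value = nullptr;
  TF_GetInput(ctx, 1, &value, status);

  // There is no C API call here to mutate `variable` in place, which is
  // exactly the part that is unclear to me.

  TF_DeleteTensor(variable);
  TF_DeleteTensor(value);
  TF_DeleteStatus(status);
}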

Thank you for your interest.

It looks like I should implement VarHandleOp to create a ResourceHandle to store in an output tensor with datatype TF_RESOURCE.
Is the ResourceHandle definition arbitrary on the plugin side?
I tried making my own but I am running into problems with it. Also, the attributes passed at the creation of the ResourceHandle don't seem initialised (see the sketch after this list for how I read them):

  • container is an empty string
  • shared_name has a default value
  • shape is not initialised (negative size, no shape)
  • allowed_devices (deprecated) doesn’t contain anything

(I got the list & types of attributes by looking at the tf.raw_ops.VarHandleOp docs (TensorFlow Core v2.8.0); finding the information feels like a game of hide & seek.)
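
Here is roughly how I read those attributes in my create function (a sketch against tensorflow/c/kernels.h; the function name is mine):

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"

void* MyVarHandleCreate(TF_OpKernelConstruction* ctx) {
  TF_Status* status = TF_NewStatus();

  // Both of these come back empty/default for me.
  char container[256] = {0};
  TF_OpKernelConstruction_GetAttrString(ctx, "container", container,
                                        sizeof(container), status);
  char shared_name[256] = {0};
  TF_OpKernelConstruction_GetAttrString(ctx, "shared_name", shared_name,
                                        sizeof(shared_name), status);

  // dtype comes through fine; shape is the part that looks uninitialised.
  TF_DataType dtype;
  TF_OpKernelConstruction_GetAttrType(ctx, "dtype", &dtype, status);

  TF_DeleteStatus(status);
  return nullptr;  // no per-kernel state in this sketch
}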

I keep inspecting the sources for information, but this is hardly documented for plugins.
I would be really thankful to anyone who could help me understand the variable-handling logic for plugins.

Hi @slai-nick,

Sorry for the late reply!

AssignVariableOp only has host code, so we handle it by registering its DEVICE_DEFAULT kernel (i.e., a fallback kernel any device can use, with tensors stored in host memory). The problem is that we have only registered it for the Variant data type so far. To support more data types for this op, you can follow how the data types for DEVICE_CPU or DEVICE_GPU are registered. If you would like to upstream the changes, please feel free to open a GitHub PR and tag me (I’m @penpornk there).
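
For reference, the per-dtype registrations in TF core follow roughly this pattern (a simplified sketch based on tensorflow/core/kernels/resource_variable_ops.cc; the real file registers several more type groups, and AssignVariableOp<Device, T> is the existing kernel class there):

// Simplified sketch; lives inside namespace tensorflow, as in the real file.
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"

using CPUDevice = Eigen::ThreadPoolDevice;  // host-memory device type

#define REGISTER_KERNELS(type)                                \
  REGISTER_KERNEL_BUILDER(Name("AssignVariableOp")            \
                              .Device(DEVICE_DEFAULT)         \
                              .TypeConstraint<type>("dtype"), \
                          AssignVariableOp<CPUDevice, type>)

// Expands the registration once per standard data type.
TF_CALL_ALL_TYPES(REGISTER_KERNELS);
#undef REGISTER_KERNELS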

Thank you for taking the time to answer.
It is still a mystery to me what data a variable actually contains in the context of a plugin implementation. I saw that it is meant to be a container for a ResourceHandle, but I don’t understand whether the ResourceHandle should be a custom definition from the plugin developer or a type coming from the TensorFlow sources.

Also, my main problem is that I am just trying to run a simple example on a plugin that only registers a Conv2D kernel, but TensorFlow complains that the AssignVariableOp kernel is missing.
I would like to understand why TensorFlow needs AssignVariableOp in this case, and whether I will need to implement it for my plugin or not.
I should mention that I am not familiar with the variable operators.
Here is the test code:

import tensorflow as tf

input_shape = (1, 28, 28, 3)

x = tf.random.normal(input_shape)
y = tf.keras.layers.Conv2D(12, 3, use_bias=False)(x)

And this is the error:

Traceback (most recent call last):
  File "test_tf_conv2d.py", line 12, in <module>
    y = tf.keras.layers.Conv2D(12, 3, use_bias=False)(x)
  File "/home/ubuntu/python3-venv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'AssignVariableOp' OpKernel for 'PPU' devices compatible with node {{node AssignVariableOp}}
	.  Registered:  device='XLA_CPU'
  device='CPU'; dtype in [DT_QINT32]
  device='CPU'; dtype in [DT_QUINT8]
  device='CPU'; dtype in [DT_QINT8]
  device='CPU'; dtype in [DT_VARIANT]
  device='CPU'; dtype in [DT_RESOURCE]
  device='CPU'; dtype in [DT_STRING]
  device='CPU'; dtype in [DT_BOOL]
  device='CPU'; dtype in [DT_COMPLEX128]
  device='CPU'; dtype in [DT_COMPLEX64]
  device='CPU'; dtype in [DT_DOUBLE]
  device='CPU'; dtype in [DT_FLOAT]
  device='CPU'; dtype in [DT_BFLOAT16]
  device='CPU'; dtype in [DT_HALF]
  device='CPU'; dtype in [DT_INT32]
  device='CPU'; dtype in [DT_INT8]
  device='CPU'; dtype in [DT_UINT8]
  device='CPU'; dtype in [DT_INT16]
  device='CPU'; dtype in [DT_UINT16]
  device='CPU'; dtype in [DT_UINT32]
  device='CPU'; dtype in [DT_INT64]
  device='CPU'; dtype in [DT_UINT64]
 [Op:AssignVariableOp]

Would love an update on this, please.
My plugin development is being slowed down by not understanding what I need to do for the variable handling.
Having a concrete example of what needs to be done for some device X would be very helpful.

Can anyone help me with this please?

I think I have made some progress in my understanding of how it should work, but there are still some unknowns.

ResourceHandle has a dtypes_and_shapes method; why/how can a resource be associated with multiple dtypes and shapes?
Also, something weird: I have a ReadVariableOp on a resource and the shape in the resource is [0, 0, 0, 0], while it was something else in the preceding AssignVariableOp (same resource name).

This fell through the cracks. I’m sorry for the late reply!

PR #56936 registers more data types for the DEVICE_DEFAULT AssignVariableOp kernel and should help with the issue.

Looping in @PengWang for Variable / ResourceHandle questions.

Thank you for answering.
I think I solved my problem by just calling TF_AssignVariable in my kernel implementation.
I previously thought that the resource handle was an opaque type I had to implement as a plugin author, and that was causing problems, since it turns out not to be the case.
It seems that calling TF_AssignVariable takes care of everything.
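
For anyone who finds this later, this is roughly what my compute function ended up looking like (a sketch; the TF_AssignVariable signature here is the TF 2.8-era one from tensorflow/c/kernels_experimental.h and may differ in newer releases, and MyCopyTensor/MyAssignVarCompute are my own names):

#include <cstring>

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/kernels_experimental.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

// Device-specific copy of the value into the variable's buffer; for a
// host-memory device a plain byte copy is enough.
static void MyCopyTensor(TF_OpKernelContext* ctx, TF_Tensor* source,
                         TF_Tensor* dest) {
  std::memcpy(TF_TensorData(dest), TF_TensorData(source),
              TF_TensorByteSize(source));
}

void MyAssignVarCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  // Input 0 is the resource handle, input 1 is the value. TF_AssignVariable
  // looks up (or creates) the variable behind the handle and copies the
  // value into it through the callback.
  TF_AssignVariable(ctx, /*input_index=*/0, /*value_index=*/1,
                    MyCopyTensor, status);
  if (TF_GetCode(status) != TF_OK) {
    TF_OpKernelContext_Failure(ctx, status);
  }
  TF_DeleteStatus(status);
}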


Hi @slai-nick , glad to hear that you’ve made progress. Some responses to your side questions:

y = tf.keras.layers.Conv2D(12, 3, use_bias=False)(x)

keras.layers is complicated machinery. keras.layers.Conv2D may internally create or assign variables for storing e.g. training steps or statistics.

I have a ReadVariableOp on a resource and the shape in the resource is [0, 0, 0, 0], while it was something else in the preceding AssignVariableOp (same resource name).

There are two shapes associated with a variable: (1) the shape of the resource handle (which is basically a pointer packaged as a tensor), which should always be []; (2) the shape of the variable’s value/payload. I guess your [0, 0, 0, 0] here means (2)?
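
A quick way to see (1) from inside a kernel, sketched with the C API (the function name is hypothetical):

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

void InspectHandleShape(TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  TF_Tensor* handle = nullptr;
  TF_GetInput(ctx, 0, &handle, status);  // input 0: the DT_RESOURCE handle
  int ndims = TF_NumDims(handle);        // shape (1): expect 0 dims, i.e. []
  // Shape (2), the payload shape, lives on the variable's value tensor,
  // not on the handle tensor itself.
  (void)ndims;
  TF_DeleteTensor(handle);
  TF_DeleteStatus(status);
}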

Hi, thank you for coming to help.
I don’t remember exactly, but I now understand the concept of variables and resources as you just described.
I was wondering how I should implement it, and it seems that using TF_AssignVariable solved my issue.
Otherwise, correct me if I’m wrong: I think that if I were to implement it fully (including the variable creation), I would need to use the resource-handle proto and store that in the resource tensor.

You would need to use the ResourceHandle class (not the proto) and store that in the resource tensor.
