I am working on a pluggable device and I am using TF_ForwardInputOrAllocateOutput in some of my kernels, but I don't see its effect in practice.
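For context, each of those kernels calls it from its Compute function roughly like this (a minimal sketch for an element-wise binary op; the kernel name is a placeholder, and the call follows the TF_ForwardInputOrAllocateOutput declaration in tensorflow/c/kernels.h, so the details may differ slightly between TF versions):

```c
#include <stdint.h>

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

/* Sketch of the Compute function of an element-wise binary kernel. */
static void MyBinaryOpCompute(void* kernel, TF_OpKernelContext* ctx) {
  (void)kernel;  /* per-kernel state unused in this sketch */
  TF_Status* status = TF_NewStatus();

  TF_Tensor* a = NULL;
  TF_GetInput(ctx, 0, &a, status);
  if (TF_GetCode(status) != TF_OK) { TF_DeleteStatus(status); return; }

  /* Element-wise op: output 0 has the same shape as input 0. */
  int num_dims = TF_NumDims(a);
  int64_t dims[8] = {0};
  for (int d = 0; d < num_dims && d < 8; ++d) dims[d] = TF_Dim(a, d);

  /* Ask the runtime to reuse the buffer of input 0 or input 1 for
   * output 0; if neither can be forwarded, a new buffer is allocated. */
  int candidates[2] = {0, 1};
  int forwarded_input = -1;
  TF_Tensor* out = TF_ForwardInputOrAllocateOutput(
      ctx, candidates, /*num_candidate_input_indices=*/2,
      /*output_index=*/0, dims, num_dims, &forwarded_input, status);

  if (TF_GetCode(status) == TF_OK) {
    /* forwarded_input >= 0 -> out aliases that input's buffer,
     * forwarded_input == -1 -> a fresh buffer was allocated.
     * ... enqueue the device computation on TF_TensorData(a) /
     * TF_TensorData(out) here ... */
  }

  if (out != NULL) TF_DeleteTensor(out);  /* drop local handles only */
  TF_DeleteTensor(a);
  TF_DeleteStatus(status);
}
```

The intent is that output 0 reuses the buffer of input 0 or input 1 whenever the runtime allows it.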
Here is a simple example:
```python
import tensorflow as tf

@tf.function
def run_graph(model, x):
    return model(x)

# Small graph: two Add -> ReLU pairs on the same tensor.
shape = [2, 2]
x = tf.keras.Input(shape, batch_size=1)
y = x
y = tf.keras.layers.Add()([y, y])
y = tf.keras.layers.ReLU()(y)
y = tf.keras.layers.Add()([y, y])
y = tf.keras.layers.ReLU()(y)
model = tf.keras.Model(inputs=x, outputs=y)

x = tf.constant(
    [[
        [1, 2],
        [3, 4],
    ]]
)
y = run_graph(model, x)
print(y)
```
On the plugin side, I print the device memory address of each input/output tensor in each kernel:
```
Op: Cast ( 1)  Output: mem0005  Input(s): mem0004
Op: Mul  ( 1)  Output: mem0006  Input(s): mem0002 mem0005
Op: Relu ( 1)  Output: mem0007  Input(s): mem0006
Op: Mul  ( 2)  Output: mem0008  Input(s): mem0007 mem0003
Op: Relu ( 2)  Output: mem0009  Input(s): mem0008
```
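Those addresses are just the raw buffer pointers of the tensors, collected more or less like this (a sketch; the mem000N labels are my own formatting of the pointers, and `output` is the tensor returned by TF_ForwardInputOrAllocateOutput above):

```c
#include <stdio.h>

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

/* Sketch: how the addresses in the trace above are obtained. TF_TensorData
 * returns the tensor's buffer address (device memory for my plugin). */
static void LogAddresses(TF_OpKernelContext* ctx, int num_inputs,
                         TF_Tensor* output) {
  TF_Status* status = TF_NewStatus();
  printf("Output: %p Input(s):", TF_TensorData(output));
  for (int i = 0; i < num_inputs; ++i) {
    TF_Tensor* in = NULL;
    TF_GetInput(ctx, i, &in, status);
    if (TF_GetCode(status) == TF_OK) {
      printf(" %p", TF_TensorData(in));
      TF_DeleteTensor(in);
    }
  }
  printf("\n");
  TF_DeleteStatus(status);
}
```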
What I would expect is that the output address of each operation (except the first one, so that x is not overwritten) is one of its input addresses, since I use forwarding in each of these kernels.
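For illustration, this is the kind of trace I was expecting with forwarding in effect (the addresses are hypothetical, only meant to show the in-place reuse):

```
Op: Mul  ( 1)  Output: mem0005  Input(s): mem0002 mem0005
Op: Relu ( 1)  Output: mem0005  Input(s): mem0005
Op: Mul  ( 2)  Output: mem0005  Input(s): mem0005 mem0003
Op: Relu ( 2)  Output: mem0005  Input(s): mem0005
```

Instead, in the trace above every kernel receives a freshly allocated output buffer.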