Implementation of soft argmax with custom gradient

I am trying to implement the soft argmax operation in TensorFlow. I want the normal argmax in the forward pass and the softmax approximation in the backward pass. To be precise, my input is in NCHW format, where C is 2 channels. The problem I am facing right now is the dimensionality reduction caused by the slicing: my gradient and output have the same shape (1-channel output), but they differ from the input shape, which as far as I understand is not allowed by TensorFlow.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 409600 values, but the requested shape has 819200
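
A minimal shape check (the sizes below are just placeholders, not my real ones) shows where the channel dimension disappears:

import tensorflow as tf

x = tf.random.normal([1, 2, 64, 64])            # NCHW input with C = 2 (placeholder sizes)

hard = tf.argmax(x, axis=1)                     # argmax reduces the channel axis: [1, 64, 64]
soft = tf.nn.softmax(x, axis=1)[:, 1, :, :]     # slicing does the same: [1, 64, 64]

print(x.shape, hard.shape, soft.shape)          # (1, 2, 64, 64) (1, 64, 64) (1, 64, 64)

hard_kept = tf.expand_dims(hard, axis=1)        # restoring the channel dim gives [1, 1, 64, 64]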

My current best try:

@tf.custom_gradient
def soft_argmax(x):
    out_no_grad = tf.argmax(x, axis=1)

    @tf.function
    def argmax_soft(x):
        # Soft argmax is sum_i(i * softmax_i), which for the 2-class case
        # reduces to the softmax probability of channel 1
        out = tf.nn.softmax(x, axis=1)[:,1,:,:]
        return out

    def grad(dy):
        gradient = tf.gradients(argmax_soft(x), x)[0]
        return gradient * dy
    return out_no_grad, grad

How could I design such a function? Note that I am specifically looking for a solution that defines a custom gradient or a custom function.
To illustrate the idea, the following worked, but it is not the solution that I am looking for:

out_no_grad = tf.cast(tf.argmax(x, axis=1), x.dtype)   # cast so the int argmax can be combined with the float softmax
out_grad = tf.nn.softmax(x, axis=1)[:,1,:,:]
outputs = out_grad + tf.stop_gradient(out_no_grad - out_grad)
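
As a quick sanity check, the gradient of this straight-through style construction does flow through the softmax term (again with placeholder sizes, and with the cast added because tf.argmax returns int64):

x = tf.Variable(tf.random.normal([1, 2, 8, 8]))        # placeholder NCHW sizes

with tf.GradientTape() as tape:
    out_no_grad = tf.cast(tf.argmax(x, axis=1), x.dtype)
    out_grad = tf.nn.softmax(x, axis=1)[:,1,:,:]
    # forward value equals the hard argmax, backward goes through out_grad
    outputs = out_grad + tf.stop_gradient(out_no_grad - out_grad)
    loss = tf.reduce_sum(outputs)

print(tape.gradient(loss, x).shape)                    # (1, 2, 8, 8)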

Edit: In the meantime, I figured out that this implementation actually works once I expand the dims of out_no_grad (a minimal self-contained sketch of that version is at the end of this post). To make this post still interesting for future readers, I will make some assumptions and ask some questions about tf.custom_gradient:

  1. So far, my understanding is that the gradient returned by the grad_fn only needs to have a shape that is broadcastable with respect to dy, not necessarily exactly the same shape. Am I correct in this?
  2. However, dy should always have the same shape as x, right?
  3. Can out_no_grad have a different shape/format than the gradient? In my implementation, as stated, I needed the soft_argmax to be in NCHW format, but for an optimization pass later I need it back in NHWC again. Would the following still make sense? I'm getting tangled up a bit in the gradients here. My gut tells me that it would not make sense and will break gradient flow.
@tf.custom_gradient
def soft_argmax(x):
    out_no_grad = tf.argmax(x, axis=1)
    out_no_grad = tf.reshape(out_no_grad, [-1,self.c,self.h,self.w])
    out_no_grad = tf.transpose(out_no_grad, [0,2,3,1])

    @tf.function
    def argmax_soft(x):
        # Soft argmax is sum_i(i * softmax_i), which for the 2-class case
        # reduces to the softmax probability of channel 1
        out = tf.nn.softmax(x, axis=1)[:,1,:,:]
        return out

    def grad(dy):
        gradient = tf.gradients(argmax_soft(x), x)[0]
        return gradient * dy
    return out_no_grad, grad
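
For reference, here is a minimal self-contained sketch of the version described in the edit (out_no_grad with its channel dim expanded), with a tf.print in grad so you can see which shapes actually arrive there. I am using tf.GradientTape instead of tf.gradients so that it also runs eagerly, and I cast the argmax output to the input dtype so the forward output stays a float tensor; both of these are my own choices rather than the only way to do it:

import tensorflow as tf

@tf.custom_gradient
def soft_argmax(x):
    # Forward pass: hard argmax over the channel axis (NCHW, C = 2),
    # with the channel dim restored and cast to the input dtype
    out_no_grad = tf.argmax(x, axis=1)                    # [N, H, W]
    out_no_grad = tf.expand_dims(out_no_grad, axis=1)     # [N, 1, H, W]
    out_no_grad = tf.cast(out_no_grad, x.dtype)

    def grad(dy):
        # dy comes in with the shape of the output above, not of x
        tf.print("dy:", tf.shape(dy), " x:", tf.shape(x))
        # Backward pass: gradient of the soft argmax, i.e. of softmax(x)[:, 1]
        with tf.GradientTape() as inner_tape:
            inner_tape.watch(x)
            soft = tf.nn.softmax(x, axis=1)[:, 1:2, :, :]  # keep the channel dim: [N, 1, H, W]
        return inner_tape.gradient(soft, x, output_gradients=dy)  # [N, 2, H, W]

    return out_no_grad, grad

# Usage with placeholder sizes; the outer tape triggers the custom grad
x = tf.random.normal([1, 2, 8, 8])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = soft_argmax(x)
print(tape.gradient(y, x).shape)                           # (1, 2, 8, 8)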