Tape.batch_jacobian() and tape.gradient() give different results

Model: 8 hidden layers, hidden size 20, input size 2, output size 1; TensorFlow 2.4.0.

I am confused: why do different ways of calculating the second derivative give different results?

@tf.function
def get_vanilla_hess(model, xs):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(xs)
        ys = model(xs)
        xbar = tape.gradient(ys, xs)
    xbarbar = tape.batch_jacobian(xbar, xs)
    
    return (ys, xbar, xbarbar)
print(get_vanilla_hess(vanilla_model, X_r)[-1][:, 0, 0])

returns

[-0.0004067 , -0.00038697, -0.00037729, ..., -0.00035329,
        -0.00038197, -0.00038998]

while

@tf.function
def get_vanilla_hess_alt(model, xs):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(xs)
        ys = model(xs)
        xbar = tape.gradient(ys, xs)
    xbarbar = tape.gradient(xbar, xs)
    
    return (ys, xbar, xbarbar)
print(get_vanilla_hess_alt(vanilla_model, X_r)[-1][:, 0])

returns

[-0.00036503, -0.00033761, -0.00032976, ..., -0.00029553,
        -0.00032992, -0.00034215]

Also: a manually constructed graph for computing the Hessian returns

[-0.00040658, -0.00038687, -0.00037727, ..., -0.0003532 ,
       -0.00038189, -0.00039003]

Shouldn't tape.gradient() + tape.gradient() return the same result as tape.gradient() + tape.batch_jacobian() on the diagonal (d^2f/dx^2)?

@markdaoust can you help here?


xbarbar = tape.gradient(xbar, xs)

On this line you're trying to use tape.gradient to calculate the Jacobian.

That’s not how it works.

When you pass gradient a non-scalar target, the result is the gradient of the sum of the target's elements, not the Jacobian.
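
Here's a minimal sketch of the difference, using a made-up function f(x) = x0^2 * x1 in place of your model (so the per-example Hessian is known in closed form): batch_jacobian gives the per-example Hessian, while gradient on the non-scalar xbar gives the row sums of that Hessian.

import tensorflow as tf

# Toy stand-in for the model: f(x) = x0**2 * x1
# Per-example Hessian is [[2*x1, 2*x0], [2*x0, 0]]
xs = tf.constant([[1.0, 2.0], [3.0, 4.0]])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(xs)
    ys = xs[:, 0] ** 2 * xs[:, 1]
    xbar = tape.gradient(ys, xs)          # first derivative, shape [2, 2]

hess = tape.batch_jacobian(xbar, xs)      # per-example Hessian, shape [2, 2, 2]
rowsum = tape.gradient(xbar, xs)          # gradient of sum(xbar), shape [2, 2]

print(hess[:, 0, 0].numpy())    # d2f/dx0^2               -> [4. 8.]
print(rowsum[:, 0].numpy())     # d2f/dx0^2 + d2f/dx1dx0  -> [6. 14.]

So for your network the two printouts should differ by exactly the mixed partial d2f/dx0dx1 for each example; only batch_jacobian (or jacobian) gives the true diagonal of the Hessian.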
