Hi,

I have been teaching myself the mathematics of backpropagation and have been using Keras to check my results for errors.

Using a simple model with an input layer with 1 input, a dense layer with 1 neuron (d1), a dense layer with 1 neuron (d2), and an output layer with 1 neuron (o1), I expected the following calculation to be performed:

(error derivative) * (o1 activation derivative) * (d2 output value) * (d2 activation derivative) * (d1 output value) * (d1 activation derivative) * (input value)
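To make the arithmetic concrete, here is a minimal plain-Python sketch of that expected product for the 1-1-1-1 chain. The sigmoid activation, squared-error loss, and all the toy weight/input values are my own assumptions for illustration, not values taken from Keras:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy values (assumptions, not taken from Keras)
x = 0.5                        # input value
w1, w2, w3 = 0.1, 0.2, 0.3    # weights into d1, d2, o1

# Forward pass through the 1-1-1-1 chain
z1 = w1 * x;       d1_out = sigmoid(z1)
z2 = w2 * d1_out;  d2_out = sigmoid(z2)
z3 = w3 * d2_out;  o1_out = sigmoid(z3)

target = 1.0
error_deriv = o1_out - target  # derivative of 0.5 * (o1_out - target)**2

# The product described above, factor by factor
expected_grad = (error_deriv
                 * sigmoid_deriv(z3)   # o1 activation derivative
                 * d2_out              # d2 output value
                 * sigmoid_deriv(z2)   # d2 activation derivative
                 * d1_out              # d1 output value
                 * sigmoid_deriv(z1)   # d1 activation derivative
                 * x)                  # input value
print(expected_grad)
```

This only spells out the calculation I expected; it is not a claim about what Keras actually computes.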

Instead, the result comes from the following calculation:

(error derivative) * (o1 activation derivative) * (d1 output value) * (d2 activation derivative) * (d1 activation derivative)

No matter how many layers I add in a row, only the final output layer's gradient is calculated using the output value of the neuron connected to it. Why?

Having accepted that this happens and adjusted for it, I moved on to layers with more neurons, using the following model:

an input layer with 3 inputs, a dense layer with 1 neuron (d1), a dense layer with 4 neurons (d2), and an output layer with 2 neurons (o1). I expected the following calculations to be performed:

foreach output neuron:

gradient = (error derivative) * (o1 activation derivative) * (d2 output value)

foreach d2 neuron:

gradient = (sum of the gradients from the output neurons) * (d2 activation derivative)

finally:

gradient = (sum of the gradients from the d2 neurons) * (d1 activation derivative)
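Written as code, the loops I expected look something like the sketch below. Every per-neuron number here is a made-up placeholder, and I use a single d2 output value in the first loop for simplicity:

```python
# Toy per-neuron values (assumptions for illustration only)
o1_error_derivs = [0.4, -0.2]              # one per output neuron (2 outputs)
o1_act_derivs   = [0.25, 0.2]              # o1 activation derivatives
d2_outputs      = [0.6, 0.5, 0.7, 0.4]     # one per d2 neuron (4 neurons)
d2_act_derivs   = [0.24, 0.25, 0.21, 0.24]
d1_act_deriv    = 0.25

# foreach output neuron:
# gradient = (error derivative) * (o1 activation derivative) * (d2 output value)
output_grads = [err * act * d2_outputs[0]
                for err, act in zip(o1_error_derivs, o1_act_derivs)]

# foreach d2 neuron:
# gradient = (sum of the output-neuron gradients) * (d2 activation derivative)
d2_grads = [sum(output_grads) * act for act in d2_act_derivs]

# finally: (sum of the d2-neuron gradients) * (d1 activation derivative)
d1_grad = sum(d2_grads) * d1_act_deriv
print(d1_grad)
```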

Instead, in the final calculation I am seeing:

((sum of gradient from each d2 neuron) / (number of input neurons)) * (d1 activation derivative)
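The discrepancy can be stated as a one-line relation, sketched here with made-up numbers (num_inputs = 3 as in the model above):

```python
num_inputs = 3           # inputs in the model above
sum_d2_grads = 0.12      # made-up sum of the gradients from the d2 neurons
d1_act_deriv = 0.25      # made-up d1 activation derivative

expected = sum_d2_grads * d1_act_deriv                 # what I expected
observed = (sum_d2_grads / num_inputs) * d1_act_deriv  # what I am seeing
print(observed, expected)
```

So the observed value is exactly the expected value divided by the number of input neurons.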

Why is there a division when calculating the gradients for the weights from the input layer?

Note: batch size = 1, epochs = 1.