CNN refusing to learn relatively simple task, tips?


I’m trying to create a model to determine the “offset” between two images, normalized -1 to 1 for x and y. Example feature:

The label for this particular image is: Shift X: 0.8125 Shift Y: -0.25 = (0.8125, -0.25)
Input image crops are 64x64 pixels, with a maximum shift of 16 pixels in either direction for x and y. (Labels are normalized as shift/16, so each value lies between -1 and 1.)
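For reference, this is roughly how I generate a training pair: cut a reference crop and a shifted crop from the same image, and use the normalized (x, y) shift as the label. (The function name and exact cropping logic here are just an illustration, not my exact code.)

```python
import numpy as np

def make_shifted_pair(image, crop=64, max_shift=16, rng=None):
    """Cut a reference crop and a shifted crop from one image.

    Returns (ref_crop, shifted_crop, label) where label is the
    (x, y) shift normalized by max_shift into [-1, 1].
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    # Top-left of the reference crop, leaving room for the shift.
    y0 = int(rng.integers(max_shift, h - crop - max_shift + 1))
    x0 = int(rng.integers(max_shift, w - crop - max_shift + 1))
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    ref = image[y0:y0 + crop, x0:x0 + crop]
    shifted = image[y0 + dy:y0 + dy + crop, x0 + dx:x0 + dx + crop]
    label = (dx / max_shift, dy / max_shift)  # e.g. dx=13 -> 0.8125
    return ref, shifted, label
```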

I was reading online that siamese CNN models are good at this type of task, so I tried to implement one:

 Layer (type)                   Output Shape         Param #     Connected to
 input_1 (InputLayer)           [(None, 64, 64, 3)]  0           []

 input_2 (InputLayer)           [(None, 64, 64, 3)]  0           []

 sequential (Sequential)        (None, 64)           636064      ['input_1[0][0]',
                                                                  'input_2[0][0]']

 concatenate (Concatenate)      (None, 128)          0           ['sequential[0][0]',
                                                                  'sequential[1][0]']

 dense_2 (Dense)                (None, 24)           3096        ['concatenate[0][0]']

 dense_3 (Dense)                (None, 2)            50          ['dense_2[0][0]']

Total params: 639,210
Trainable params: 639,210
Non-trainable params: 0
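In Keras code, the architecture looks roughly like the sketch below. Only the interface comes from the summary above (64x64x3 inputs, a shared Sequential producing a 64-dim embedding, then the 24-unit and 2-unit dense head); the particular conv stack inside the shared network is an assumption, and the tanh output is one way to keep predictions in the label range.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_siamese(crop=64):
    # Shared embedding network (the "sequential" block in the summary).
    # The conv stack is illustrative; only the 64x64x3 input and
    # 64-dim output are taken from the summary.
    embed = keras.Sequential([
        layers.Input((crop, crop, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
    ])
    a = layers.Input((crop, crop, 3))
    b = layers.Input((crop, crop, 3))
    x = layers.Concatenate()([embed(a), embed(b)])  # (None, 128)
    x = layers.Dense(24, activation="relu")(x)      # 3,096 params
    out = layers.Dense(2, activation="tanh")(x)     # 50 params, range [-1, 1]
    return keras.Model([a, b], out)
```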

But it’s absolutely REFUSING to learn anything: the MAE is stuck at ~0.5 (awful), and spot checks on the output show it basically always predicts a shift of (0, 0), regardless of how many epochs I train it for. I tried:

  • Increasing number of layers
  • Increasing number of nodes per layer
  • Reducing the max shift

But nothing results in an improvement. I’ve shown the most basic version of my network in this post for clarity, but increasing its complexity in the current form has no effect.
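Incidentally, the plateau value itself seems like a clue: if the true shifts are roughly uniform in [-1, 1], a model that always outputs (0, 0) scores an MAE of about 0.5, so the network appears to have collapsed to predicting the label mean rather than learning anything image-dependent. A quick numpy check (assuming uniform labels):

```python
import numpy as np

# Expected MAE of a constant (0, 0) prediction against uniform
# labels in [-1, 1]: E[|U|] = 0.5 -- exactly the observed plateau.
rng = np.random.default_rng(42)
labels = rng.uniform(-1.0, 1.0, size=(100_000, 2))
mae = np.abs(labels - 0.0).mean()
print(round(mae, 2))  # ~0.5
```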

Full code: · GitHub
If you want the training dataset for testing yourself, it’s a reduced version of flickr30k:
Data package from February 11th.

Even just general advice on a better network architecture, or what I might be doing horribly wrong, would be greatly appreciated. Thank you so much in advance!