Regarding the discrepancy between the losses of test_on_batch() and train_on_batch() after re-compile()

Calling compile() twice (whether intentionally or accidentally) causes the losses reported by test_on_batch() and train_on_batch() to diverge.
The source code below reproduces this phenomenon.

import os

os.environ["PYTHONHASHSEED"]=str(1234)

import numpy as np
import unittest
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import load_model
import tensorflow as tf
import random as python_random

np.random.seed(123)

python_random.seed(123)

tf.random.set_seed(1234)

initializer = tf.keras.initializers.GlorotUniform(seed=42)


def get_model():
  model = Sequential()
  model.add(Input(shape=(3,)))
  model.add(BatchNormalization())
  model.add(Dense(10,kernel_initializer=initializer))
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  model.add(Dense(4,kernel_initializer=initializer))
  model.compile(optimizer='adam', loss='mean_squared_error')
  model.summary()
  return model


x_data = [
[0.0, 0.0, 0.0],
[0.0, 0.0, 1.0],
[0.0, 1.0, 0.0],
[0.0, 1.0, 1.0],
[1.0, 0.0, 0.0],
[1.0, 0.0, 1.0],
[1.0, 1.0, 0.0],
[1.0, 1.0, 1.0],
]

y_data = [
[0.0, 0.0, 0.0, 1.0],
[0.0, 1.0, 1.0, 1.0],
[0.0, 1.0, 1.0, 1.0],
[0.0, 1.0, 0.0, 0.0],
[0.0, 1.0, 1.0, 1.0],
[0.0, 1.0, 0.0, 0.0],
[0.0, 1.0, 0.0, 0.0],
[1.0, 1.0, 1.0, 0.0],
]

x_data_keras = np.array(x_data)
y_data_keras = np.array(y_data)


max_loop = 500


# training 1
model = get_model()
for i in range(max_loop):
    result_pre = model.test_on_batch(x_data_keras, y_data_keras)
    result_train = model.train_on_batch(x_data_keras, y_data_keras)
    result_post = model.test_on_batch(x_data_keras, y_data_keras)
    print(result_pre, result_train, result_post)
for i in range(max_loop):
    result_pre = model.test_on_batch(x_data_keras, y_data_keras)
    result_train = model.train_on_batch(x_data_keras, y_data_keras)
    result_post = model.test_on_batch(x_data_keras, y_data_keras)
    print(result_pre, result_train, result_post)
print()


# training 2
model = get_model()
for i in range(max_loop):
    result_pre = model.test_on_batch(x_data_keras, y_data_keras)
    result_train = model.train_on_batch(x_data_keras, y_data_keras)
    result_post = model.test_on_batch(x_data_keras, y_data_keras)
    print(result_pre, result_train, result_post)
model.compile(optimizer='adam', loss='mean_squared_error')     # re-compile
print('re-compile')
for i in range(max_loop):
    result_pre = model.test_on_batch(x_data_keras, y_data_keras)
    result_train = model.train_on_batch(x_data_keras, y_data_keras)
    result_post = model.test_on_batch(x_data_keras, y_data_keras)
    print(result_pre, result_train, result_post)
print()

Part of the output is shown below.

Layer (type)                 Output Shape              Param #   
=================================================================
batch_normalization (BatchNo (None, 3)                 12        
_________________________________________________________________
dense (Dense)                (None, 10)                40        
_________________________________________________________________
batch_normalization_1 (Batch (None, 10)                40        
_________________________________________________________________
activation (Activation)      (None, 10)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 44        
=================================================================
Total params: 136
Trainable params: 110
Non-trainable params: 26
_________________________________________________________________
0.6977941393852234 1.1270804405212402 0.6935483813285828
0.6935483813285828 1.1136053800582886 0.6893106698989868
0.6893106698989868 1.1002421379089355 0.6850830316543579
0.6850830316543579 1.0869929790496826 0.6808691620826721
0.6808691620826721 1.0738604068756104 0.6766723990440369
:
:
0.03179752826690674 0.03175930678844452 0.031733617186546326
0.031733617186546326 0.03169688582420349 0.031668905168771744
0.031668905168771744 0.03162558749318123 0.03159947693347931
0.03159947693347931 0.031562428921461105 0.03153043985366821
0.03153043985366821 0.03150099143385887 0.03146739304065704

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
batch_normalization_2 (Batch (None, 3)                 12        
_________________________________________________________________
dense_2 (Dense)              (None, 10)                40        
_________________________________________________________________
batch_normalization_3 (Batch (None, 10)                40        
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 44        
=================================================================
Total params: 136
Trainable params: 110
Non-trainable params: 26
_________________________________________________________________
0.6977941393852234 1.1270804405212402 0.6935483813285828
0.6935483813285828 1.1136053800582886 0.6893106698989868
0.6893106698989868 1.1002421379089355 0.6850830316543579
0.6850830316543579 1.0869929790496826 0.6808691620826721
0.6808691620826721 1.0738604068756104 0.6766723990440369
:
:
0.001201115665026009 0.00034435521229170263 0.001191912218928337
0.001191912218928337 0.00033795437775552273 0.0011880495585501194
0.0011880495585501194 0.0003311052278149873 0.0011792024597525597
0.0011792024597525597 0.0003247302083764225 0.0011489215539768338
0.0011489215539768338 0.000317717669531703 0.0011202542809769511

Training 1 calls train_on_batch() 1000 times.
The losses at the 1000th iteration are 0.03153043985366821 0.03150099143385887 0.03146739304065704, and there is little difference between the values reported by test_on_batch() and train_on_batch().

Training 2 runs train_on_batch() 500 times, calls compile(), then runs train_on_batch() 500 more times.
The losses at the 1000th iteration are 0.0011489215539768338 0.000317717669531703 0.0011202542809769511, which differ by an order of magnitude. (And the gap keeps growing.)

Do you know what causes this phenomenon?
And if this happens, how can I avoid it?

If you reuse the same model instance, I think you need to manually reset your model weights.


How can I recover if I accidentally call compile()?

Hi.

Further investigation revealed the following.

  • It occurs when BatchNormalization is used; without BatchNormalization it does not occur.
  • It does not occur when saving with save() and restoring with load_model().
  • It does occur when saving with save_weights(), re-initializing the network, and restoring with load_weights().
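For reference, BatchNormalization behaves differently in the two modes, which is presumably why it is involved: train_on_batch() normalizes with the current batch statistics, while test_on_batch() runs the layer in inference mode and uses the moving averages. A minimal NumPy sketch of the two modes (assuming the Keras defaults momentum=0.99, epsilon=0.001), showing that the outputs can differ substantially while the moving statistics still lag:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 3))   # batch far from mean 0 / var 1

gamma, beta, eps, momentum = 1.0, 0.0, 1e-3, 0.99
moving_mean = np.zeros(3)          # BN moving statistics start at mean 0, var 1
moving_var = np.ones(3)

# Training mode (train_on_batch): normalize with the current batch's statistics.
mu, var = x.mean(axis=0), x.var(axis=0)
train_out = gamma * (x - mu) / np.sqrt(var + eps) + beta

# The moving averages are nudged only slightly per batch (momentum 0.99).
moving_mean = momentum * moving_mean + (1 - momentum) * mu
moving_var = momentum * moving_var + (1 - momentum) * var

# Inference mode (test_on_batch): normalize with the lagging moving statistics.
test_out = gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

print(np.abs(train_out - test_out).max())   # clearly nonzero
```

So some gap between the two losses is expected with BatchNormalization even without recompiling; what is anomalous above is that the gap grows after compile().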

I want to restore a model that was accidentally recompiled back to its original state.
(I don’t want to reinitialize it.)

BRs
take hamster

Add information.

  • If the optimizer is SGD, the outputs of training 1 and training 2 seem to match.
  • The phenomenon occurs when the optimizer is Adam or Adamax.
  • With Adagrad, Adadelta, or Nadam it did not seem to occur.
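This would fit Adam keeping per-parameter slot variables (the first- and second-moment estimates) plus a step count, all of which compile() recreates from zero, whereas plain SGD without momentum is stateless. A minimal NumPy sketch of the standard Adam update on a toy quadratic loss (the learning rate and target here are arbitrary illustration values), showing that resetting the state mid-training changes the trajectory:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-7):
    # One Adam update; m and v are the slot variables that compile() recreates.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)       # bias correction depends on the step count t
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

target = np.array([1.0, -1.0])
grad = lambda w: w - target         # gradient of a simple quadratic loss

# Continuous training: m, v and the step count carry over for 100 steps.
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    w, m, v = adam_step(w, grad(w), m, v, t)

# Same schedule, but "re-compiled" after step 50: state and step count reset.
w2, m2, v2 = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 51):
    w2, m2, v2 = adam_step(w2, grad(w2), m2, v2, t)
m2, v2 = np.zeros(2), np.zeros(2)   # what compile() effectively does to Adam
for t in range(1, 51):              # the step count restarts as well
    w2, m2, v2 = adam_step(w2, grad(w2), m2, v2, t)

print(np.abs(w - w2).max())         # nonzero: the two runs diverge
```

With plain SGD the update is a pure function of the current gradient, so the same reset would change nothing, which matches the observation above.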

So you need to save and restore the optimizer state, which is normally done internally by the model saving API (h5/saved_model).


You can try your tests with a little bit of private API:

optw = model.optimizer.get_weights()                            # save the optimizer state
model.compile(optimizer="Adam", loss='mean_squared_error')      # re-compile
print('re-compile')
model.optimizer._create_all_weights(model.trainable_variables)  # private API: build the slot variables
model.optimizer.set_weights(optw)                               # restore the saved state

Hi.

I tried the following implementation.

optw = model.optimizer.get_weights()
model.compile(optimizer='adam', loss='mean_squared_error')
print('re-compile')
model.optimizer._create_all_weights(model.trainable_variables)
model.optimizer.set_weights(optw)

This worked, and the outputs of Training 1 and Training 2 now match.
Thank you.

Is it possible to recover a network that has already been re-compiled?
Or is it okay to keep training as-is, because only the displayed loss value is strange and the model is otherwise operating normally?

BRs

take hamster

The network weights, yes, but not exactly the optimizer state, if you haven’t saved the model or used a trick like this.

You could always try adapting the initial learning rate as a rough proxy, so as not to break your learned network weights too much.


Hi.

You could always try adapting the initial learning rate as a rough proxy, so as not to break your learned network weights too much.

I see.
Thank you very much.

BRs
take hamster