Keras custom model deteriorates after save and reload

Hi!
I noticed that other people have encountered the same problem as me.
(python - Predictions become random after loading a custom saved Keras model? - Data Science Stack Exchange)
Description: I created a Keras model with a custom layer and I noticed that the results deteriorate significantly if I save and reload it.
To clarify: in scenario 1 everything works fine every time. In scenario 1, I build the model, then I compile and train it. But in scenario 2 (I build the model, save it untrained, reload it, and then compile and train it) the results deteriorate significantly.
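For concreteness, a minimal sketch of the two scenarios (build_model(), scaled_xs and ys are placeholders for my actual model-building helper and training data, and the optimizer/loss are just examples):

import tensorflow as tf

# Scenario 1: build, compile, train -- works fine every time.
model = build_model()
model.compile(optimizer='adam', loss='mse')
model.fit(scaled_xs, ys, epochs=100)

# Scenario 2: build, save untrained, reload, compile, train -- results deteriorate.
model = build_model()
model.save('untrained_model')  # saved without compiling
reloaded = tf.keras.models.load_model('untrained_model')  # custom_objects may be needed depending on the save format
reloaded.compile(optimizer='adam', loss='mse')
reloaded.fit(scaled_xs, ys, epochs=100)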

I have been wondering whether some of the parameters of the custom layer (e.g. the learning rate) are not correctly saved. But this shouldn't be the case, because I always recompile the model after reloading it. So at this point I do not know what could be causing the problem.

I would appreciate any advice.

Can you reproduce this with a dummy Colab example?

I have a dummy notebook, but the problem is not easily reproducible. It appears at random and not very often.

Have you tried with input_signature?

Not yet. I did some tests and I believe I understand where the problem comes from: in the call function I also use three global variables, and I believe the problem comes from the way the Keras model uses these variables when it is loaded. To explain: in the call function, the model parameters are used together with the global variables in a formula, in order to compute some results.
The learnt model parameters are almost the same whether I train the model after building it or after loading it. And if I complete the remaining computations 'by hand', the results in both scenarios are fine (almost the same, as expected). The only difference is that the loaded model reports wrong results (very different from the expected ones), although the global variables as well as the learnt parameters are the same in both scenarios.
So applying the same computation to the same inputs should give the same result, but it does not.
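A minimal sketch of what I mean (not my actual layer; the formula and the weight are made up, only the globals pmean/pscale/smean match my setup): the globals are read inside call together with the learnt weight.

import tensorflow as tf

pmean, pscale, smean = 0.0, 1.0, 0.0   # globals, set per data set

class CustomLayer(tf.keras.layers.Layer):
    def __init__(self, units=1, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        # learnt parameters
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer='zeros', trainable=True)

    def call(self, inputs):
        # learnt parameters combined with the global variables in a formula
        return (tf.matmul(inputs, self.w) + pmean) * pscale + smean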

The results using the same parameters and variables in the two scenarios are different, like this:

for item 123 : scenario is - load + train
po: 2.47, el: -14.24

for item 123 : scenario is - build + train
po: 1.92, el: -3.67

Yes, sorry, the comment was for your other retracing thread at:
https://tensorflow-prod.ospodiscourse.com/t/custom-model-trigerring-retracing-warning/3722?u=bhack

Can you reproduce the error in a small Colab?


Yes, how can I send it to you? (I have a local notebook, not a Colab.)

In the dummy notebook there are two sets of data, and the problem is as follows: starting from scratch with the first set of data, everything is OK (the results for build model + train are the same as the ones for load model + train). The problem appears when I continue with a subsequent set of data: the results for build model + train are different from the ones for load model + train.

If you can reproduce this with dummy data, you can share a minimal example as a GitHub gist notebook or a free Google Colab notebook.


Here it is. The same model also generates the warning below.

“WARNING:tensorflow:6 out of the last 8 calls to <function Model.make_predict_function…predict_function at 0x7f90184d5310> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to Better performance with tf.function | TensorFlow Core and tf.function | TensorFlow Core v2.6.0 for more details.”

Test with the second set of data = not OK (results for build + train are different from load + train).

Where in the notebook is the build + train case?

An example of build + train is:
a) setting up the global variables: pmean, pscale, smean
b) build the model
inputs = tf.keras.Input(shape=(1,))
x = CustomLayer(units=1)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=[x], name='model')
model.summary()
c) compile and train the model
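For step (c), something like this (the optimizer, loss and epoch count here are placeholders, not the notebook's actual settings):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='mse')
model.fit(scaled_xs, ys, epochs=100, verbose=0)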

One set of data consists of global variables + features (scaled_xs) + labels (ys).
When I change to another set of data, thereby changing the global variables, features, and labels, the results obtained with a model that is reloaded and then trained differ from the ones obtained with a model that is rebuilt and then trained. Put differently: the results obtained when I build the model from scratch (redefined as above) are correct with any set of data, but not when I only load a saved (and untrained) model which I then train after reloading. Note that the parameters saved by the model are the same in both scenarios (they are visible in the notebook), so the only difference I can think of is the global variables. Maybe in the second scenario (load + train) the model does not handle the new global variables properly. This is reproducible with the notebook. I hope this clarifies things somewhat.

Yes, I understand, but I am probably missing this step in your notebook for the 2nd set of data:

build + train + predict

Predict is not necessary for my task. The main objective of the custom layer is the computation of the 'po' and 'el' variables. So instead of using predict, I just read the computed variables. These two are what I am interested in finding.

OK, but it seems that you have only manually annotated the last expected result in the last cell, without the code.

Can you try to fix the seeds in your first import cell?
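For example, something like this (the seed value is arbitrary):

import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)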

Done. I also successfully reproduced the problem using only one set of features and labels and two sets of global variables. It is now clear that the problem is related to them (the global vars). I added an updated notebook here:

The issue is that you are in graph mode, so you are not going to change pmean and smean in the graph.

You can check this yourself; add this in your call:

tf.print(smean)
tf.print(pmean)

You can see the difference when you reload and compile the model to run in eager mode with run_eagerly=True.
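For example (the optimizer and loss here are placeholders):

reloaded = tf.keras.models.load_model('untrained_model')
# call() now runs eagerly and re-reads the current globals on each step,
# instead of using the values captured when the graph was traced.
reloaded.compile(optimizer='adam', loss='mse', run_eagerly=True)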

See also:


Thanks for the suggestion, but the solution is not working. I've added an update of self.pmean, self.pscale, self.smean in call using tf.keras.backend.update, but they are not updated when I change the values of the global vars.

The suggestion was not to use that API, but just to understand the flow of the calls.

When you load the model it is already compiled, so you cannot impact the model with the new globals (only if you run eagerly, as I mentioned in the last post):

https://www.tensorflow.org/api_docs/python/tf/keras/models/load_model#returns

Eager execution is enabled by default in TF 2.0. I've also been careful to save the model without compiling it.
I compile the model only after reloading it and before training it. It seems that something is probably not working as it should?

That is not the case for the default behaviour of these APIs:

https://www.tensorflow.org/api_docs/python/tf/keras/Model

By default, we will attempt to compile your model to a static graph to deliver the best execution performance

If that were the case, you should see this warning:

A Keras model instance. If the original model was compiled, and saved with the optimizer, then the returned model will be compiled. Otherwise, the model will be left uncompiled. In the case that an uncompiled model is returned, a warning is displayed if the compile argument is set to True.