Getting NaN for loss

I used the TensorFlow book example, but the concatenated version of the NN with two different inputs outputs NaN. There is a second, simpler, similar piece of code further below in which a single input is split and concatenated back, and that one works. What is wrong with the two-input code below that makes it output NaN?

Here is the code that outputs NaN from the output layer. (As a debugging effort, I put the much simpler second version, which works, far below.)

In brief, the layer flow in the code below goes:
inputA → (to concat layer)
inputB → hidden1 → hidden2 → (to concat layer)
concat → output

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

print("X_train/y_train/full/test shapes: ", X_train_full.shape, X_test.shape, y_train_full.shape, y_test.shape)
print("X_/y_/train/valid: ", X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test[:3, 2:]

print("X_train_A/B/valid_A/B: ", X_train_A.shape, X_train_B.shape, X_valid_A.shape, X_valid_B.shape)
print("X_test_A/B/new_A/B: ", X_test_A.shape, X_test_B.shape, X_new_A.shape, X_new_B.shape)

scaler=StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))

history = model.fit(
(X_train_A, X_train_B), y_train,
epochs=20,
validation_data=((X_valid_A, X_valid_B),y_valid)
)
print("training result (shape): ", history)
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((
X_new_A,
X_new_B))

model.save("p310.h5")

output:

11610/11610 [==============================] - 1s 58us/sample - loss: nan - val_loss: nan
Epoch 2/20
11610/11610 [==============================] - 0s 35us/sample - loss: nan - val_loss: nan
Epoch 3/20
11610/11610 [==============================] - 0s 37us/sample - loss: nan - val_loss: nan
Epoch 4/20

Working code:

NN layers:
input_ → hidden1 → hidden2 → (to concat layer)
input_ → (to concat layer)
concat → output

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

print("training data shapes:")
print("X_train_full/test/y_train_full/test: ", X_train_full.shape, X_test.shape, y_train_full.shape, y_test.shape)
print("X_train/X_valid/y_train/y_valid: ", X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

scaler=StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))

history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
print("training result (shape): ", history)
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]  # pretend these are new instances
y_pred = model.predict(X_new)

output:

11610/11610 [==============================] - 1s 54us/sample - loss: 1.8919 - val_loss: 0.8798
Epoch 2/20
11610/11610 [==============================] - 0s 32us/sample - loss: 0.8452 - val_loss: 0.7558
Epoch 3/20
11610/11610 [==============================] - 0s 34us/sample - loss: 0.7188 - val_loss: 0.6991
Epoch 4/20
11610/11610 [==============================] - 0s 32us/sample - loss: 0.6705 - val_loss: 0.6597
Epoch 5/20

Hi! The problem is not in the concatenation layer but in how you normalize the input data and how you pass it to the model. You transform X_train, but you feed X_train_A and X_train_B into the model, and those were split off before the scaler ran, so they still hold the raw, unscaled values. With inputs on that scale, SGD quickly diverges and the loss becomes NaN.
It's better to use TensorFlow's native normalization utilities rather than scalers from other frameworks: that way you can't forget to transform one of the inputs, and the normalization parameters are saved in the same file as part of the model.
A second point: for a multi-input model, the cleanest way to feed the data is a dataset of dictionaries whose keys match the input layer names.

Your data preprocessing should look like this:

import tensorflow as tf

def process_data(x1, x2, y):
    return {'wide_input': x1, 'deep_input': x2}, y

train_ds = tf.data.Dataset.from_tensor_slices((X_train_A, X_train_B, y_train)).map(process_data).batch(64)

And the same goes for the validation and the test set.
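To make that concrete, here is a minimal sketch of building the validation pipeline the same way (small random arrays stand in for the real housing splits; the shapes mirror the 5- and 6-column inputs above):

```python
import numpy as np
import tensorflow as tf

def process_data(x1, x2, y):
    # Keys must match the Input layer names in the model.
    return {'wide_input': x1, 'deep_input': x2}, y

# Stand-in arrays with the same column counts as X_valid_A / X_valid_B.
X_valid_A = np.random.rand(8, 5).astype('float32')
X_valid_B = np.random.rand(8, 6).astype('float32')
y_valid = np.random.rand(8).astype('float32')

valid_ds = (tf.data.Dataset
            .from_tensor_slices((X_valid_A, X_valid_B, y_valid))
            .map(process_data)
            .batch(64))
```

The test set is built identically; `model.fit(train_ds, validation_data=valid_ds, ...)` then takes the datasets directly.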
And the model with normalization included should be initialized like this:

normalizer1 = tf.keras.layers.experimental.preprocessing.Normalization(axis=-1)
normalizer1.adapt(X_train_A)
normalizer2 = tf.keras.layers.experimental.preprocessing.Normalization(axis=-1)
normalizer2.adapt(X_train_B)

input_A = keras.layers.Input(shape=[5] , name='wide_input')
input_B = keras.layers.Input(shape=[6] , name='deep_input')
norm1 = normalizer1(input_A)
norm2 = normalizer2(input_B)
hidden1 = keras.layers.Dense(30, activation='relu')(norm2)
hidden2 = keras.layers.Dense(30, activation='relu')(hidden1)
concat = keras.layers.Concatenate()([norm1, hidden2])
output = keras.layers.Dense(1, name='output')(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

When you call adapt() on a Normalization layer, it learns the statistics of the training subset of the data. The model then automatically normalizes any new data passed to predict() or evaluate().
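Under the hood, the adapted layer stores a per-feature mean and variance and applies (x - mean) / sqrt(var + eps). A plain NumPy sketch of the same arithmetic, on hypothetical stand-in data:

```python
import numpy as np

rng = np.random.default_rng(42)
X_train_A = rng.normal(loc=10.0, scale=50.0, size=(200, 5))  # stand-in training features
X_new = rng.normal(loc=10.0, scale=50.0, size=(3, 5))        # stand-in "new" data

# adapt(): learn per-feature statistics from the training data only
mean = X_train_A.mean(axis=0)
var = X_train_A.var(axis=0)

def normalize(x, eps=1e-7):
    # the transform applied at fit/predict/evaluate time
    return (x - mean) / np.sqrt(var + eps)

X_train_scaled = normalize(X_train_A)
X_new_scaled = normalize(X_new)  # new data reuses the *training* statistics
```

The key point is the last line: new data is always scaled with the statistics learned from the training set, which is exactly what the question's code failed to do for X_train_A and X_train_B.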
Also, the history object does not have a shape; it holds a dictionary. You can plot the training progress with pd.DataFrame(history.history).plot().


Thank you, good catch! I moved the scaling to before splitting into A and B, and now it works! I need to check whether I copied the block of code out of order or the book got it wrong. Thanks again!
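For anyone hitting the same NaN, the minimal fix is just reordering the original sklearn preprocessing: scale first, then split into the two input groups. A sketch with small random stand-in arrays (8 columns, like the housing data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in arrays with the housing data's 8 feature columns.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=50.0, size=(100, 8))
X_valid = rng.normal(loc=10.0, scale=50.0, size=(30, 8))

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # scale FIRST, fitting on train only...
X_valid = scaler.transform(X_valid)

# ...THEN split into the wide (A) and deep (B) input groups
X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
```

Both A and B now hold scaled values, so the model in the question trains without NaNs.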