Getting NaN for loss

I used the TensorFlow book example, but my concatenated network built from two different inputs outputs NaN for the loss. A second, simpler version further below, in which a single input is split and concatenated back, works fine. What is wrong with the two-input code below that makes it output NaN?

Here is the code whose output layer produces NaN (as a debugging effort, I put the much simpler working code far below).

In brief, the layer flow in the code below is:
input_A → (to concat layer)
input_B → hidden1 → hidden2 → (to concat layer)
concat → output

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

print("X_train/y_train/full/test shapes: ", X_train_full.shape, X_test.shape, y_train_full.shape, y_test.shape)
print("X_/y_/train/valid: ", X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test[:3, 2:]

print("X_train_A/B/valid_A/B: ", X_train_A.shape, X_train_B.shape, X_valid_A.shape, X_valid_B.shape)
print("X_test_A/B/new_A/B: ", X_test_A.shape, X_test_B.shape, X_new_A.shape, X_new_B.shape)

scaler=StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))

history = model.fit(
(X_train_A, X_train_B), y_train,
epochs=20,
validation_data=((X_valid_A, X_valid_B),y_valid)
)
print("training result (shape): ", history)
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))

model.save("p310.h5")

output:

11610/11610 [==============================] - 1s 58us/sample - loss: nan - val_loss: nan
Epoch 2/20
11610/11610 [==============================] - 0s 35us/sample - loss: nan - val_loss: nan
Epoch 3/20
11610/11610 [==============================] - 0s 37us/sample - loss: nan - val_loss: nan
Epoch 4/20

Working code:

NN layers:
input_ → hidden1 → hidden2 → (to concat layer)
input_ → (to concat layer)
concat → output

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

print("training data shapes:")
print("X_train_full/test/t_train_full/test: ", X_train_full.shape, X_test.shape, y_train_full.shape, y_test.shape)
print("X_train/X_valid/y_train/y_valid: ", X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

scaler=StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))

history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
print("training result (shape): ", history)
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3] # pretend these are new instances.
y_pred = model.predict(X_new)

output:

11610/11610 [==============================] - 1s 54us/sample - loss: 1.8919 - val_loss: 0.8798
Epoch 2/20
11610/11610 [==============================] - 0s 32us/sample - loss: 0.8452 - val_loss: 0.7558
Epoch 3/20
11610/11610 [==============================] - 0s 34us/sample - loss: 0.7188 - val_loss: 0.6991
Epoch 4/20
11610/11610 [==============================] - 0s 32us/sample - loss: 0.6705 - val_loss: 0.6597
Epoch 5/20

Hi! The problem is not in the concatenation layer but in how you normalize the input data and how you pass it to the model. You fit and transform X_train, but you feed X_train_A and X_train_B to the model, and those slices were taken before scaling, so they were never transformed and still contain the raw, unscaled feature values. With unscaled inputs the SGD updates blow up and the loss becomes NaN.
It's better to use TensorFlow's native normalization utilities rather than scalers from other frameworks. That way you cannot forget to transform the inputs, and the normalization parameters are saved in a single file as part of the model.
The second issue is that for a multi-input model the inputs are best passed as a dataset that yields a dictionary whose keys match the input layer names.

Your data preprocessing should look like this:

import tensorflow as tf

def process_data(x1, x2, y):
    return {'wide_input': x1, 'deep_input': x2}, y

train_ds = tf.data.Dataset.from_tensor_slices((X_train_A, X_train_B, y_train)).map(process_data).batch(64)

And the same goes for the validation and the test set.
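For example, the validation and test pipelines can reuse the same mapping (a minimal sketch, reusing process_data and the arrays already defined above):

# Same mapping and batching as the training set
valid_ds = tf.data.Dataset.from_tensor_slices((X_valid_A, X_valid_B, y_valid)).map(process_data).batch(64)
test_ds = tf.data.Dataset.from_tensor_slices((X_test_A, X_test_B, y_test)).map(process_data).batch(64)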
And the model with normalization included should be initialized like this:

normalizer1 = tf.keras.layers.experimental.preprocessing.Normalization(axis=None)
normalizer1.adapt(X_train_A)
normalizer2 = tf.keras.layers.experimental.preprocessing.Normalization(axis=None)
normalizer2.adapt(X_train_B)

input_A = keras.layers.Input(shape=[5] , name='wide_input')
input_B = keras.layers.Input(shape=[6] , name='deep_input')
norm1 = normalizer1(input_A)
norm2 = normalizer2(input_B)
hidden1 = keras.layers.Dense(30, activation='relu')(norm2)
hidden2 = keras.layers.Dense(30, activation='relu')(hidden1)
concat = keras.layers.Concatenate()([norm1, hidden2])
output = keras.layers.Dense(1, name='output')(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

When you call adapt() on a normalization layer, it learns the scale of the training subset of the data. The model will then automatically normalize any new data you pass to predict or evaluate.
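As a quick sanity check (a sketch, reusing normalizer1 from above), you can confirm the adapted layer no longer returns the raw feature scales:

print(normalizer1(X_train_A[:3]).numpy())  # standardized values, not the raw feature ranges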
Also, the history object does not have a shape; history.history is a dictionary of per-epoch metrics. You can plot the training progress with pd.DataFrame(history.history).plot().
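Putting it together, a minimal sketch of compiling, fitting and plotting with the pieces above (assuming the train_ds/valid_ds sketches, and that pandas and matplotlib are installed):

import pandas as pd
import matplotlib.pyplot as plt

model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))
history = model.fit(train_ds, epochs=20, validation_data=valid_ds)

pd.DataFrame(history.history).plot()  # one curve per metric recorded in history.history
plt.xlabel("epoch")
plt.show()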


Thank you, good catch! I did the scaling before splitting into A and B and now it works! I need to check the book to see whether I copied the block of code out of order or the book got it wrong. Thanks again!
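For reference, the reordering that fixed it is just scaling the full feature matrix first and slicing afterwards (sketch):

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

# split into wide (A) and deep (B) inputs only after scaling
X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]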

I am getting another NaN from completely different code, this time an encoder-decoder network. I am still going over the code to see whether it fails for the same reason as above or for something else. I just pasted it here in case a trained eye can spot it more easily:

#encoder decoder network.

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
import pandas as pd
import matplotlib as plt
import sys
import time
import re
import numpy as np
import helper

from collections import Counter
from tensorflow import keras

print(tf.__version__)
print(keras.__version__)

DEBUG=0

CONFIG_ENABLE_PLOT=0
CONFIG_SAVE_MODEL=0

DEBUG=0
CONFIG_ENABLE_PLOT=0
CONFIG_EPOCHS=5
CONFIG_BATCH_SIZE=32

for i in sys.argv:
    print("Processing ", i)
    try:
        if re.search("epochs=", i):
            CONFIG_EPOCHS=int(i.split('=')[1])

        if re.search("batch_size=", i):
            CONFIG_BATCH_SIZE=int(i.split('=')[1])

    except Exception as msg:
        print("No argument provided, default values will be used.")

print("epochs: ", CONFIG_EPOCHS)
print("batch_size: ", CONFIG_BATCH_SIZE)

'''

if  len(sys.argv) > 1:
    CONFIG_EPOCHS, CONFIG_BATCH_SIZE = helper.process_params(sys.argv, ["epochs", "batch_size"])
'''

(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
print("X_train[:10]: ", X_train[0][:10])

word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ +  3: word for word, id_ in word_index.items()}
for id_,  token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

" ".join([id_to_word[id_] for id in X_train[0][:10]])

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info = True)
train_size = info.splits["train"].num_examples
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, b">br\\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

print("most common words in vocabulary: ", vocabulary.most_common()[:3])

embed_size = 128
vocab_size = 10000

truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init=tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

encoder = keras.layers.LSTM(512, return_state = True)
encoder_outputs = keras_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer = output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(decoder_embeddings, initial_state = encoder_state, \
    sequence_length=sequence_lengths)
y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.Model(inputs=[encoder_inputs, decoder_inputs, sequence_lengths], outputs=[y_proba])

distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    model=keras.models.Sequential([\
        keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]), \
        keras.layers.GRU(128, return_sequences=True),\
        keras.layers.GRU(128),\
        keras.layers.Dense(1, activation = "sigmoid")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history=model.fit(train_set, epochs=CONFIG_EPOCHS)

LOG:

root@nonroot-MS-7B22:~/dev-learn/gpu/tflow/tensorflow/tflow-2nded# python3 p545.py
2.6.0
2.6.0
Processing  p545.py
epochs:  5
batch_size:  32
X_train[:10]:  [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
2021-11-09 14:07:29.646087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:29.707025: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:29.707630: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:29.709409: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-09 14:07:29.711255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:29.711852: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:29.712408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:30.852327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:30.852565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:30.852764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-09 14:07:30.852934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6677 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5
2021-11-09 14:07:31.137935: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
most common words in vocabulary:  [(b'<pad>', 205484), (b'the', 61137), (b'a', 38564)]
2021-11-09 14:07:40.594000: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
Epoch 1/5
2021-11-09 14:07:46.078451: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
782/782 [==============================] - 12s 6ms/step - loss: nan - accuracy: 0.5000
Epoch 2/5
782/782 [==============================] - 5s 6ms/step - loss: nan - accuracy: 0.5000
Epoch 3/5
782/782 [==============================] - 5s 6ms/step - loss: nan - accuracy: 0.5000
Epoch 4/5
782/782 [==============================] - 5s 6ms/step - loss: nan - accuracy: 0.5000
Epoch 5/5
782/782 [==============================] - 5s 6ms/step - loss: nan - accuracy: 0.5000