Network Appears To Stop Learning Early

Hi, I recently became interested in machine learning, more specifically, neural networks.
I have gone over a bit of introductory material on the topic and am attempting to test my understanding of it so far. The task I am attempting to accomplish is to get a model to predict the relationships between words in its input; however, so far nothing I have tried has yielded good results.

My short-term goal is to understand how to get the model to at least predict correctly on the training and validation data, preferably with dense layers for now. I understand other layer types may be more suitable for sequential data, but since my sentences are relatively short (fewer than 15 words), I am thinking it should be possible with dense layers.

Currently, my model appears to get stuck at a relatively high loss.
Increasing or decreasing the number of layers and/or parameters doesn't seem to help much. Based on my interpretation of the histograms shown by TensorBoard, it appears that most of my layers aren't learning, since the weight distributions remain similar from epoch to epoch.
Either that, or the model learns only on the first epoch and then barely changes after that. I am guessing it has something to do with my loss function, but I don't see the issue yet.
Any suggestions on how I can resolve this?
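
For context, the weight histograms come from logging along these lines; this is only a minimal sketch, with the log directory and fit() arguments as placeholders rather than my exact setup:

import tensorflow as tf

# Log per-layer weight histograms every epoch so they appear in TensorBoard's Histograms tab.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/run1", histogram_freq=1)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=50,
#           callbacks=[tensorboard_cb])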

Some information about my current setup:

Input: a padded, tokenized sentence.
I simply build a list of the unique words across all sentences and map each to an index for now.
I don't use any punctuation in the sentences.
My input data is generated artificially, since for now I only care about getting the model to predict correctly on at least some of the training and validation data.
I generate from templates similar to “Set an alarm {time} {day} to {task}”, “Remind me to {task} {time} {day} to {task}” etc.
For example, {time} can be replaced with things like "noon", "2pm", "this afternoon", etc.
When populating the templates, for each time and day I add a mapping to the task (essentially, the task is related to the time and day, and vice versa). Note: for phrases like "this afternoon", I only map "afternoon".

Output: a (max_input_length, max_input_length) matrix, where words (i.e., their tokens) that are related have a 1 at their intersecting indices, and 0 otherwise.
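
For illustration, here is a minimal sketch of what one generated example and its relation target could look like; the sentence, indices, and related pairs below are made up for the example and are not my actual generation code:

import numpy as np

max_len = 15  # assumed padded length

# One populated template: "set an alarm at noon tomorrow to water plants"
tokens = ["set", "an", "alarm", "at", "noon", "tomorrow", "to", "water", "plants"]

# Word-position pairs treated as related (time <-> task, day <-> task).
related_pairs = [(4, 7), (4, 8),    # "noon"     <-> "water", "plants"
                 (5, 7), (5, 8)]    # "tomorrow" <-> "water", "plants"

# Symmetric 0/1 relation matrix over the padded sentence length.
target = np.zeros((max_len, max_len), dtype=np.float32)
for i, j in related_pairs:
    target[i, j] = 1.0
    target[j, i] = 1.0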

Model configuration: There is no particular significance to the activations I chose or to the specific number of dense layers; I am just experimenting to see their effects.

import tensorflow as tf

input_layer = tf.keras.layers.Input(shape=(max_len,))
embedding_size = 32
embedding_layer = tf.keras.layers.Embedding(len(vocab), embedding_size, name="embedding")(input_layer)
hidden_layer = tf.keras.layers.Dense(embedding_size, name="embedding_dense", activation="relu")(embedding_layer)
hidden_layer = tf.keras.layers.Dense(max_len * max_len, activation="relu", name="dense_with_activation_1")(hidden_layer)
hidden_layer = tf.keras.layers.Flatten()(hidden_layer)
output_layer = tf.keras.layers.Dense(max_len * max_len, activation="tanh", name="dense_with_activation_2")(hidden_layer)
output_layer = tf.keras.layers.Reshape((max_len, max_len))(output_layer)
model = tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(0.1, 100, 0.96)),
    loss=loss)

Loss function: The idea is to have false negatives be weighted more heavily since most relationships should be negative.

def loss(y_true, y_pred):
    # Count the entries where y_true is 1 but the rounded prediction disagrees (false negatives).
    errors = tf.cast(tf.logical_and(tf.equal(y_true, 1), tf.not_equal(y_true, tf.round(y_pred))), dtype=tf.float32)
    num_1_errors = tf.reduce_sum(errors)
    # Scale the element-wise absolute error by the square root of that false-negative count.
    original_loss = tf.abs(y_true - y_pred)
    scaled_loss = tf.sqrt(num_1_errors) * original_loss
    return tf.reduce_sum(scaled_loss)

Certainly! You're facing a common issue in neural network training where your model isn't effectively learning from the training data. Here are key points to consider for improvement:

  1. Enhance Data Preprocessing: Explore advanced text representation techniques like word embeddings (Word2Vec, GloVe) instead of simple index mappings.
  2. Adjust Model Architecture:
  • Re-evaluate the embedding layer size.
  • Consider using sequence-aware layers like RNNs, GRUs, or LSTMs, even for short sentences.
  • Experiment with different activation functions, possibly replacing "tanh" with "sigmoid" or "softmax" for binary classification.
  3. Simplify Loss Function: Start with a standard loss function like binary cross-entropy before moving to more complex custom functions (see the sketch after this list).
  4. Fine-Tune Learning Rate and Optimization: Adjust the initial learning rate and decay parameters in your Adam optimizer with exponential decay schedule.
  5. Address Overfitting or Underfitting: Add dropout layers or regularization for overfitting; increase model complexity or training data for underfitting.
  6. Monitor Additional Metrics: Use accuracy or F1 score alongside loss for a better understanding of model performance.
  7. Systematic Experimentation: Make incremental changes and monitor their impact to identify beneficial adjustments.
  8. Analyze TensorBoard Output: Investigate stagnant weight distributions, which could indicate issues with the learning rate or the model's capacity to learn.
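
For example, here is a minimal baseline sketch covering points 2, 3, and 6: a sigmoid output trained with plain binary cross-entropy and tracked with precision/recall. The vocabulary and layer sizes below are placeholders, not a recommended configuration.

import tensorflow as tf

max_len, vocab_size, embedding_size = 15, 200, 32  # placeholder sizes

inputs = tf.keras.layers.Input(shape=(max_len,))
x = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dense(max_len * max_len, activation="sigmoid")(x)
outputs = tf.keras.layers.Reshape((max_len, max_len))(x)

baseline = tf.keras.models.Model(inputs, outputs)
baseline.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                 loss=tf.keras.losses.BinaryCrossentropy(),
                 metrics=[tf.keras.metrics.Precision(name="precision"),
                          tf.keras.metrics.Recall(name="recall")])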

Overall, machine learning requires iterative experimentation, so keep trying different configurations to find what works best for your task.

Hi Tim_Wolfe, thanks for your response. I have a few more questions if you don’t mind.

  1. I think I am already using word embeddings. First I tokenize words using a simple unique-word-to-index mapping like the following:
vocab = list(set(" ".join(sentences).split()))
vocab.sort()
vocab.insert(0, "<PAD>")
index_to_word = {i: word for i, word in enumerate(vocab)}
word_to_index = {word: i for i, word in enumerate(vocab)}
max_len = max([len(sentence.split()) for sentence in sentences])
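
For completeness, the padding step then looks roughly like this (a sketch assuming post-padding with 0, the "<PAD>" index):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Convert each sentence to its index sequence and pad to max_len with the <PAD> index (0).
sequences = [[word_to_index[word] for word in sentence.split()] for sentence in sentences]
padded_inputs = pad_sequences(sequences, maxlen=max_len, padding="post", value=0)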

However, the first layer of my model is an embedding layer, which accepts and re-represents those tokenized words before passing them on to subsequent layers. It has changed a little since my initial post; this is what it looks like now.

input_layer = tf.keras.layers.Input(shape=(max_len,))
embedding_layer = tf.keras.layers.Embedding(len(vocab), embedding_size, name="embedding", mask_zero=True)(input_layer)

Should that be sufficient for a toy example or do you see any issues with it? I was attempting to avoid using pretrained embeddings and subword tokenizers for simplicity and better understanding.

  2. I have been trying different layer and embedding sizes; however, so far I haven't found a combination that can at least overfit the training data. I also temporarily tried RNN-type layers, but they didn't seem to produce correct output or learn to cluster the embeddings appropriately. The following was the last RNN-type layer I attempted:
hidden_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_size * 2, return_sequences=True, name="bi_lstm"))(embedding_layer)

hidden_layer = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(embedding_size, activation="leaky_relu", name="time_distributed_dense"))(hidden_layer)

I have also tried playing with different activation functions; some appear to lower the loss better, and some appear to result in better embedding clusterings, for example grouping tasks together, days together, etc. However, the outputs were still not desirable.

  3. Initially, I was using built-in loss functions, including binary cross-entropy, but I ran into issues such as the loss becoming very low while the predictions were still very wrong. I am assuming that is because most relationships are 0, so the model could get a low loss by just predicting everything to be 0. My current loss function has changed a bit since my initial post as well.
    I attempted to do something similar to my understanding of cross-entropy, but weighting false negatives more harshly. Do you see any issues with this, or maybe there is a built-in way to accomplish the same thing?
def loss(y_true, y_pred):
    weights_y_true_1 = 8
    weights_y_true_0 = 1
    safety = 1e-10

    # Per-element weights: 8 where the label is 1, 1 where it is 0.
    weights = tf.add(tf.multiply(y_true, weights_y_true_1), tf.multiply(tf.subtract(1.0, y_true), weights_y_true_0))
    # Clip predictions away from 0 and 1 so neither log() below produces inf/NaN.
    predictions_with_safety = tf.clip_by_value(y_pred, safety, 1.0 - safety)

    # Term for entries where y_true == 1 (penalizes false negatives).
    false_negative_loss = tf.multiply(y_true, tf.math.log(predictions_with_safety))
    # Term for entries where y_true == 0 (penalizes false positives).
    false_positive_loss = tf.multiply(tf.subtract(1.0, y_true), tf.math.log(tf.subtract(1.0, predictions_with_safety)))
    loss = tf.add(tf.multiply(false_negative_loss, weights), tf.multiply(false_positive_loss, weights))
    return tf.negative(tf.reduce_mean(loss))
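
For comparison, here is a rough sketch of a built-in alternative, tf.nn.weighted_cross_entropy_with_logits, which weights the positive entries via pos_weight; note it expects raw logits, so the final layer would need no activation rather than the current one:

import tensorflow as tf

def weighted_bce_from_logits(y_true, y_pred_logits):
    # pos_weight scales the loss on entries where y_true == 1 (8 here, matching the custom weights above).
    per_element = tf.nn.weighted_cross_entropy_with_logits(
        labels=y_true, logits=y_pred_logits, pos_weight=8.0)
    return tf.reduce_mean(per_element)
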
  4. I have tried that. I decreased the exponential decay and increased the number of steps between decays, since it seemed like my gradients were decreasing a bit quickly, assuming I graphed them correctly, that is.

  5. I have tried adding dropout with various ratios, but the effects seem negligible so far. I still haven't been able to get the model to overfit the training data yet, so I'm guessing that might have something to do with it.

  6. I have not tried this yet; I will look into how to get those metrics.

  7. I have been trying this as well; sadly, I haven't found the right settings yet.

  8. By stagnant you mean that the distributions don't change much over time, right? If so, I did have that problem at some point as well. The recent ones do appear to change distribution over time now; however, I still don't fully understand what optimal distributions should look like.

Is there any reason I should be unable to overfit at least the training set of examples without using a sequence-based model? I would like to understand why, if that's the case. My latest idea was to try to have the relationships between each pair of words be represented in the model by having two matrices, like the following:

relation_layer = tf.keras.layers.Dense(embedding_size, activation="leaky_relu", name="relation_dense")(embedding_layer)
hidden_layer = tf.keras.layers.Dense(embedding_size, activation="leaky_relu", name="relation_dense_2")(embedding_layer)
hidden_layer = tf.keras.layers.LeakyReLU()(tf.keras.layers.Multiply()([relation_layer, hidden_layer]))
hidden_layer = tf.keras.layers.Dense(embedding_size, activation="leaky_relu", name="relation_dense_3")(hidden_layer)

The idea is that one of the matrices would encode how heavily one embedding should be weighted relative to another; in other words, for each embedding there is a mapping in the second matrix to every other embedding. I added a non-linearity since I am guessing it's probably not a linear relationship.
What I was hoping would happen was that the embeddings would learn to cluster themselves, for example along dimensions that encode "this is task-like" or "this is time-like", and then, based on those clusterings, the model would be able to predict that since this word is a time and this other word is a task, maybe they are related. Does this idea sound fine, or is there something I haven't considered?

I am also considering adding another output layer which predicts entity type, and having that feed into the relationship-predicting layers. I am wondering if the model just needs some more help to contextualize things, since the token indices fed into the embedding probably don't carry much meaning on their own.
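
A minimal sketch of that idea (the number of entity types, layer sizes, and losses are illustrative assumptions, not a tested configuration):

import tensorflow as tf

max_len, vocab_size, embedding_size, num_entity_types = 15, 200, 32, 4  # assumed sizes

inputs = tf.keras.layers.Input(shape=(max_len,))
x = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)

# Auxiliary head: per-token entity type (e.g. task / time / day / other).
entity_probs = tf.keras.layers.Dense(num_entity_types, activation="softmax",
                                     name="entity_type")(x)

# Relation head conditioned on both the embeddings and the entity predictions.
combined = tf.keras.layers.Concatenate()([x, entity_probs])
h = tf.keras.layers.Dense(embedding_size, activation="relu")(combined)
h = tf.keras.layers.Flatten()(h)
h = tf.keras.layers.Dense(max_len * max_len, activation="sigmoid")(h)
relations = tf.keras.layers.Reshape((max_len, max_len), name="relations")(h)

model = tf.keras.models.Model(inputs, [entity_probs, relations])
model.compile(optimizer="adam",
              loss={"entity_type": "sparse_categorical_crossentropy",
                    "relations": "binary_crossentropy"})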

Does what I am attempting to do seem possible to you? If not, can you explain why?