Problem with GRU Stacking in Text Generation Tutorial

I am following TensorFlow’s text generation tutorial, and it says that one way to improve the model is to add another RNN layer.

The model in the tutorial is this:

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__(self)
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

And I tried adding a layer doing this:

class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__(self)
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    self.gru2 = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x, states = self.gru2(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

The accuracy during training is above 90% but the text that it generates is nonsensical.
What am I doing wrong?
How should I add that new layer?

Edit:
This is an example of the text generated:
Y gué el chme th ¡G : i uit: R dud d RR dududut ded,d!D! ties, is: y ui: iu,: ¡RRAShad wy…Ze…Zlegh Fither k.…WIkk.DR… t: W: R: IXII?IllawfGh…ZEWThedWe td y: W,Y,!:Z

Edit 2:
This is the tutorial I am following:

My code is essentially the same except for the new GRU layer.


RNN/LSTMs are not the current state of the art for natural language generation. For example, the Google AI Blog post Reformer: The Efficient Transformer says:

In the language domain, long short-term memory (LSTM) neural networks cover enough context to translate sentence-by-sentence. In this case, the context window (i.e., the span of data taken into consideration in the translation) covers from dozens to about a hundred words. The more recent Transformer model not only improved performance in sentence-by-sentence translation, but could be used to generate entire Wikipedia articles through multi-document summarization. This is possible because the context window used by Transformer extends to thousands of words.

Have you tried any of the pre-trained Transformer-based models or trained one of your own?

You may find the following posts and research papers useful:


Thank you for the links. I will look into them when I am more versed; right now I'm too much of a noob for them.
Nonetheless, more than the generated text itself, I'm interested in knowing what I'm doing wrong. Maybe I didn't explain it well, but the output looks like a random series of letters rather than the result it should give. It's not random phrases, it's just random letters.


Have you already tried to reproduce a very minimal char-level text generation example with your data?

For a good step by step example see:


Yes, I have built simpler text generators. With just one GRU or LSTM it works fine on the Shakespeare dataset, but I added more data, the model needed more capacity to learn it, so I added another layer, and that is where my problem started. To be clear, my problem is with the two GRUs. I just want to know how the new GRU layer is supposed to be implemented.


You are treating GRU states as intermediate outputs processed sequentially (you create some states, pass them to the first layer, get back new states, pass them to the second layer, then retrieve the final states…). But that's not what RNN states are! They're parallel, not sequential: the state of GRU2 is completely independent of the state of GRU1.

If you want to do stateful processing, then you should be loading both states independently at the start of each batch (and also returning both states at the end).

The good news is that a GRU has only one state vector, so the code will be a bit simpler than if you were dealing with an LSTM, which has multiple state vectors. You'd do something like:

    if states is None:
      state1 = self.gru1.get_initial_state(x)
      state2 = self.gru2.get_initial_state(x)
    else:
      state1, state2 = states
    x, state1 = self.gru1(x, initial_state=state1, training=training)
    x, state2 = self.gru2(x, initial_state=state2, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, [state1, state2]
    else:
      return x
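
For generation you would then feed both states back into the model at every step instead of resetting them. Here's an untested sketch of a character-sampling loop, assuming the fixed two-GRU model above and the ids_from_chars / chars_from_ids StringLookup layers from the tutorial:

    import tensorflow as tf

    # Untested sketch: sample one character at a time, re-injecting both GRU states.
    states = None  # None -> both GRUs start from their zero state
    next_char = tf.constant(['ROMEO:'])
    result = [next_char]

    for _ in range(200):
      input_ids = ids_from_chars(tf.strings.unicode_split(next_char, 'UTF-8')).to_tensor()
      logits, states = model(input_ids, states=states, return_state=True)
      logits = logits[:, -1, :]  # keep only the last time step
      predicted_ids = tf.random.categorical(logits, num_samples=1)
      next_char = chars_from_ids(tf.squeeze(predicted_ids, axis=-1))
      result.append(next_char)

    print(tf.strings.join(result)[0].numpy().decode('utf-8'))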

I have done that, and now another problem has emerged. I don't know if you could help me with that too.

Apparently now the shapes are incompatible:
ValueError: Shapes (64, 100) and (64, 100, 78) are incompatible

I don't know where that mismatch comes from. If it helps, my dataset looks like this:
<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

How could I fix this?


Hard to say since I don’t know what shapes these are.

I recommend switching your entire model to the Functional API; you'll find it much easier to debug. There's no justification for using Model subclassing here.

Basic version:

inputs = Input(shape=...)
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.GRU(rnn_units, return_sequences=True)(x)
x = layers.GRU(rnn_units, return_sequences=True)(x)
outputs = layers.Dense(vocab_size)(x)
model = Model(inputs, outputs)

Version that allows state reinjection:

inputs = Input(shape=...)
initial_gru1_state = Input(shape=...)
initial_gru2_state = Input(shape=...)
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x, state1 = layers.GRU(rnn_units, return_sequences=True, return_state=True)(x, initial_state=initial_gru1_state)
x, state2 = layers.GRU(rnn_units, return_sequences=True, return_state=True)(x, initial_state=initial_gru2_state)
outputs = layers.Dense(vocab_size)(x)
model = Model(inputs, [outputs, state1, state2])

Untested code obviously, but going this route (the standard route) will make your life much easier.
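
One more note on the shape error: the tutorial trains with tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), which takes integer targets of shape (batch, seq_len) against logits of shape (batch, seq_len, vocab_size). A non-sparse categorical loss or metric expects one-hot targets instead, and that kind of mismatch typically surfaces as a "(64, 100) and (64, 100, 78) are incompatible" error. Here is an untested compile sketch for the basic version above, with assumed hyperparameters (embedding_dim=256, rnn_units=1024) and the (64, 100) integer batches you described:

import tensorflow as tf
from tensorflow.keras import layers, Input, Model

# Assumed hyperparameters; vocab_size matches the 78 you mentioned.
vocab_size, embedding_dim, rnn_units = 78, 256, 1024

inputs = Input(shape=(None,), dtype='int64')  # integer character ids
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.GRU(rnn_units, return_sequences=True)(x)
x = layers.GRU(rnn_units, return_sequences=True)(x)
outputs = layers.Dense(vocab_size)(x)  # logits: (batch, seq_len, vocab_size)
model = Model(inputs, outputs)

# Integer targets of shape (batch, seq_len) require the *sparse* loss.
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(dataset, epochs=20)  # dataset yields ((64, 100), (64, 100)) int64 batches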


It keeps giving me the same error:

ValueError: Shapes (64, 100) and (64, 100, 78) are incompatible

The 78 is the vocabulary size of the Embedding layer, 64 is the batch size and 100 is the sequence length.


Would you be able to share your code here, so it'd be easier to debug? Maybe via GitHub or a Colab notebook.


I rebuilt my dataset from scratch and now it works. I don't know what I did to it before; I probably reshaped it or changed something I shouldn't have. But now it works fine. Thank you.


I have changed my dataset and now it works. Thank you very much.
