Character-Level Text Generation Model

I’m trying to create a text generation model that learns at the character level. I’ve managed to tokenize the text and build a tf.data dataset similar to the one in this tutorial - Text generation with an RNN  |  TensorFlow

However, instead of vectorizing and having something like ids_from_chars, I’ve used Unicode code points, so a sample from the dataset, obtained by the following piece of code:

for input_example, target_example in unicode_encoded_dataset.take(1):
    print("Input :", input_example.numpy())
    print("Target:", target_example.numpy())

looks like this:

Input : [112 114 101 102  97  99 101  32  32  32 115 117 112 112 111 115 105 110
 103  32 116 104  97 116  32 116 114 117 116 104  32 105 115  32  97  32
 119 111 109  97 110  45  45 119 104  97 116  32 116 104 101 110  63  32
 105 115  32 116 104 101 114 101  32 110 111 116  32 103 114 111 117 110
 100  32 102 111 114  32 115 117 115 112 101  99 116 105 110 103  32 116
 104  97 116  32  97 108 108  32 112 104]
Target: [114 101 102  97  99 101  32  32  32 115 117 112 112 111 115 105 110 103
  32 116 104  97 116  32 116 114 117 116 104  32 105 115  32  97  32 119
 111 109  97 110  45  45 119 104  97 116  32 116 104 101 110  63  32 105
 115  32 116 104 101 114 101  32 110 111 116  32 103 114 111 117 110 100
  32 102 111 114  32 115 117 115 112 101  99 116 105 110 103  32 116 104
  97 116  32  97 108 108  32 112 104 105]
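
For context, the encoding to code points can be done with tf.strings.unicode_decode (a minimal sketch, assuming UTF-8 text; I’m showing the first word of the sample above):

import tensorflow as tf

# Decode a string into its Unicode code points; this plays the role
# that ids_from_chars plays in the tutorial.
codepoints = tf.strings.unicode_decode("preface", input_encoding="UTF-8")
print(codepoints.numpy())  # [112 114 101 102  97  99 101]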

I wish to create an LSTM model similar to the one presented here - Character-level text generation with LSTM
However, being new to tf.data datasets, I’m not able to figure out how to get the right shapes; I keep getting errors which I think are due to the batch sizes.

After batching and shuffling, my dataset looks like this:

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int32, name=None), TensorSpec(shape=(64, 100), dtype=tf.int32, name=None))>
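
For reference, the batching and shuffling follow the tutorial; roughly this (BUFFER_SIZE assumed from the tutorial):

BATCH_SIZE = 64
BUFFER_SIZE = 10000

# Shuffle, batch into fixed (64, 100) examples, and prefetch;
# drop_remainder=True keeps every batch exactly BATCH_SIZE sequences.
unicode_encoded_dataset = (
    unicode_encoded_dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)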

I wish to train the model to generate the most probable next Unicode code point at each timestep, bounded by the maximum code point in my dataset (236), so essentially to predict a value between 0 and 236:

model = keras.Sequential(
    [
        keras.Input(shape=(64, 100)),
        layers.LSTM(128),
        layers.Dense(237, activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)

epochs = 40

for epoch in range(epochs):
    model.fit(unicode_encoded_dataset, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)

I get:

    ValueError: Input 0 of layer "sequential_4" is incompatible with the layer: expected shape=(None, 64, 100), found shape=(64, 100)

Can someone help me figure out how to fix this issue?

Maybe @markdaoust can help here.

Hey, I should’ve mentioned that I was able to figure it out; I forgot to close this.

It will be out soon as a guide on keras.io :slight_smile:


It’s still always a good idea to post what the solution was, for anyone with the same problem.

I think it was this line:

keras.Input(shape=(64, 100)),

Don’t include the batch size when passing input shapes to Keras; Keras adds the batch dimension itself, so shape=(64, 100) makes the model expect inputs of shape (None, 64, 100), which is exactly the mismatch reported in the error.
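
For completeness, here is a minimal sketch of a model that fits the shapes above. This is an illustrative example, not the upcoming keras.io guide: it assumes integer code-point inputs and targets of shape (batch, 100), so it adds an Embedding layer to give the LSTM 3D input, sets return_sequences=True to get a prediction at every timestep, and uses the sparse loss since the targets are integers rather than one-hot vectors:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 237  # code points 0..236
SEQ_LEN = 100

model = keras.Sequential(
    [
        # Sequence length only; Keras adds the batch dimension itself.
        keras.Input(shape=(SEQ_LEN,)),
        # Map each integer code point to a dense vector so the LSTM
        # sees 3D input of shape (batch, timesteps, features).
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
        # return_sequences=True yields a prediction at every timestep,
        # matching the (64, 100) integer targets in the dataset.
        layers.LSTM(128, return_sequences=True),
        layers.Dense(VOCAB_SIZE, activation="softmax"),
    ]
)

# Integer targets call for the sparse variant of the loss;
# categorical_crossentropy expects one-hot targets instead.
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.RMSprop(learning_rate=0.01),
)

model.fit(unicode_encoded_dataset, epochs=40)  # the dataset from the question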
