TensorFlow.js – Complete sentence / Predict next word in a sentence

I’m developing a Machine Learning “fortune teller” that completes sentences from seed words, but my trained model performs very poorly, to the point of being unusable.

It’s my first time training with tokenized words, and I’ve tried several approaches, but none results in a well-trained model.

Example of my dataset

This July money will be by your side.
The next days love will come at your door.
Be careful with your friend and money.
Love is going to be hard this year.
A lot of friends will come at your door.
Be powerful at your job it will result good this month.
... etc, etc.

I have:

  • 21,885 unique sentences
  • 1,523 unique words

How have I prepared my data?

I’ve sorted the unique words in alphabetical order and saved them into a 1,523-length array.

For example:

[a, at, be, by, come, door, good, job, lot, ....] 

With this array built, I assign a numeric value to every word in my data: its index in the array.

For example:

Hello, my name is Carlos would be encoded as [54, 504, 492, 394, 100, 150]

supposing that in my dictionary Hello=54, ,=504, my=492, name=394, is=100, and Carlos=150.
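In code, the dictionary build and the encoding look roughly like this (a simplified sketch: sentences is my array of 21,885 strings, and the tokenizer here is cruder than my real one, which also splits punctuation and handles the {Name}-style placeholders):

// Build the sorted, deduplicated word list and a word -> index lookup.
const tokenize = (sentence) => sentence.split(/\s+/);
const vocab = [...new Set(sentences.flatMap(tokenize))].sort();
const wordToIndex = new Map(vocab.map((word, i) => [word, i]));

// Encode a sentence as an array of dictionary indices.
const encode = (sentence) => tokenize(sentence).map((w) => wordToIndex.get(w));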

Why not use a pre-trained model that completes sentences? Because my data contains special words that don’t exist in standard dictionaries, as well as unusual names.

Creating my inputs and outputs

Having tokenized my words, I’ve come up with a way to make same-length inputs for my model by grouping the word arrays into N-length entries.

I’ve decided to split my sentences into 4-item input arrays, each with a 2-item output array. For example:

Hello, my name is {Name} and I like apples.

will result in:

Input                                Output

["Hello", ",", "name", "is"]         ["{Name}", "and"]
[",", "name", "is", "{Name}"]        ["and", "I"]
["name", "is", "{Name}", "and"]      ["I", "like"]
["is", "{Name}", "and", "I"]         ["like", "apples"]
["{Name}", "and", "I", "like"]       ["apples", "."]
["and", "I", "like", "apples"]       [".", "."]

Note: once the sentence has ended I pad the output with “.”, so the algorithm learns where a sentence ends.

Then I repeat these steps for all the 21,885 sentences I have.
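In code, the windowing step looks roughly like this (a simplified sketch, reusing the tokenize function from the sketch above):

const INPUT_LEN = 4;
const OUTPUT_LEN = 2;

// Slide a 4-token window over the sentence and take the following 2 tokens
// as the output, padding with "." so the last window has a full-size output.
function makeWindows(tokens) {
  const padded = [...tokens, ...Array(OUTPUT_LEN - 1).fill('.')];
  const pairs = [];
  for (let i = 0; i + INPUT_LEN + OUTPUT_LEN <= padded.length; i++) {
    pairs.push({
      input: padded.slice(i, i + INPUT_LEN),
      output: padded.slice(i + INPUT_LEN, i + INPUT_LEN + OUTPUT_LEN),
    });
  }
  return pairs;
}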

And finally I substitute each word with its index in my word list, so the real input/output looks like this:

Input                                Output

[400,500,390,293]                    [303, 442]

Training

In this example, my data consists of 4-length input arrays and 2-length output arrays.

The idea is to train a model that, given 4 words, can predict the next 2. (I used a 4-length input array for this example, but I’ve tried multiple input lengths.)

Layers and model used so far

const tf = require('@tensorflow/tfjs-node'); // needed for file:// saving

const model = tf.sequential();
model.add(tf.layers.dense({units: 100, inputShape: [4]}));
model.add(tf.layers.activation({activation: 'softmax'}));
model.add(tf.layers.dense({units: 2}));
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.sgd(0.001), metrics: ['accuracy']});

// 70% of the data used for training
const xs = tf.tensor2d(inputs, [inputs.length, inputs[0].length]);
const ys = tf.tensor2d(outputs, [outputs.length, outputs[0].length]);

// 20% of the data used for validation
const xsVal = tf.tensor2d(inputsVal, [inputsVal.length, inputsVal[0].length]);
const ysVal = tf.tensor2d(outputsVal, [outputsVal.length, outputsVal[0].length]);

model.fit(xs, ys, {epochs: 100, batchSize: 64, validationData: [xsVal, ysVal]}).then(async () => {
  const saveResult = await model.save('file://modelo2');
});
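(The 70/20 split itself isn’t shown above; it’s just a slice over the shuffled window pairs, roughly like this, where allInputs and allOutputs hold every encoded pair:)

// Shuffle inputs and outputs in sync, then slice off 70% / 20%.
tf.util.shuffleCombo(allInputs, allOutputs);
const nTrain = Math.floor(allInputs.length * 0.7);
const nVal = Math.floor(allInputs.length * 0.2);
const inputs = allInputs.slice(0, nTrain);
const outputs = allOutputs.slice(0, nTrain);
const inputsVal = allInputs.slice(nTrain, nTrain + nVal);
const outputsVal = allOutputs.slice(nTrain, nTrain + nVal);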

But I cannot get it to train correctly: it gets stuck, reporting the same loss value at every step.

I’ve also tried changing the learning rate, but I keep hitting the same plateau, which makes me think I’m not preparing my data correctly or I’m using the wrong technique.

I’ve also tried switching the loss to meanSquaredError, but I get the same problem.

Second approach

Instead of encoding the output as the index into my word list (e.g. Hello=54), I’ve tried another approach: predicting only the next word, encoded as a 1,523-length one-hot array [0,0,0,...,1,...,0] with a 1 at the word’s index in my word list.

For example, to encode the word website against the list ["car", "apple", "website", "orange", "red"], the output would be the array [0, 0, 1, 0, 0].

So now my data looks like:

Input                                Output

[400,500,390,293]                    [0,0,0,0,0,0,0,0,0,0,1]
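(For reference, these one-hot targets can also be built with tf.oneHot instead of by hand; the indices below are made up:)

// One-hot encode next-word indices over the 1,523-word vocabulary.
const VOCAB_SIZE = 1523;
const nextWordIndices = [303, 7, 42]; // one index per training pair (made up)
const ysOneHot = tf.oneHot(tf.tensor1d(nextWordIndices, 'int32'), VOCAB_SIZE);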

But this approach also fails to train.

What am I missing or doing wrong?

I think I need something like an LSTM, but I’ve never trained one, and I’m lost about how to prepare the data for this problem and how to set up the layers and model.

Can anyone suggest a better approach for this?

Thanks in advance.

Note: All data and tokens for this example have been made up to explain the problem.


@cabada, I don’t know whether your model architecture can get good results.
I’d suggest you look into this tutorial: Text generation with an RNN  |  TensorFlow

The tutorial uses character-level prediction, but I think you can adapt it to word prediction.


You’re on the right track with your sequences. I try to think of sequence training as “given the last n elements, tell me what element n+1 should be”. I’m not sure how having two outputs affects the training, but I don’t see any problem with it; you’re just predicting the next two elements instead of the next one.

You’re right about wanting to try an LSTM layer (or a GRU; they seem to perform similar functions), which considers not just the current input but also the ones before it. The problem with using dense layers for this is that they don’t consider ordering, so they take the words as an unordered collection rather than as a sequence.
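As a rough, untested sketch of what that could look like in TensorFlow.js (an embedding feeding an LSTM, predicting a single next word as a softmax over your 1,523-word vocabulary, with one-hot targets like your second approach):

const vocabSize = 1523;
const seqLen = 4;

const model = tf.sequential();
// Turn word indices into dense vectors the LSTM can work with.
model.add(tf.layers.embedding({inputDim: vocabSize, outputDim: 64, inputLength: seqLen}));
// The LSTM reads the 4 embeddings in order, so word order matters.
model.add(tf.layers.lstm({units: 128}));
// Softmax over the whole vocabulary picks the most likely next word.
model.add(tf.layers.dense({units: vocabSize, activation: 'softmax'}));
model.compile({
  loss: 'categoricalCrossentropy',
  optimizer: tf.train.adam(0.001),
  metrics: ['accuracy'],
});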
