LSTM model for sentiment analysis


I have a dataset of some 150k records with two columns: a column with text in the Igbo language and a column with the sentiment of the text (-1 = negative, 0 = neutral, 1 = positive).

From my research, and I’m not sure if I did it well, I saw that using an LSTM (or Bidirectional LSTM) would be my best choice. I followed a few tutorials and this is what I came up with:

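from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Dense
from tensorflow.keras.models import Model

def LSTM_model():
    inp = Input(shape=(MAXLEN, ))
    x = Embedding(vocab_size, 20, input_length=MAXLEN, trainable=True)(inp)
    x = Dropout(0.25)(x)
    # return_sequences=True so that the second LSTM below still receives a sequence
    x = LSTM(512, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)(x)
    x = Dense(32, activation='relu')(x)
    x = Dropout(0.25)(x)
    x = LSTM(256, dropout=0.2, recurrent_dropout=0.2)(x)
    x = Dense(32, activation='relu')(x)
    x = Dense(3, activation='softmax')(x)
    model = Model(inputs=inp, outputs=x)
    return model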

Before that, I tokenized the text, converted it to sequences and applied padding. I used the sentiment column as my label column. vocab_size is just the length of the tokenizer’s word_index + 1. I used the Adam optimizer and a sparse_categorical_crossentropy loss (because I have three categories and they are integers). Am I using the correct loss?

Am I on the right track here? This is just a test function; I will probably need more layers etc., but I just wanted to see if this model would fit my use case (sentiment analysis).

Hi Callum,

I think you’re heading in the right direction, but I’d suggest you try this tutorial: Text classification with an RNN | TensorFlow

It will give you some insights, and maybe you can even adapt your model a little bit.

Thank you for the reply.

I have a couple of questions I hope you or someone else can help me with. After a bit more research, this is what I have:

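from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Model

def LSTM_model():

    inp = Input(shape=(MAXLEN,)) # check here
    # second argument is output_dim, the size of each word vector
    x = Embedding(vocab_size, X_text.shape[1], trainable=False, mask_zero=True)(inp)
    x = Dropout(0.25)(x)
    x = Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))(x)
    x = Bidirectional(LSTM(32))(x)
    x = Dense(16, activation='relu')(x)
    x = Dropout(0.2)(x)
    x = Dense(3)(x)  # no softmax: outputs are logits, so the loss needs from_logits=True

    model = Model(inputs=inp, outputs=x)

    return model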

1- What does the Embedding layer usually take? At the moment, it takes vocab_size (len of tokenizer.word_index), the second dimension of the vectorized text column (texts_to_sequences and padding applied), and a couple more arguments.

Now I’ve seen several variations where an input_length (in some equal to MAXLEN, in others equal to the shape of the vectorized text column) is added to the Embedding layer, other variations that have MAXLEN (the maximum number of words in one sentence in my column) as the output_dim, etc. It’s a bit confusing and I’m not sure which is correct…

2- The first input layer takes (MAXLEN,) as shape. Is that correct?

3- Are there any improvements someone can suggest?

Before passing data to the model, you converted the texts to tokens, cut long sequences to the maximum length and padded short sequences to the same length, making all texts equally sized. That’s why the Input layer should have a shape of (MAXLEN,).
The Embedding layer takes sequences of tokens (meaning sequences of integers, where each integer represents a word) and converts them into multi-dimensional vector representations. The first argument is the number of words in the dictionary created by the tokenizer (including the out-of-vocabulary token); it has to be at least the largest token index plus 1.
The second positional argument of the Embedding layer defines how many dimensions each vector has. It is a free choice and does not depend on MAXLEN. You can increase this number if the model does not perform well, or decrease it if the task is relatively simple.
Unless you load pretrained vectors, the Embedding layer should be trainable, so that it can learn how to transform tokens into vectors.
In a classification task, numeric labels normally start at 0 and increase in steps of 1. Sparse categorical crossentropy expects class indices from 0 to 2 for three classes, so a label of -1 is out of range; labels of -1, 0 and 1 look more like a regression target.
I’m not sure that the first Dropout layer before the LSTM is necessary. You can lose some meaningful data there.
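
To make the shapes in the points above concrete, here is a library-free sketch of what an embedding layer does: it looks up one row of a table per token index. The sizes (VOCAB_SIZE, OUTPUT_DIM, MAXLEN) and the random values are made up for illustration; in Keras the table is a weight of the layer rather than a Python list.

```python
import random

random.seed(0)

VOCAB_SIZE = 6   # assumed: largest token index + 1 (index 0 is padding)
OUTPUT_DIM = 4   # assumed: length of each word vector, a free hyperparameter
MAXLEN = 5       # assumed: padded sequence length

# One row of OUTPUT_DIM floats per token index.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(OUTPUT_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(batch):
    """Map a batch of token-index sequences to sequences of vectors."""
    return [[embedding_table[token] for token in seq] for seq in batch]

# Two padded sequences of length MAXLEN (0 = padding).
batch = [[0, 0, 3, 1, 5], [0, 2, 2, 4, 1]]
vectors = embed(batch)

# The shape goes from (batch, MAXLEN) to (batch, MAXLEN, OUTPUT_DIM).
shape = (len(vectors), len(vectors[0]), len(vectors[0][0]))
print(shape)
```

The table has VOCAB_SIZE rows and OUTPUT_DIM columns, and neither number is tied to MAXLEN, which is why the padded width of the text matrix is not needed as the second Embedding argument.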


Thank you so much for the detailed answer.

I relabeled my text to have negative sentiment as 2 instead of -1, which should be fine, right?

A couple more questions, if I may:

1- What does having input_length in the Embedding layer do in my case? Is it necessary or would it be redundant? The documentation says it’s only necessary if I have a Flatten layer after the Embedding layer, which I’m assuming I don’t need here?

2- Is the SparseCategoricalCrossentropy I used here correct? I’m having a bit of difficulty choosing this over CategoricalCrossentropy.

3- Would BatchNormalization help here, and where would it fit? I’ve seen examples where it sits just before the first BiLSTM layer.

Defining the vocabulary size and the output dimension is required in an Embedding layer. Other arguments, including input_length, are optional. You can experiment and see if they make a difference in your case.
If the targets are represented by a one-hot encoded matrix, the loss is categorical crossentropy. If the targets are just one column of integer class labels, the loss is sparse categorical crossentropy.
BatchNormalization is most often used in convolutional neural networks. It is also used between a dense layer with linear activation and the following non-linear activation function.
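
The difference between the two losses is only in how the target is encoded. The sketch below works this out by hand for a single example, without Keras: the helper names, the label mapping (taken from the relabeling described earlier in the thread) and the made-up model output are all assumptions for illustration.

```python
import math

# Assumed mapping from the original labels to class indices 0..2.
LABEL_TO_CLASS = {-1: 2, 0: 0, 1: 1}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cce(class_index, probs):
    """Loss when the target is a single integer class index."""
    return -math.log(probs[class_index])

def cce(one_hot, probs):
    """Loss when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def to_one_hot(class_index, num_classes=3):
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]

probs = softmax([2.0, 0.5, -1.0])   # made-up model output for one text
target = LABEL_TO_CLASS[-1]         # a negative example becomes class 2

loss_sparse = sparse_cce(target, probs)
loss_onehot = cce(to_one_hot(target), probs)
print(loss_sparse, loss_onehot)
```

Both functions return the same value, so the choice depends only on whether the label column holds integers or one-hot rows. Note also that a model whose last layer is Dense(3) without softmax outputs logits, in which case the Keras loss has to be created with from_logits=True.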


Thank you again for the great answer, my model is improving bit by bit just from your input.

Last two questions, I promise:

1- I’ve seen LSTM models for sentiment analysis that have a Conv1D layer or two before the BiLSTM layers. How would that fit here? I was under the impression convolutional layers are used in image, video, audio, etc. classification/recognition not in text.

2- I’ve added l2 regularizers in both LSTM layers (kernel and bias). These help with overfitting and whatnot, but would adding them cause any loss of patterns or info? The same goes for kernel and bias constraints, which are still unclear to me. Would adding the regularizers and constraints together with the dropouts be overkill?

You asked about the BatchNormalization layer. It is more often used in image classification models in combination with convolutional layers. It can also be used after a dense layer and before the activation function, which in this case is a separate layer rather than an argument of the dense layer. I have not seen it applied to text classification. Perhaps someone else could comment here. Anyway, you can try it.
I think it’s best to start with a simple model without too many regularizers and then add dropouts and other methods one by one. This way you’ll see whether a particular technique makes the accuracy better or worse.
Or you can define all possible combinations of what you’d like to try and use KerasTuner to search for the best combination of hyper-parameters.
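
As a rough picture of what "all possible combinations" means, the sketch below enumerates a small grid in plain Python. It is not the KerasTuner API, which automates this kind of search with its own classes; the hyper-parameter names and values are hypothetical.

```python
from itertools import product

# Hypothetical search space; the names and values are illustrative only.
search_space = {
    "lstm_units": [32, 64],
    "dropout": [0.0, 0.2, 0.4],
    "l2": [0.0, 1e-4],
}

def all_combinations(space):
    """Return every combination of the listed values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]

configs = all_combinations(search_space)
baseline = configs[0]  # smallest model, no dropout, no L2

print(len(configs), "configurations; baseline:", baseline)
```

Starting from the baseline and changing one entry at a time matches the "add methods one by one" advice; training every entry in configs is the exhaustive alternative.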

I appreciate all the help. KerasTuner looks like something I will need to achieve the best results.

When I save the best model as .h5, for example, and then load it, would I need to tokenize the text input and convert it to sequences before predicting, or can the model just take the string input and return the scores/logits?

Sorry for hijacking the thread, but with the model you gave in the original question, you need to do the tokenization/preprocessing before passing text to the model, even after saving and reloading it.

The Embedding layer expects its input to be tokenized already, i.e. sequences of integer token indices.

I’ve encountered an odd issue: I save the model (in .h5 after early stopping and checkpoint), and load it using tf.keras.models.load_model() in another notebook session. When I give a list of strings to predict on (after fitting on text, tokenizing and converting to sequence), it returns the same output every time on very different sentences, then throws a retracing warning after a while.

What could be the problem here? When the model was first run and completed, I predicted a few texts and all was well. When I save it and then load it back, it doesn’t work.

You should either save the tokenizer in a separate file after you have created and fitted it the first time (fit_on_texts() for a Tokenizer, .adapt() for a TextVectorization layer), or include the tokenizer as a layer in your model, to ensure that the data is tokenized in exactly the same way every time you use the model.
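
The sketch below shows why this matters, using a toy tokenizer written for this example (ToyTokenizer is hypothetical and is not the Keras class). Fitting a fresh tokenizer on the prediction text assigns different indices to the same words, so the model receives inputs it was never trained on, whereas reloading the saved vocabulary reproduces the training indices.

```python
import json
from collections import Counter

class ToyTokenizer:
    """Hypothetical stand-in for a word-index tokenizer (not the Keras class)."""

    def __init__(self, word_index=None):
        self.word_index = dict(word_index or {})

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # The most frequent word gets index 1; index 0 is left for padding.
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        self.word_index = {w: i + 1 for i, w in enumerate(ranked)}

    def texts_to_sequences(self, texts):
        # Words that are not in the vocabulary are dropped.
        return [[self.word_index[w] for w in t.lower().split()
                 if w in self.word_index] for t in texts]

train_texts = ["good film", "bad film", "good good plot"]
new_texts = ["bad plot"]

trained = ToyTokenizer()
trained.fit_on_texts(train_texts)

# Problematic: fitting a fresh tokenizer on the text to predict on.
refit = ToyTokenizer()
refit.fit_on_texts(new_texts)

# Consistent: persist the vocabulary and reload it in the new session.
saved = json.dumps(trained.word_index)
restored = ToyTokenizer(json.loads(saved))

print(trained.texts_to_sequences(new_texts))
print(refit.texts_to_sequences(new_texts))
print(restored.texts_to_sequences(new_texts))
```

With the Keras Tokenizer, the same idea can be implemented by pickling the object, or with its to_json() method together with tokenizer_from_json() when loading.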

The TextVectorization layer, correct? So between the Input layer and the Embedding layer, I add the text vectorization layer like this:

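from tensorflow.keras.layers import TextVectorization

vectorize_layer = TextVectorization(
    max_tokens=vocab_size,  # number of words (len(word_index) + 1)
    output_mode='int',  # integer token indices (unrelated to the labels 0, 1, 2)
    output_sequence_length=MAXLEN,  # same as tokenizer
)
vectorize_layer.adapt(texts)  # texts: the raw training strings (name assumed)
x = vectorize_layer(inp)  # inp must be a string input, e.g. Input(shape=(1,), dtype='string')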

and this is how I set up my tokenizer:

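from tensorflow.keras.preprocessing import text, sequence

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(texts)  # texts: list of raw strings (renamed so it does not shadow the text module)
text_tokenized = tokenizer.texts_to_sequences(texts)
X_text = sequence.pad_sequences(text_tokenized, maxlen=MAXLEN)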

How can the padding step be included in the layer, or is there even a need? There’s a pad_to_max_tokens argument, but I don’t think it works for the ‘int’ output_mode.

Also, if I add that layer, do I have to tokenize, vectorize, etc. the text I want to predict or can I just pass the string and it does it through the layer?

EDIT: I pickled the tokenizer object, then loaded it back before prediction, and it worked. I would still prefer a more convenient method.

I don’t want to bump the thread too much, but as mentioned before, pickling the tokenizer object works when I load the model in another session. I just want to know if there’s a better way to do this, such as having a layer in the model definition that does all that without having to load the tokenizer every time.

According to the documentation, pad_to_max_tokens is not valid when the TextVectorization layer output mode is ‘int’. In that mode, however, output_sequence_length pads or truncates every sequence to a fixed length, so a separate call to tf.keras.preprocessing.sequence.pad_sequences should not be needed. Note that the layer pads at the end of the sequence, whereas pad_sequences pads at the front by default.
If you need padding behaviour that the layer does not offer, you could probably create a custom padding layer by subclassing keras.layers.Layer and doing the padding in the call method. I did not try it, but if it works, all the layers could be saved as a single model in one file.
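
One detail to watch when moving from pad_sequences to a vectorization layer is where the padding goes. As far as I can tell from the documentation, pad_sequences pads and truncates at the front by default, while TextVectorization with output_sequence_length pads and truncates at the end. The helper below is a plain-Python imitation of those two behaviours, written for this example; it is not either library function.

```python
def pad(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one sequence of token indices to exactly maxlen."""
    if len(seq) > maxlen:
        # "pre" drops tokens from the front, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + list(seq) if padding == "pre" else list(seq) + fill

MAXLEN = 5  # assumed padded length
short_seq = [7, 3]
long_seq = [1, 2, 3, 4, 5, 6, 7]

# Front padding/truncation (the pad_sequences defaults).
pre_short = pad(short_seq, MAXLEN)
pre_long = pad(long_seq, MAXLEN)

# End padding/truncation (what a fixed output_sequence_length does).
post_short = pad(short_seq, MAXLEN, padding="post", truncating="post")
post_long = pad(long_seq, MAXLEN, padding="post", truncating="post")

print(pre_short, post_short)
print(pre_long, post_long)
```

A model trained on front-padded sequences should be fed front-padded sequences at prediction time as well, so it is worth keeping the padding convention the same when switching preprocessing methods.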