I have a dataset of some 150k records with two columns: a column of text in the Igbo language and a column with the sentiment of the text (-1=negative, 0=neutral, 1=positive).
From my research (and I'm not sure I did it well), it looked like an LSTM (or bidirectional LSTM) would be my best choice. I followed a few tutorials and this is what I came up with:
```python
from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, Dense
from tensorflow.keras.models import Model

def LSTM_model():
    inp = Input(shape=(MAXLEN,))
    # Input length is already fixed by Input(shape=(MAXLEN,)),
    # so input_length on Embedding is not needed.
    x = Embedding(vocab_size, 20, trainable=True)(inp)
    x = Dropout(0.25)(x)
    # return_sequences=True keeps the full 3D sequence output so a
    # second LSTM can be stacked on top; without it the second LSTM
    # fails with a shape error on the 2D output.
    x = LSTM(512, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)(x)
    x = Dense(32, activation='relu')(x)
    x = Dropout(0.25)(x)
    x = LSTM(256, dropout=0.2, recurrent_dropout=0.2)(x)
    x = Dense(32, activation='relu')(x)
    x = Dense(3, activation='softmax')(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['sparse_categorical_accuracy'])
    return model
```
Before that, I tokenized the text, converted it to sequences, and applied padding. I used the sentiment column as my label column. vocab_size is just the length of the tokenizer's word_index + 1. I used the Adam optimizer and a sparse_categorical_crossentropy loss (because I have three categories and they are integer labels). Am I using the correct loss?
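For reference, here is a minimal sketch of that preprocessing (the toy `texts` list is a hypothetical stand-in for the real 150k-row Igbo dataset). Note that sparse_categorical_crossentropy with a 3-way softmax expects class indices 0..2, so the -1/0/1 labels are shifted by one here:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy data standing in for the real dataset.
texts = ["nnoo", "kedu ka i mere", "o di mma"]
sentiments = np.array([-1, 0, 1])

MAXLEN = 10

# Fit the tokenizer on the training text and build the vocabulary.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

# Convert text to integer sequences and pad them to a fixed length.
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=MAXLEN)

# sparse_categorical_crossentropy expects non-negative integer class
# indices, so shift the -1/0/1 sentiment labels to 0/1/2.
y = sentiments + 1
```

With this, `X` has shape `(num_samples, MAXLEN)` and `y` contains class indices in `{0, 1, 2}`, which matches the `Dense(3, activation='softmax')` output head.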
Am I on the right track here? This is just a test function; I will probably need more layers, etc., but I just wanted to see if this model would fit my use case (sentiment analysis).