How to apply Attention to BLSTM?

I’ve scoured various tutorials on applying attention to an LSTM, but they either implement custom attention layers or use Keras/TF’s classes with examples that don’t relate to my study. I’m making a Bidirectional LSTM with Attention model for time series forecasting. After various attempts I’ve landed on the following setup, which still has a few problems that need ironing out:

# Assumed imports: standard Keras layers plus the tfa.seq2seq attention classes
from tensorflow.keras.layers import (
    Bidirectional, Dense, Input, LayerNormalization, LSTMCell, RNN,
)
from tensorflow_addons.seq2seq import (
    AttentionWrapper, BahdanauAttention, LuongAttention,
)

input = Input(batch_input_shape=self.batch_input_shape)
output = LayerNormalization(name="Normalize", epsilon=1e-7)(input)

cell = LSTMCell(
    name="LSTM",
    units=self.lstm_units,
    activation=self.activation,
    recurrent_activation=self.recurrent_activation,
    recurrent_regularizer=self.lstm_recurrent_regularizer,
    kernel_regularizer=self.lstm_kernel_regularizer,
    bias_regularizer=self.lstm_bias_regularizer,
    activity_regularizer=self.lstm_activity_regularizer,
    dropout=self.dropout,
    recurrent_dropout=self.recurrent_dropout,
)

# memory here is the normalized input sequence (see question 1 below)
if self.attention_type == "bahdanau":
    mechanism = BahdanauAttention(units=self.attention_units, memory=output)
else:
    mechanism = LuongAttention(units=self.attention_units, memory=output)

cell = AttentionWrapper(cell, mechanism, name="Attention", output_attention=False)
layer = RNN(cell, stateful=self.stateful, return_sequences=True)
output = Bidirectional(layer, name="Bidirectional")(output)
output = Dense(1, name="Reduce")(output)
  1. Why do the attention mechanisms require the memory argument? According to the docs, it’s optional. The way I’ve set this up, it receives the normalized inputs whereas it should be receiving the LSTM hidden layer inputs, no?
  2. The current error I’m receiving is:

TypeError: To be compatible with tf.eager.defun, Python functions must return zero or more Tensors; in compilation of <function while_loop..wrapped_body at 0x7f6ceb5386a8>, found return value of type <class 'keras.engine.keras_tensor.KerasTensor'>, which is not a Tensor.

  3. Is this even the optimal flow? Input > Normalize > LSTM > Attention > Bidirectional > Dense
  4. I’m using BahdanauAttention/LuongAttention with AttentionWrapper solely because of a TensorFlow example, but if there’s a way of using Keras’ Attention class in a simpler fashion I’d be happy to learn how that works.

There’s an option to use this layer by passing the memory to call instead of init.
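For instance, the mechanism can be built without a memory and given one later via setup_memory (a minimal sketch with placeholder tensors, not code from your model):

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Placeholder sequence to attend over: [batch, time, features]
encoder_outputs = tf.random.normal([2, 10, 16])

mechanism = tfa.seq2seq.LuongAttention(units=16)  # no memory at construction
mechanism.setup_memory(encoder_outputs)           # memory supplied afterwards
```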

> The way I’ve set this up, it receives the normalized inputs whereas it should be receiving the LSTM hidden layer inputs, no?

IIUC, the AttentionWrapper is implementing something similar to this decoder, but wrapped up in the Keras RNN API.

“LSTM hidden layer inputs” is ambiguous to me. So I’m not 100% sure what you’re trying to do.

Remember that the AttentionWrapper architecture was designed for sequence-to-sequence problems, and here it looks like you’re just using a single input sequence.

Looking at the code for AttentionWrapper, I see it’s doing what I expect: at each time step the output from the LSTM is returned (output_attention=False) and then used as the query for the attention. The output of the attention is concatenated with the input at the next time step.

So the LSTM is seeing the input sequence both from the current time’s direct input and its attention over the whole input sequence.
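Roughly, the per-step pattern looks like the following toy illustration (plain Keras layers standing in for the tfa internals; the shapes are arbitrary and this is only a sketch of the idea, not the wrapper's actual code):

```python
import tensorflow as tf

units, features, time = 8, 4, 10
lstm_cell = tf.keras.layers.LSTMCell(units)
attention = tf.keras.layers.AdditiveAttention()  # Bahdanau-style scoring
memory = tf.random.normal([1, time, units])      # the sequence being attended over

def wrapped_step(x_t, cell_state, prev_context):
    # Previous attention context is concatenated onto this step's input...
    cell_inputs = tf.concat([x_t, prev_context], axis=-1)
    cell_output, cell_state = lstm_cell(cell_inputs, cell_state)
    # ...and the LSTM output becomes the query for attention over the memory.
    query = cell_output[:, tf.newaxis, :]
    context = attention([query, memory])[:, 0, :]
    # With output_attention=False the wrapper emits the LSTM output, not the context.
    return cell_output, cell_state, context

x_t = tf.random.normal([1, features])
state = lstm_cell.get_initial_state(batch_size=1, dtype=tf.float32)
context = tf.zeros([1, units])
out, state, context = wrapped_step(x_t, state, context)
```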

> TypeError: To be compatible with tf.eager.defun… <function while_loop… found return value of type <class 'keras.engine.keras_tensor.KerasTensor'>, which is not a Tensor.

I think what this is trying to tell you is that you can’t use one of those RNN components with the Keras functional API. Which line throws that error? Is it the RNN line?

Try building it as a Layer subclass instead of as a Keras functional model.
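Something along these lines, as a rough, untested sketch (it assumes the tfa.seq2seq classes, fixed-length inputs, and the self.* hyperparameters replaced by constructor arguments; the Bidirectional wrapper is left out to keep it short):

```python
import tensorflow as tf
import tensorflow_addons as tfa

class AttentiveLSTM(tf.keras.layers.Layer):
    """Sketch: LayerNormalization -> RNN(AttentionWrapper(LSTMCell)) -> Dense."""

    def __init__(self, lstm_units, attention_units, **kwargs):
        super().__init__(**kwargs)
        self.norm = tf.keras.layers.LayerNormalization(epsilon=1e-7)
        self.mechanism = tfa.seq2seq.BahdanauAttention(units=attention_units)
        cell = tfa.seq2seq.AttentionWrapper(
            tf.keras.layers.LSTMCell(lstm_units),
            self.mechanism,
            output_attention=False,
        )
        self.rnn = tf.keras.layers.RNN(cell, return_sequences=True)
        self.reduce = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.norm(inputs)
        # The memory is set from a concrete tensor here, inside call(),
        # rather than from a functional-API KerasTensor.
        self.mechanism.setup_memory(x)
        return self.reduce(self.rnn(x))
```

A Bidirectional version would need a separate wrapped cell (and mechanism) per direction, since each direction carries its own attention state.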

> 3. Is this even the optimal flow? Input > Normalize > LSTM > Attention > Bidirectional > Dense
> 4. I’m using BahdanauAttention/LuongAttention with AttentionWrapper solely because of a TensorFlow example, but if there’s a way of using Keras’ Attention class in a simpler fashion I’d be happy to learn how that works.

Transformers work by applying attention layers without the RNNs. You can definitely stack Transformer and LSTM layers. Maybe that’s more in line with what you’re trying to do.

This tutorial implements the transformer layer from scratch:

But I should update it to use the new MultiHeadAttention layer from Keras:

https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention?version=nightly
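As a rough sketch of that direction (illustrative shapes and hyperparameters, not the tutorial's code): self-attention from tf.keras.layers.MultiHeadAttention applied on top of the BiLSTM outputs, with no tfa classes involved.

```python
import tensorflow as tf

# Illustrative shapes: 30 time steps, 1 feature, arbitrary layer sizes.
inputs = tf.keras.Input(shape=(30, 1))
x = tf.keras.layers.LayerNormalization(epsilon=1e-7)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)

# Self-attention: the BiLSTM output sequence attends over itself.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
x = tf.keras.layers.Add()([x, attn])           # residual connection
x = tf.keras.layers.LayerNormalization()(x)

outputs = tf.keras.layers.Dense(1)(x)          # one prediction per time step
model = tf.keras.Model(inputs, outputs)
```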