How to apply Attention to BLSTM?

I’ve scoured various tutorials on applying attention to an LSTM, but they either implement custom attention layers or use Keras/TF’s classes with examples that don’t relate to my study. I’m making a Bidirectional LSTM with Attention model for time series forecasting. After various attempts I’ve landed on the following setup, which still has a few problems that need ironing out:

# Assumed imports: standard Keras layers plus the tfa.seq2seq attention classes
from tensorflow.keras.layers import (
    Bidirectional, Dense, Input, LayerNormalization, LSTMCell, RNN,
)
from tensorflow_addons.seq2seq import (
    AttentionWrapper, BahdanauAttention, LuongAttention,
)

input = Input(batch_input_shape=self.batch_input_shape)
output = LayerNormalization(name="Normalize", epsilon=1e-7)(input)

cell = LSTMCell(
    name="LSTM",
    units=self.lstm_units,
    activation=self.activation,
    recurrent_activation=self.recurrent_activation,
    recurrent_regularizer=self.lstm_recurrent_regularizer,
    kernel_regularizer=self.lstm_kernel_regularizer,
    bias_regularizer=self.lstm_bias_regularizer,
    activity_regularizer=self.lstm_activity_regularizer,
    dropout=self.dropout,
    recurrent_dropout=self.recurrent_dropout,
)

# memory here is the normalized input sequence (see question 1 below)
if self.attention_type == "bahdanau":
    mechanism = BahdanauAttention(units=self.attention_units, memory=output)
else:
    mechanism = LuongAttention(units=self.attention_units, memory=output)

cell = AttentionWrapper(cell, mechanism, name="Attention", output_attention=False)
layer = RNN(cell, stateful=self.stateful, return_sequences=True)
output = Bidirectional(layer, name="Bidirectional")(output)
output = Dense(1, name="Reduce")(output)
  1. Why do the attention mechanisms require the memory argument? According to the docs, it’s optional. The way I’ve set this up, it receives the normalized inputs whereas it should be receiving the LSTM hidden layer inputs, no?
  2. The current error I’m receiving is:

TypeError: To be compatible with tf.eager.defun, Python functions must return zero or more Tensors; in compilation of <function while_loop..wrapped_body at 0x7f6ceb5386a8>, found return value of type <class 'keras.engine.keras_tensor.KerasTensor'>, which is not a Tensor.

  3. Is this even the optimal flow? Input > Normalize > LSTM > Attention > Bidirectional > Dense
  4. I’m using BahdanauAttention/LuongAttention with AttentionWrapper solely because of a TensorFlow example, but if there’s a way of using Keras’ Attention class in a simpler fashion I’d be happy to learn how that works.

There’s an option to use this layer by passing the memory to call instead of init.
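For instance, the mechanism can be built without a memory and given one later via setup_memory (a minimal sketch with placeholder tensors, not code from your model):

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Placeholder sequence to attend over: [batch, time, features]
encoder_outputs = tf.random.normal([2, 10, 16])

mechanism = tfa.seq2seq.LuongAttention(units=16)  # no memory at construction
mechanism.setup_memory(encoder_outputs)           # memory supplied afterwards
```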

> The way I’ve set this up, it receives the normalized inputs whereas it should be receiving the LSTM hidden layer inputs, no?

IIUC, the AttentionWrapper is implementing something similar to this decoder, but wrapped up in the Keras RNN API.

“LSTM hidden layer inputs” is ambiguous to me. So I’m not 100% sure what you’re trying to do.

Remember that the AttentionWrapper architecture was designed for sequence-to-sequence problems, and here it looks like you’re just using a single input sequence.

Looking at the code for AttentionWrapper, I see it’s doing what I expect: at each time step the output from the LSTM is returned (output_attention=False) and then used as the query for the attention. The output of the attention is concatenated with the input at the next time step.

So the LSTM is seeing the input sequence both from the current time’s direct input and its attention over the whole input sequence.
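Roughly, the per-step pattern looks like the following toy illustration (plain Keras layers standing in for the tfa internals; the shapes are arbitrary and this is only a sketch of the idea, not the wrapper's actual code):

```python
import tensorflow as tf

units, features, time = 8, 4, 10
lstm_cell = tf.keras.layers.LSTMCell(units)
attention = tf.keras.layers.AdditiveAttention()  # Bahdanau-style scoring
memory = tf.random.normal([1, time, units])      # the sequence being attended over

def wrapped_step(x_t, cell_state, prev_context):
    # Previous attention context is concatenated onto this step's input...
    cell_inputs = tf.concat([x_t, prev_context], axis=-1)
    cell_output, cell_state = lstm_cell(cell_inputs, cell_state)
    # ...and the LSTM output becomes the query for attention over the memory.
    query = cell_output[:, tf.newaxis, :]
    context = attention([query, memory])[:, 0, :]
    # With output_attention=False the wrapper emits the LSTM output, not the context.
    return cell_output, cell_state, context

x_t = tf.random.normal([1, features])
state = lstm_cell.get_initial_state(batch_size=1, dtype=tf.float32)
context = tf.zeros([1, units])
out, state, context = wrapped_step(x_t, state, context)
```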

> TypeError: To be compatible with tf.eager.defun… <function while_loop… found return value of type <class 'keras.engine.keras_tensor.KerasTensor'>, which is not a Tensor.

I think what this is trying to tell you is that you can’t use one of those RNN components with the Keras functional API. Which line throws that error? Is it the RNN line?

Try building it as a Layer subclass instead of as a Keras functional model.
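Something along these lines, as a rough, untested sketch (it assumes the tfa.seq2seq classes, fixed-length inputs, and the self.* hyperparameters replaced by constructor arguments; the Bidirectional wrapper is left out to keep it short):

```python
import tensorflow as tf
import tensorflow_addons as tfa

class AttentiveLSTM(tf.keras.layers.Layer):
    """Sketch: LayerNormalization -> RNN(AttentionWrapper(LSTMCell)) -> Dense."""

    def __init__(self, lstm_units, attention_units, **kwargs):
        super().__init__(**kwargs)
        self.norm = tf.keras.layers.LayerNormalization(epsilon=1e-7)
        self.mechanism = tfa.seq2seq.BahdanauAttention(units=attention_units)
        cell = tfa.seq2seq.AttentionWrapper(
            tf.keras.layers.LSTMCell(lstm_units),
            self.mechanism,
            output_attention=False,
        )
        self.rnn = tf.keras.layers.RNN(cell, return_sequences=True)
        self.reduce = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.norm(inputs)
        # The memory is set from a concrete tensor here, inside call(),
        # rather than from a functional-API KerasTensor.
        self.mechanism.setup_memory(x)
        return self.reduce(self.rnn(x))
```

A Bidirectional version would need a separate wrapped cell (and mechanism) per direction, since each direction carries its own attention state.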

> 3. Is this even the optimal flow? Input > Normalize > LSTM > Attention > Bidirectional > Dense
> 4. I’m using BahdanauAttention/LuongAttention with AttentionWrapper solely because of a TensorFlow example, but if there’s a way of using Keras’ Attention class in a simpler fashion I’d be happy to learn how that works.

Transformers work by applying attention layers without the RNNs. You can definitely stack Transformer and LSTM layers. Maybe that’s more in line with what you’re trying to do.

This tutorial implements the transformer layer from scratch:

But I should update it to use the new MultiHeadAttention layer from Keras:

https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention?version=nightly
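As a rough sketch of that direction (illustrative shapes and hyperparameters, not the tutorial's code): self-attention from tf.keras.layers.MultiHeadAttention applied on top of the BiLSTM outputs, with no tfa classes involved.

```python
import tensorflow as tf

# Illustrative shapes: 30 time steps, 1 feature, arbitrary layer sizes.
inputs = tf.keras.Input(shape=(30, 1))
x = tf.keras.layers.LayerNormalization(epsilon=1e-7)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)

# Self-attention: the BiLSTM output sequence attends over itself.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
x = tf.keras.layers.Add()([x, attn])           # residual connection
x = tf.keras.layers.LayerNormalization()(x)

outputs = tf.keras.layers.Dense(1)(x)          # one prediction per time step
model = tf.keras.Model(inputs, outputs)
```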