Temporal Attention on LSTM Layer?

Nafees · November 5, 2021, 2:54am

I want to implement attention on top of an LSTM layer. Here is the detailed description:

We analyze the 8 hidden states of the LSTM that represent the embeddings for the different parts of an input frame. We consider the first 7 hidden states as the historical temporal context and learn 7 weights corresponding to these states:

past_context = [h1; h2; h3; ...; h8]   (1)

current = h8   (2)

transformed_context = tanh(W1 × past_context + b1)   (3)

weights = softmax(W2 × transformed_context + b2)   (4)

final_embedding = past_context × weights + current   (5)

b1 and b2 denote the biases in the two linear layers, and W1 and
W2 represent the 2D matrices in the linear layers. We initially
apply a linear transformation followed by a tanh nonlinearity
transforming each of these seven vectors of size 128 into seven
new vectors of size 128 (Eq. 3). Another linear transformation
converts these 8 vectors each to size 1 essentially giving us
scores for each of the hidden states. These scores are then
passed through a softmax to give the final set of weights (Eq.
4). These weights are used to calculate a weighted sum of all
the 8 hidden states to give the final embedding for the past
context. This past context is added to the last hidden state
to give the final embedding for the input frame (Eq. 5). This
final embedding is used for classification.
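To make the shapes concrete, here is a minimal NumPy sketch of Eqs. 1-5 for a single sequence, as I read them. It is only an illustration under assumptions: the function name temporal_attention is made up, random matrices stand in for the learned W1, b1, W2, b2, and all 8 hidden states are weighted (the description mentions both 7 and 8).

```python
import numpy as np

def temporal_attention(hidden_states, W1, b1, W2, b2):
    """Sketch of Eqs. 1-5 for one sequence; hidden_states has shape (T, d)."""
    current = hidden_states[-1]                        # Eq. 2: last hidden state, shape (d,)
    transformed = np.tanh(hidden_states @ W1 + b1)     # Eq. 3: shape (T, d)
    scores = (transformed @ W2 + b2).squeeze(-1)       # one scalar score per state, shape (T,)
    scores = scores - scores.max()                     # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()    # Eq. 4: softmax over the T states
    context = weights @ hidden_states                  # weighted sum of hidden states, shape (d,)
    return context + current, weights                  # Eq. 5

# Placeholder values (assumptions): 8 time steps, hidden size 128, random parameters.
rng = np.random.default_rng(0)
T, d = 8, 128
H = rng.standard_normal((T, d))          # Eq. 1: stacked hidden states h1..h8
W1 = rng.standard_normal((d, d)) * 0.1
b1 = np.zeros(d)
W2 = rng.standard_normal((d, 1)) * 0.1   # maps each 128-vector to a single score
b2 = np.zeros(1)

embedding, weights = temporal_attention(H, W1, b1, W2, b2)
print(embedding.shape, weights.shape)    # (128,) (8,)
```

In this reading, the scores come only from the second linear layer applied to each transformed state, and the softmax runs across the time steps; the last hidden state enters only through the final addition.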

Please check whether my code matches this description. Is it correct?

from tensorflow.keras.layers import Input, Dense, Lambda, Dot, Activation, Concatenate
from tensorflow.keras.layers import Layer
import tensorflow as tf


def attention(lstm_hidden_status):  # Tensor("lstm_1/transpose_1:0", shape=(?, 8, 128), dtype=float32)
    hidden_size = lstm_hidden_status.get_shape().as_list()  # get all dimensions as a list
    hidden_size = int(hidden_size[2]) # 128
    # feed to a feed-forward network
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(lstm_hidden_status) # Tensor("last_hidden_state/strided_slice:0", shape=(?, 128), dtype=float32)
    transformed_context = Dense(hidden_size, use_bias=True, activation='tanh', name='transformed_context_vec')(
        lstm_hidden_status) # Tensor("transformed_context_vec/Tanh:0", shape=(?, 8, 128), dtype=float32)

    score = Dot(axes=[1, 2], name='attention_score')([h_t, transformed_context]) # Tensor("attention_score/Squeeze:0", shape=(?, 8), dtype=float32)
    attention_weights = Dense(8, use_bias=True, activation='softmax', name='attention_weight')(score) # Tensor("attention_weight/Softmax:0", shape=(?, 8), dtype=float32)
    context_vector = Dot(axes=[1, 1], name='context_vector')([lstm_hidden_status, attention_weights]) # Tensor("context_vector/Squeeze:0", shape=(?, 128), dtype=float32)
    new_context_vector = context_vector + h_t # Tensor("add:0", shape=(?, 128), dtype=float32)
    return new_context_vector

Specifically, I am confused by the line score = Dot(axes=[1, 2], name='attention_score')([h_t, transformed_context]). Why are we taking a dot product here? The debug output is attached as a comment on each line.