**I want to implement attention on an LSTM layer. Here is the detailed description:**

We analyze the 8 hidden states of the LSTM, which represent the embeddings for the different parts of an input frame. We consider the first 7 hidden states as the historical temporal context and learn 7 weights corresponding to these states:

```
past_context        = [h1; h2; h3; ...; h8]                  (1)
current             = h8                                     (2)
transformed_context = tanh(W1 × past_context + b1)           (3)
weights             = softmax(W2 × transformed_context + b2) (4)
final_embedding     = past_context × weights + current       (5)
```

b1 and b2 denote the biases in the two linear layers, and W1 and W2 represent the 2D matrices in the linear layers. We initially apply a linear transformation followed by a tanh non-linearity, transforming each of these seven vectors of size 128 into seven new vectors of size 128 (Eq. 3). Another linear transformation converts these 8 vectors each to size 1, essentially giving us scores for each of the hidden states. These scores are then passed through a softmax to give the final set of weights (Eq. 4). These weights are used to calculate a weighted sum of all the 8 hidden states, which gives the final embedding for the past context. This past context is added to the last hidden state to give the final embedding for the input frame (Eq. 5). This final embedding is used for classification.
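To make the description concrete, here is a minimal Keras sketch of Eqs. (1)–(5) as I read them. This is an illustrative assumption, not the paper's reference code: the two `Dense` layers play the roles of (W1, b1) and (W2, b2), and the sizes (8 time steps, hidden size 128) follow the description.

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda, Softmax, Add
from tensorflow.keras.models import Model

hidden_states = Input(shape=(8, 128))                   # [h1, ..., h8], Eq. (1)
current = Lambda(lambda x: x[:, -1, :])(hidden_states)  # h8, Eq. (2)

# Eq. (3): tanh(W1 x past_context + b1), applied to each of the 8 states
transformed = Dense(128, activation='tanh')(hidden_states)  # (?, 8, 128)

# Eq. (4): W2 maps each 128-d vector to a scalar score, softmax over time
scores = Dense(1)(transformed)          # (?, 8, 1), one score per state
weights = Softmax(axis=1)(scores)       # (?, 8, 1), normalized over the 8 steps

# Eq. (5): weighted sum of all hidden states, plus the last hidden state
weighted_sum = Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1)        # (?, 128)
)([hidden_states, weights])
final_embedding = Add()([weighted_sum, current])        # (?, 128)

model = Model(hidden_states, final_embedding)
```

Note that here the scores come from a learned linear layer applied to each transformed state, not from a dot product with h8.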

**Please verify my code against the description. Is it right?**

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense, Lambda, Dot, Add

def attention(lstm_hidden_states):
    # lstm_hidden_states: Tensor, shape (?, 8, 128)
    hidden_size = int(lstm_hidden_states.get_shape().as_list()[2])  # 128

    # last hidden state h8, shape (?, 128)
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,),
                 name='last_hidden_state')(lstm_hidden_states)

    # Eq. (3): tanh(W1 x past_context + b1), shape (?, 8, 128)
    transformed_context = Dense(hidden_size, use_bias=True, activation='tanh',
                                name='transformed_context_vec')(lstm_hidden_states)

    # dot product of h_t with each transformed state, shape (?, 8)
    score = Dot(axes=[1, 2], name='attention_score')([h_t, transformed_context])

    # Eq. (4): shape (?, 8)
    attention_weights = Dense(8, use_bias=True, activation='softmax',
                              name='attention_weight')(score)

    # Eq. (5): weighted sum of the hidden states, shape (?, 128)
    context_vector = Dot(axes=[1, 1],
                         name='context_vector')([lstm_hidden_states, attention_weights])

    # add the last hidden state, shape (?, 128)
    return Add(name='final_embedding')([context_vector, h_t])
```

Specifically, I am confused by the line `score = Dot(axes=[1, 2], name='attention_score')([h_t, transformed_context])`. Why are we taking a dot product here? The debug output for each line is attached as a comment.
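For intuition about what that `Dot(axes=[1, 2])` layer computes, here is a small NumPy sketch for a single example (batch dimension dropped): it takes the dot product of `h_t` with each of the 8 transformed hidden states, yielding one similarity score per time step. This is a Luong-style dot-product score, which differs from the learned `W2` scoring of Eq. (4) in the description.

```python
import numpy as np

rng = np.random.default_rng(0)
h_t = rng.random(128)               # last hidden state, shape (128,)
transformed = rng.random((8, 128))  # transformed context, shape (8, 128)

# Dot(axes=[1, 2]) contracts the two size-128 axes,
# leaving one similarity score per time step:
scores = transformed @ h_t          # shape (8,)

# equivalent to taking each dot product explicitly
manual = np.array([transformed[t] @ h_t for t in range(8)])
assert np.allclose(scores, manual)
```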