A couple of silly questions:
Let’s say I am using MultiHeadAttention in an encoder-decoder architecture. In this case, the decoder generates a single output token at a time. If this is correct, then why does the documentation say this:
attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension.
Is T == 1 in that case?
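To make the first question concrete, here is a sketch (pure Python, with a hypothetical `decode_step` standing in for the real decoder forward pass) of the autoregressive loop I have in mind, where the whole sequence generated so far is fed back in at every step:

```python
# Hypothetical stand-in for one decoder forward pass; the real model
# would run MultiHeadAttention over the full token sequence.
def decode_step(tokens):
    # Pretend the model predicts len(tokens) as the next token ID.
    return len(tokens)

START, END, MAX_LEN = 0, 99, 5
tokens = [START]
while tokens[-1] != END and len(tokens) < MAX_LEN:
    # The *entire* sequence so far is passed back in, so the target
    # length T seen by the attention layer grows on each iteration,
    # even though only one new token is produced per step.
    tokens.append(decode_step(tokens))
print(tokens)  # [0, 1, 2, 3, 4]
```

If this picture is right, T is not fixed at 1 during inference; it equals the number of tokens generated so far in that call.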
Also, how does use_causal_mask work? During inference, in order for TF to know what to mask, it has to know the current time-step (how many output tokens have already been generated) so it can mask out the rest, right? Where does it get that information?
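My best guess is that the causal mask is derived purely from the query’s own sequence length T, with no external time-step counter involved. A minimal NumPy sketch of that guess (the variable names are mine, not from the Keras source):

```python
import numpy as np

# Assumed: a lower-triangular (T, T) boolean mask where position i may
# attend to positions j <= i. T comes from the decoder input itself.
T = 4  # length of the decoder input passed in this particular call
causal_mask = np.tril(np.ones((T, T), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

If that is how it works, the "current time-step" question answers itself: the layer never needs one, because the mask shape is determined by however many tokens you pass in.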