MultiHeadAttention output shape & use_causal_mask

A couple of silly questions:

Let’s say I am using MultiHeadAttention in an encoder-decoder architecture. In that case, the decoder generates a single output token at a time. If that is correct, then why does the documentation say this:

> `attention_output`: The result of the computation, of shape `(B, T, E)`, where `T` is for target sequence shapes and `E` is the query input last dimension

Is `T == 1` during inference, then?
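For context, here is my mental model of the shapes, sketched with a plain NumPy scaled dot-product attention (my own toy function, not the Keras layer), where each query position produces one output vector:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (B, T, E); k, v: (B, S, E) -> output of shape (B, T, E)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (B, T, S)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over S
    return weights @ v                                        # (B, T, E)

B, T, S, E = 2, 5, 7, 16  # batch, target length, source length, embed dim
q = np.random.randn(B, T, E)   # decoder-side queries
k = np.random.randn(B, S, E)   # encoder-side keys
v = np.random.randn(B, S, E)   # encoder-side values

out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (2, 5, 16): one output row per query position
```

So as I understand it, `T` is just however many query positions I feed in, which is what makes me wonder whether it collapses to 1 during step-by-step decoding.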

Also, how does `use_causal_mask` work? During inference, in order for TF to know what to mask, it would have to know the current time-step (how many output tokens have already been generated) so it can mask the rest, right? Where does it get that information?
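Here is how I picture the causal mask being constructed: just a lower-triangular boolean matrix over however many tokens are passed in, so position `i` can attend only to positions `0..i`. This is my own NumPy sketch of that assumption, not the Keras internals — is this roughly what happens?

```python
import numpy as np

T = 4  # number of decoder tokens fed in so far (my assumption)
causal_mask = np.tril(np.ones((T, T), dtype=bool))  # True = allowed to attend
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

If that is right, then the mask size would be fully determined by the length of the input sequence itself, with no extra time-step bookkeeping needed.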