Is the TensorFlow multi-head attention layer autoregressive, e.g. “tfa.layers.MultiHeadAttention”?

I have looked at the difference between autoregressive and non-autoregressive transformer architectures, but I am wondering whether the attention layer in TensorFlow is actually autoregressive, or whether I need to implement the autoregressive (causal masking) mechanism myself.

I don’t see any option for causal masking (e.g. causal=True/False), or any indication of whether “tfa.layers.MultiHeadAttention” is autoregressive or not.

Any thoughts on that would be appreciated.

From the Keras documentation for the use_causal_mask call argument:

use_causal_mask: Boolean. Set to True for decoder self-attention. Adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past. Defaults to False.

You can set use_causal_mask=True when calling tf.keras.layers.Attention. Thank you.
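For illustration, here is a minimal sketch (assuming TensorFlow 2.10 or later, where use_causal_mask is exposed as a call-time argument on both tf.keras.layers.Attention and tf.keras.layers.MultiHeadAttention; the batch size, sequence length, feature size, and layer settings below are just placeholder values):

```python
# Minimal sketch (assumes TensorFlow >= 2.10, where use_causal_mask is a
# call-time argument rather than a constructor option).
import numpy as np
import tensorflow as tf

# Toy batch: 2 sequences of length 4 with 8 features each.
x = tf.random.normal((2, 4, 8))

# Built-in Keras attention with a causal mask applied at call time.
attn = tf.keras.layers.Attention()
causal_out, causal_scores = attn(
    [x, x],                      # [query, value] -> self-attention
    use_causal_mask=True,
    return_attention_scores=True,
)

# Scores above the diagonal are (numerically) zero: position i cannot
# attend to positions j > i, i.e. the layer behaves autoregressively.
print(np.triu(causal_scores[0].numpy(), k=1).max())  # ~0.0

# The multi-head layer in tf.keras (not tfa) accepts the same flag:
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
mha_out = mha(query=x, value=x, use_causal_mask=True)
print(causal_out.shape, mha_out.shape)  # (2, 4, 8) (2, 4, 8)
```

Because the mask is applied at call time, the same layer instance can be reused with or without causal masking, e.g. encoder self-attention without the flag and decoder self-attention with use_causal_mask=True.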