I am going through the Transformer code on tensorflow.org.
```python
def create_masks(self, inp, tar):
    # Encoder padding mask (used in the 2nd attention block in the decoder too).
    padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return padding_mask, look_ahead_mask
```
The Transformer class has a method called create_masks which creates the padding and look-ahead masks. I understand that the encoder's padding mask should be created from the input sequence (the input to the encoder). What I do not understand is why the encoder's input sequence should also be used to create the padding mask for the second attention block of the decoder (the first line of the method). I would think the decoder's padding mask should be created from the target sequence (which is fed to the decoder).
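For reference, here is a minimal NumPy sketch of what I understand the two mask helpers to do (the helper names follow the tutorial; the exact shapes are my reading of it, so treat this as an illustration rather than the tutorial's code):

```python
import numpy as np

def create_padding_mask(seq):
    # 1 where the token is padding (id 0), 0 elsewhere.
    # Shape (batch, 1, 1, seq_len) so it broadcasts over the attention logits.
    mask = (seq == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def create_look_ahead_mask(size):
    # Strictly upper-triangular matrix: position i may not attend to j > i.
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))

inp = np.array([[7, 6, 0, 0]])   # encoder input, padded with 0s
tar = np.array([[1, 2, 3]])      # decoder target, no padding here

enc_padding_mask = create_padding_mask(inp)        # masks encoder pad tokens
look_ahead = create_look_ahead_mask(tar.shape[1])  # masks future target tokens
dec_target_padding_mask = create_padding_mask(tar)
combined = np.maximum(dec_target_padding_mask, look_ahead)

print(enc_padding_mask.shape)  # (1, 1, 1, 4)
print(combined.shape)          # (1, 1, 3, 3)
```

The combined mask clearly uses the target sequence, which is why I expected the second attention block's padding mask to be built from `tar` as well.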
Please help me understand why this is done.