Masking propagation through layers

I’m confused about how masks are handled, because the behavior I’m seeing seems to conflict with my understanding.

Based on the above two points, my understanding is that a layer can consume a mask only if its call() method exposes a mask argument. In addition, if an upstream Masking layer produces a mask and a downstream layer cannot handle it, an exception should be raised.
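
To make that concrete, here is a minimal sketch (my own example, not from the docs) of what I understand a mask-consuming layer to look like:

```python
import tensorflow as tf

# My understanding: a layer consumes the mask only if its call() method
# exposes a `mask` argument. Keras then passes the mask produced by the
# upstream Masking layer into this argument automatically.
class MaskConsumingLayer(tf.keras.layers.Layer):
    def call(self, inputs, mask=None):
        if mask is not None:
            # mask has shape (batch, timesteps); zero out the masked timesteps.
            inputs = inputs * tf.cast(mask, inputs.dtype)[:, :, tf.newaxis]
        return inputs
```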

But the following example (modified from this example) doesn’t match that understanding: no exception is thrown, even though the call method has no mask argument (it does have attention_mask, but that is a different thing).

import tensorflow as tf
import numpy as np
import tensorflow_models as tfm

samples, timesteps, features = 32, 10, 8
inputs = np.random.random([samples, timesteps, features]).astype(np.float32)
# Zero out two timesteps so that Masking(mask_value=0.) marks them as masked.
inputs[:, 3, :] = 0.
inputs[:, 5, :] = 0.

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(mask_value=0.,
                                  input_shape=(timesteps, features)))
# This layer's call() has an attention_mask argument, but no mask argument.
model.add(tfm.nlp.models.TransformerEncoder(
    num_layers=1,
    num_attention_heads=2,
    intermediate_size=16,
))

output = model(inputs)

Why is no exception raised here? Is the Masking layer actually doing anything?
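
For what it’s worth, the Masking layer itself does compute a mask when I inspect it directly (a quick sanity check using the standard compute_mask layer API); what I can’t tell is whether that mask ever reaches the TransformerEncoder:

```python
# Sanity check: the Masking layer produces a boolean mask of shape
# (samples, timesteps), with False at the zeroed-out timesteps 3 and 5.
mask = model.layers[0].compute_mask(inputs)
print(mask.shape)       # (32, 10)
print(mask[0].numpy())  # [ True  True  True False  True False  True  True  True  True]
```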

Any help will be appreciated!