How to apply a hierarchical mask in TensorFlow 2.0 (tf.keras)

I am trying to build a hierarchical sequence model for time series classification (following the paper "Hierarchical Attention Networks for Document Classification"), but I'm confused about how to mask the hierarchical sequences.

My data is a hierarchical time series. Specifically, each sample consists of multiple sub-sequences, and each sub-sequence is a multivariate time series (analogous to word -> sentence -> document in NLP). So I need to pad, and therefore mask, at two levels: samples generally do not have the same number of sub-sequences, and sub-sequences do not have the same number of steps. After padding, my data looks like this:

array([[[[0.21799476, 0.26063576],
         [0.2170655 , 0.53772384],
         [0.18505535, 0.30702454],
         [0.22714901, 0.17020395],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.2160176 , 0.23789616],
         [0.2675753 , 0.21807681],
         [0.26932836, 0.21914595],
         [0.26932836, 0.21914595],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]]],

       [[[0.03941338, 0.3380829 ],
         [0.04766269, 0.3031088 ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]]]], dtype=float32)
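For completeness, this is roughly how I produce data of that shape: padding twice with `pad_sequences`, once at the step level and once at the sub-sequence level (a minimal sketch with illustrative variable names and random data):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen_seq = 8    # max steps per sub-sequence
maxlen_event = 2  # max sub-sequences per sample

# Ragged raw data: samples -> sub-sequences -> (steps, 2) arrays
raw = [
    [np.random.rand(4, 2), np.random.rand(4, 2)],
    [np.random.rand(2, 2)],
]

# First padding: pad each sub-sequence to maxlen_seq steps
padded = [
    pad_sequences(sample, maxlen=maxlen_seq, dtype='float32', padding='post')
    for sample in raw
]
# Second padding: pad each sample to maxlen_event sub-sequences
data = pad_sequences(padded, maxlen=maxlen_event, dtype='float32', padding='post')
print(data.shape)  # (2, 2, 8, 2)
```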

Then I build a hierarchical model as follows:

from tensorflow.keras.layers import Input, Masking, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model, Sequential

inputs = Input(shape=(maxlen_event, maxlen_seq, 2))
# Inner model: mask step-level padding, encode each sub-sequence with an LSTM
x = TimeDistributed(
        Sequential([
            Masking(),  # mask_value=0.0 by default, matching the zero padding
            LSTM(units=8, return_sequences=False)
        ])
    )(inputs)
# Outer LSTM over the sequence of sub-sequence encodings
x = LSTM(units=32, return_sequences=False)(x)
x = Dense(16, activation='relu')(x)
output = Dense(16, activation='sigmoid')(x)
model = Model(inputs, output)

As my data is padded on both dimensions, I don't know how to mask it correctly. I have two questions:
Q1: Inside TimeDistributed, am I using the Masking layer correctly to mask the first (step-level) padding?
Q2: How do I mask the second (sub-sequence-level) padding?
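For Q2, the only workaround I can think of is to compute the sub-sequence-level mask myself from the input (treating a sub-sequence as padding iff all of its values are zero) and pass it explicitly to the outer LSTM. A rough sketch of what I mean; I'm not sure this is the idiomatic way:

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Masking, LSTM, Dense,
                                     TimeDistributed, Lambda)
from tensorflow.keras.models import Model, Sequential

maxlen_event, maxlen_seq = 2, 8

inputs = Input(shape=(maxlen_event, maxlen_seq, 2))
x = TimeDistributed(
        Sequential([
            Masking(),
            LSTM(units=8, return_sequences=False)
        ])
    )(inputs)
# Sub-sequence-level mask: a sub-sequence is real if any value is non-zero.
# Shape: (batch, maxlen_event), dtype bool.
mask = Lambda(lambda t: tf.reduce_any(tf.not_equal(t, 0.0), axis=[-1, -2]))(inputs)
# Pass the mask explicitly to the outer LSTM
x = LSTM(units=32, return_sequences=False)(x, mask=mask)
x = Dense(16, activation='relu')(x)
output = Dense(16, activation='sigmoid')(x)
model = Model(inputs, output)
```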

Thank you.