Batch norm on variable-length sequences

Batch norm as implemented in Keras (and in most other frameworks, such as PyTorch) does not take masking or padding into account: the padded frames are included when the batch mean and variance are calculated, which skews those statistics.

See for example here or here for some more references.
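To make the difference concrete, here is a minimal NumPy sketch of what I mean by "proper masking". It is not the Keras or PyTorch implementation; the helper name `batch_stats`, the `(batch, time, features)` layout, and the zero-padding convention are my own assumptions for illustration.

```python
import numpy as np

def batch_stats(x, mask=None):
    """Per-feature mean and variance over the batch and time axes.

    x:    array of shape (batch, time, features), padded along time
    mask: optional array of shape (batch, time), True for real frames
    (hypothetical helper, for illustration only)
    """
    if mask is None:
        # Unmasked statistics: every frame counts, including padding.
        return x.mean(axis=(0, 1)), x.var(axis=(0, 1))
    m = mask[..., None].astype(x.dtype)   # broadcast mask over features
    n = m.sum()                           # number of real frames
    mean = (x * m).sum(axis=(0, 1)) / n
    var = (((x - mean) ** 2) * m).sum(axis=(0, 1)) / n
    return mean, var

# Toy batch: two sequences of lengths 4 and 1, zero-padded to length 4.
rng = np.random.default_rng(0)
lengths = np.array([4, 1])
x = rng.normal(loc=3.0, scale=1.0, size=(2, 4, 3))
mask = np.arange(4)[None, :] < lengths[:, None]   # shape (2, 4)
x = x * mask[..., None]                           # zero out padded frames

naive_mean, naive_var = batch_stats(x)
masked_mean, masked_var = batch_stats(x, mask)
```

With zero padding, the unmasked mean is the masked mean scaled by the fraction of real frames (5 of 8 in this toy batch), so the more padding a batch contains, the further the statistics drift from those of the actual data.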

Has anyone ever tested how much this matters, compared to using proper masking when calculating the statistics?