How does masking work in TensorFlow/Keras?

I have difficulty understanding how exactly masking works in TensorFlow/Keras. On the Keras website (Masking and padding with Keras | TensorFlow Core) they simply say that the neural network layers skip/ignore the masked values, but they don't explain how. Does it force the weights to zero? (I know a boolean array is being created, but I don't know how it's being used.)

For example, consider this simple snippet:

import numpy as np
import tensorflow as tf

tf.random.set_seed(1)

embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(np.array([[1, 2, 0]]))
print(masked_output)

I asked the Embedding layer to mask zero inputs. Now look at the output:

tf.Tensor(
[[[ 0.00300496 -0.02925059 -0.01254098]
  [ 0.04872786  0.01087702 -0.03656749]
  [ 0.00446818  0.00290152 -0.02269397]]], shape=(1, 3, 3), dtype=float32)

If you change the `mask_zero` argument to False, you get the exact same results. Does anyone know what's happening behind the scenes? Any resources explaining the masking mechanism more thoroughly are highly appreciated.

P.S.: Here is also an example of a full neural network which gives an identical outcome with and without masking:

import numpy as np
import tensorflow as tf

tf.random.set_seed(1)

inputs = np.array([[1, 2, 0]])  # <--- 0 should be masked and ignored
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
masked_output = embedding(inputs)
flatten = tf.keras.layers.Flatten()(masked_output)
dense_middle = tf.keras.layers.Dense(4)(flatten)
out = tf.keras.layers.Dense(1)(dense_middle)
print(out)

It looks to me like the above is OK. The code you show creates the structure of your model, i.e. the graph. And if you run `masked_output._keras_mask` you'll see that the mask has been created: `<tf.Tensor: shape=(1, 3), dtype=bool, numpy=array([[ True, True, False]])>`, and it will actually be used when you start using your graph.
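
For example (a quick check on my side, reusing the same embedding as in your snippet), you can look at that mask either through the `_keras_mask` attribute or through the layer's `compute_mask` method:

```python
import numpy as np
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)
inputs = np.array([[1, 2, 0]])
masked_output = embedding(inputs)

# The embedding values themselves are unchanged; only a boolean mask is attached.
print(masked_output._keras_mask)       # tf.Tensor([[ True  True False]], ...)
print(embedding.compute_mask(inputs))  # same boolean mask via the public API
```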

As for what is going on behind the scenes when masking is used: your mask defines some timesteps to skip (as they explain on the TF page you mentioned), so it doesn't involve complex additional operations, as one might come to think.
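
As a rough sketch (my own illustration of the idea, not the actual Keras kernel code): for a recurrent layer, "skipping" a timestep essentially means carrying the previous state through wherever the mask is `False`, along these lines:

```python
import tensorflow as tf

def apply_mask_to_step(prev_state, candidate_state, mask_t):
    """Blend the recurrent state for one timestep using the boolean mask.

    prev_state, candidate_state: (batch, units); mask_t: boolean (batch,).
    At masked (False) positions the previous state is kept, i.e. the step
    has no effect; at unmasked positions the freshly computed state is used.
    """
    m = tf.cast(mask_t, candidate_state.dtype)[:, tf.newaxis]  # (batch, 1)
    return m * candidate_state + (1.0 - m) * prev_state
```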

Someone on Stack Exchange commented that one can think of masking as a form of dropout, where the output of a node is nullified.
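
In that spirit, a mask-aware pooling layer simply nullifies the masked positions before reducing; a hand-rolled version (again just an illustration, not the library code) looks like this:

```python
import tensorflow as tf

def masked_mean(sequence, mask):
    """Average over timesteps while ignoring masked positions.

    sequence: (batch, timesteps, features); mask: boolean (batch, timesteps).
    """
    m = tf.cast(mask, sequence.dtype)[..., tf.newaxis]   # (batch, timesteps, 1)
    summed = tf.reduce_sum(sequence * m, axis=1)         # masked steps contribute zero
    counts = tf.maximum(tf.reduce_sum(m, axis=1), 1.0)   # avoid dividing by zero
    return summed / counts
```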

Also, if useful, the TF team explains it in the code defining the `Masking` layer:

Masks a sequence by using a mask value to skip timesteps.
For each timestep in the input tensor (dimension #1 in the tensor),
if all values in the input tensor at that timestep
are equal to `mask_value`, then the timestep will be masked (skipped)
in all downstream layers (as long as they support masking).
If any downstream layer does not support masking yet receives such
an input mask, an exception will be raised.
Example:
Consider a Numpy data array `x` of shape `(samples, timesteps, features)`,
to be fed to an LSTM layer. You want to mask timestep #3 and #5 because you
lack data for these timesteps. You can:
- Set `x[:, 3, :] = 0.` and `x[:, 5, :] = 0.`
- Insert a `Masking` layer with `mask_value=0.` before the LSTM layer:
```python
import numpy as np
import tensorflow as tf

samples, timesteps, features = 32, 10, 8
inputs = np.random.random([samples, timesteps, features]).astype(np.float32)
inputs[:, 3, :] = 0.
inputs[:, 5, :] = 0.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(mask_value=0.,
                                  input_shape=(timesteps, features)))
model.add(tf.keras.layers.LSTM(32))
output = model(inputs)
# The time step 3 and 5 will be skipped from LSTM calculation.
```
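
And to convince yourself that the mask really is consumed downstream (even though the raw Embedding output in your first snippet looks identical), a comparison along these lines, with a mask-aware layer such as an LSTM after the embedding, should print two different results:

```python
import numpy as np
import tensorflow as tf

inputs = np.array([[1, 2, 0]])

def run(mask_zero):
    tf.random.set_seed(1)  # same initial weights for both runs
    emb = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=mask_zero)
    lstm = tf.keras.layers.LSTM(4)
    # The LSTM picks up the mask attached by the Embedding layer automatically.
    return lstm(emb(inputs))

print(run(mask_zero=True))   # the padded timestep (index 0) is skipped
print(run(mask_zero=False))  # the padded timestep is processed like any other
```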

Hopefully the above is helpful.

Thanks, but I still can't see the math behind these operations. For me, neural networks are all about matrix multiplication. The things I want to know are:

  1. Is masking done by forcing some weights to zero? Is this how dropout works?
  2. How is it different from simply using the padded values that are zero, without masking? (Zero times anything is zero again, hence I suppose the answer to my first question should be no, because there is no point in forcing the weights to zero when the input is zero itself; see the quick check after this list.)
  3. I assume masking didn't suddenly emerge out of thin air, so I want to know who invented the method. Are there any academic papers talking about it?
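
To make point 2 concrete, here is the quick check I have in mind (reusing the embedding from my first snippet): the padded index 0 is looked up like any other index, so after the embedding layer the "zeros" are not actually zero vectors anymore, which is exactly the part that confuses me:

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(1)
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=3, mask_zero=True)

# Index 0 maps to a (random, non-zero) row of the embedding matrix -- the same
# vector that shows up as the third row of the output printed above.
print(embedding(np.array([[0]])))
```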