Can't get action masking to work with PPO agent from tf-agents?

Hello, I am trying to train a PPO agent from the tf-agents library on a dataset consisting of graphs of different sizes. To handle the varying sizes I pad the observations to a fixed maximum and use action masking via the mask_splitter_network.MaskSplitterNetwork class for both the actor and the value network.
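
The padding step looks roughly like this (a sketch with hypothetical names; MAX_ACTIONS and the feature layout are just illustrative):

import numpy as np

MAX_ACTIONS = 15  # padded size of the action space (illustrative)

def pad_observation(node_features):
    # Pad the per-graph features up to MAX_ACTIONS rows and mark the padded
    # slots as invalid in the mask.
    num_real = node_features.shape[0]
    padded = np.zeros((MAX_ACTIONS,) + node_features.shape[1:], dtype=np.float32)
    padded[:num_real] = node_features
    mask = np.zeros(MAX_ACTIONS, dtype=np.int32)
    mask[:num_real] = 1
    return {'observation': padded, 'mask': mask}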

Everything seems to work fine until after the first training loop. After that, the actor network starts outputting illegal actions (the action space is discrete).

For example, the input mask might be [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] but the logits come out as something like
[-3.4e+38, -3.4e+38, -3.4e+38, …, -3.4e+38, 0.22, 0.256, 0.2, …, -3.4e+38]
i.e. the huge negative values end up on actions the mask marks as legal.
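
For reference, the behaviour I expect from the masking is that the invalid entries get pushed to the float32 minimum (about -3.4e+38) while the valid ones keep their raw logits, something like this sketch (illustrative values, not actual network output):

import tensorflow as tf

mask = tf.constant([1] * 9 + [0] * 6, dtype=tf.bool)
raw_logits = tf.random.normal([15])  # stand-in for the actor head's output

# Expected: only the last six entries get pushed to -3.4e+38, so only legal
# actions remain sample-able.
masked_logits = tf.where(mask, raw_logits, tf.float32.min)
print(masked_logits)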

Here’s code for the networks:

import tensorflow as tf
from tf_agents.networks import (actor_distribution_network,
                                mask_splitter_network, value_network)

# Create the masked networks for the PPO agent
wrap_actor_net = actor_distribution_network.ActorDistributionNetwork(
    input_tensor_spec=train_env.observation_spec()['observation'],
    output_tensor_spec=train_env.action_spec(),
    fc_layer_params=(100,),
    activation_fn=tf.keras.activations.tanh)

# Actor: the splitter strips the mask off the observation, and
# passthrough_mask=True forwards it so the action distribution can be masked.
actor_net = mask_splitter_network.MaskSplitterNetwork(
    splitter_fn,
    wrap_actor_net,
    passthrough_mask=True)

wrap_value_net = value_network.ValueNetwork(
    input_tensor_spec=train_env.observation_spec()['observation'],
    fc_layer_params=(100,),
    activation_fn=tf.keras.activations.tanh)

# Value: the mask is split off and discarded (the value head doesn't need it).
value_net = mask_splitter_network.MaskSplitterNetwork(
    splitter_fn,
    wrap_value_net,
    passthrough_mask=False)
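
For completeness, the networks are wired into the agent roughly like this (a simplified sketch using PPOClipAgent; the optimizer and hyperparameters are placeholders, not my actual values):

from tf_agents.agents.ppo import ppo_clip_agent

agent = ppo_clip_agent.PPOClipAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net,
    num_epochs=10)
agent.initialize()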

The splitter function simply separates the observation from the mask:

def splitter_fn(observation_and_mask):
    return observation_and_mask['observation'], observation_and_mask['mask']
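
This assumes the environment's observation spec is a dict with matching keys, roughly like this (the sizes here are just illustrative):

observation_spec = {
    'observation': tf.TensorSpec(shape=(15, 8), dtype=tf.float32),  # padded features
    'mask': tf.TensorSpec(shape=(15,), dtype=tf.int32),             # 1 = legal action
}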

Again, this error only happens after the first training loop; before that, the network outputs only legal actions.
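
In case it helps, this is roughly how the problem shows up when I inspect the policy after a training iteration (a sketch; it assumes the collect policy's action distribution is a Categorical and reuses the agent/env objects from above):

time_step = train_env.reset()
dist_step = agent.collect_policy.distribution(time_step)
logits = dist_step.action.logits_parameter()
mask = tf.cast(time_step.observation['mask'], tf.bool)

# Any logit above the float32 minimum on a masked-out entry means an illegal
# action can still be sampled.
illegal_reachable = tf.logical_and(tf.logical_not(mask), logits > tf.float32.min)
print(tf.reduce_any(illegal_reachable))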

Any help is greatly appreciated! Thanks.