MirroredStrategy dict-of-tensors Metric - Cryptic Error?

Hi folks, I've hit a cryptic bug that I could use some help puzzling out, if anyone has time.

I have a custom Metric whose `result()` returns a dict of tensors.

    def result(self):
        metric_result = {}  # dict of tensors to return
        # question_weights is a dict of named tensors, created in __init__
        for weight in self.question_weights.values():
            metric_result[weight.name] = weight
        return metric_result
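For reference, here's a stripped-down, self-contained version of the pattern (the class and weight names are placeholders, not my real `LossPerQuestion` code) that behaves correctly on a single device:

```python
import tensorflow as tf

# Minimal sketch of the pattern: a Metric that tracks one weight per
# "question" and returns all of them as a dict from result().
# (Names and the update logic are made up for illustration.)
class DictMetric(tf.keras.metrics.Metric):
    def __init__(self, n_questions=3, name='loss_per_question', **kwargs):
        super().__init__(name=name, **kwargs)
        self.question_weights = {
            f'question_{i}_loss': self.add_weight(
                name=f'question_{i}_loss', initializer='zeros')
            for i in range(n_questions)
        }

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Placeholder update: accumulate mean absolute error into every weight
        err = tf.reduce_mean(tf.abs(y_true - y_pred))
        for w in self.question_weights.values():
            w.assign_add(err)

    def result(self):
        # Build the dict keyed by the dict keys (avoids the ':0' suffix
        # that weight.name carries)
        metric_result = {}
        for key, weight in self.question_weights.items():
            metric_result[key] = weight
        return metric_result
```

On one GPU this evaluates fine; the failure below only appears under MirroredStrategy.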

When (and only when) using distributed training, I get the error:

    .../keras/utils/metrics_utils.py", line 177, in merge_fn_wrapper
        return tf.identity(result)

    TypeError: Expected any non-tensor type, but got a tensor instead.

I know this is custom code, but it:

  • works correctly on a single GPU
  • works correctly on multiple GPUs when returning a single tensor (i.e. `metric_result[some_key]`)

It only fails when returning a dict of tensors on multi-GPU.

A few obvious things I’ve checked:

The Metric is defined inside the MirroredStrategy() scope (context manager):


    with context_manager:
        ...
        # define this within the context manager, so it is also mirrored on multi-GPU
        extra_metrics = [
            custom_metrics.LossPerQuestion(
                name='loss_per_question',
                question_index_groups=schema.question_index_groups
            )
        ]

        model.compile(
            loss=loss,
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
            metrics=extra_metrics
        )

tf.print shows the dict looks okay, by eye:


{'questions/question_0_loss:0': 2.99033308,
 'questions/question_1_loss:0': 0.723048568,
 'questions/question_2_loss:0': 0.846625209,
...
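If it helps clarify what I mean by the single-tensor case working: a fallback I'm considering is splitting the dict into one Metric per question, so each `result()` returns a single tensor (which does sync correctly on multi-GPU). A hypothetical sketch, with placeholder names and loss logic:

```python
import tensorflow as tf

# Hypothetical workaround: one Metric instance per question, each
# returning a single tensor from result(). The loss computation here
# is a placeholder (mean absolute error on one column), not my real code.
class SingleQuestionLoss(tf.keras.metrics.Metric):
    def __init__(self, question_index, name=None, **kwargs):
        super().__init__(name=name or f'question_{question_index}_loss', **kwargs)
        self.question_index = question_index
        self.total = self.add_weight(name='total', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Placeholder per-question loss: abs error on this question's column
        loss = tf.reduce_mean(tf.abs(
            y_true[:, self.question_index] - y_pred[:, self.question_index]))
        self.total.assign_add(loss)
        self.count.assign_add(1.0)

    def result(self):
        return self.total / self.count  # single tensor, so the merge works

# extra_metrics = [SingleQuestionLoss(i) for i in range(n_questions)]
```

But that feels clunky compared to one metric returning a dict, so I'd still like to understand the error.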

Any idea what I’m missing here?

Using TF 2.10.0 (latest stable)

Raised as an issue here