Keras: slow startup for large models due to metric-initialization

dsuess · August 19, 2021, 7:39am

When creating large models (a couple thousand nodes) in graph mode, initializing the metrics can take a very long time. The following toy example takes ~30 seconds on my machine (TF 2.6) to start training:

import tensorflow as tf
import numpy as np
from tensorflow.python.keras import backend as K


with K.get_session() as sess:
    print("DEF")
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1) for _ in range(500)]
    )
    print("METRICS")
    metrics = [tf.keras.metrics.Accuracy(str(i)) for i in range(100)]

    print("COMPILE")
    model.compile(loss="mse", metrics=metrics, run_eagerly=False)
    x, y = np.zeros((2, 1000), dtype=np.float32)
    print("FIT")
    model.fit(x=x, y=y)

Most of the startup time is spent in this loop initializing the metrics.

In the actual model I am currently investigating, startup takes ~20 minutes since it’s quite a large model with data loading included in the graph and ~400 metrics. The latter is due to having 4 per-class metrics for ~100 classes. This time quadruples when adding another GPU with MirroredStrategy. What could I do to improve startup time in this case? So far, I’ve tried:

Running in eager mode, which works fine on a single GPU, but scaling out is going to be more challenging.
Creating one metric class for all classes so that I only need to register 4 metrics. But it doesn’t seem to be possible for metrics to return arrays.

dsuess · August 20, 2021, 4:24am

Turns out it’s only a problem with TensorFlow 1.x graph mode. Removing the line "with K.get_session() as sess:" fixes it.
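
For reference, here is a rough sketch of the toy example with the session context removed, so that Keras uses its default TF 2.x execution path. The time_startup helper, its parameters, the acc_{i} metric names and the per-phase timing are my own additions for illustration and are not part of Keras. The inputs are created with an explicit (num_samples, 1) shape, which is an assumption on my side about what the original example intended. Timings will vary by machine and TF version.

```python
import time

import numpy as np
import tensorflow as tf


def time_startup(num_layers=500, num_metrics=100, num_samples=1000):
    """Build, compile and fit the toy model without a TF1 session.

    Returns a dict with the wall-clock seconds spent in each phase.
    (Hypothetical helper, only meant to show where the time goes.)
    """
    timings = {}

    # Phase 1: define the model and the metric objects.
    start = time.perf_counter()
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1) for _ in range(num_layers)]
    )
    metrics = [
        tf.keras.metrics.Accuracy(name=f"acc_{i}") for i in range(num_metrics)
    ]
    timings["define"] = time.perf_counter() - start

    # Phase 2: compile in graph mode (run_eagerly=False), as in the original.
    start = time.perf_counter()
    model.compile(loss="mse", metrics=metrics, run_eagerly=False)
    timings["compile"] = time.perf_counter() - start

    # Phase 3: one epoch of fit; this includes tracing the training function
    # and initializing the metrics, so it covers the startup cost.
    x = np.zeros((num_samples, 1), dtype=np.float32)
    y = np.zeros((num_samples, 1), dtype=np.float32)
    start = time.perf_counter()
    model.fit(x=x, y=y, verbose=0)
    timings["fit"] = time.perf_counter() - start

    return timings
```

Calling print(time_startup()) runs the same sizes as the toy example above; smaller values for num_layers and num_metrics give a quick sanity check.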