When creating large models (a couple of thousand nodes) in graph mode, initializing the metrics can take a very long time. The following toy example takes ~30 seconds on my machine (TF 2.6) before training starts:
```python
import tensorflow as tf
import numpy as np
from tensorflow.python.keras import backend as K

with K.get_session() as sess:
    print("DEF")
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1) for _ in range(500)]
    )
    print("METRICS")
    metrics = [tf.keras.metrics.Accuracy(str(i)) for i in range(100)]
    print("COMPILE")
    model.compile(loss="mse", metrics=metrics, run_eagerly=False)
    x, y = np.zeros((2, 1000), dtype=np.float32)
    print("FIT")
    model.fit(x=x, y=y)
```
Most of the startup time is spent in this loop initializing the metrics.
In the actual model I am currently investigating, startup takes ~20 minutes, since it's quite a large model with data loading included in the graph and ~400 metrics. The latter comes from having 4 per-class metrics for each of ~100 classes. This time quadruples when adding another GPU with `MirroredStrategy`. What could I do to improve startup time in this case? So far, I've tried:
- running in eager mode, which works fine on a single GPU, but scaling out is going to be more challenging
- creating one metric class covering all classes so that I only need to register 4 metrics; but it doesn't seem to be possible for metrics to return arrays
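For the second attempt, this is roughly what I tried (a sketch; the class name and internals are my own, not an existing Keras API). One metric object accumulates correct/total counts for all classes in vector-shaped weights, and its `result()` returns a per-class accuracy vector, which is what the `compile`/`fit` pipeline does not accept:

```python
import tensorflow as tf

class PerClassAccuracy(tf.keras.metrics.Metric):
    """One metric object tracking accuracy for all classes at once."""

    def __init__(self, num_classes, name="per_class_accuracy", **kwargs):
        super().__init__(name=name, **kwargs)
        self.num_classes = num_classes
        # Vector-shaped accumulators: one slot per class.
        self.correct = self.add_weight(
            name="correct", shape=(num_classes,), initializer="zeros")
        self.total = self.add_weight(
            name="total", shape=(num_classes,), initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(tf.reshape(y_true, [-1]), tf.int32)
        pred = tf.cast(tf.argmax(y_pred, axis=-1), tf.int32)
        matches = tf.cast(tf.equal(y_true, pred), tf.float32)
        # Scatter the per-sample hits/counts into their class slots.
        self.correct.assign_add(
            tf.math.unsorted_segment_sum(matches, y_true, self.num_classes))
        self.total.assign_add(
            tf.math.unsorted_segment_sum(
                tf.ones_like(matches), y_true, self.num_classes))

    def result(self):
        # Returning a vector here is the part Keras rejects:
        # fit() expects each metric to produce a scalar.
        return tf.math.divide_no_nan(self.correct, self.total)
```

Standalone (outside `fit`) the metric works as expected, so the accumulation itself is not the problem, only the non-scalar `result()`.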