I have a few questions about the `steps_per_execution` argument in Keras:
- Why should this argument not always be set to a very high number?
- What impact does setting `steps_per_execution` to a high number have on memory, CPU, and device resource utilization?
- Are there any concerns about model accuracy when using a very high `steps_per_execution`, or will models with different `steps_per_execution` values always converge to the same metrics? (In contrast, very large batch sizes can negatively impact model performance, as discussed in this discussion and paper.)
- For distributed strategies such as `TPUStrategy`, is there any concern about setting a very large `steps_per_execution`? When do the gradient all-reduces happen across pod devices when using large `steps_per_execution` values? Does the `optimizer.apply_gradients` behavior change with large