I have a simple Monte Carlo simulation model implemented in TensorFlow (v2.5). The calculations consist of tensor multiplications inside a tf.while_loop. I am trying to benchmark the performance using XLA on a CPU with 8 cores. What I observe is that when I run the code in graph mode without XLA compilation, the CPU utilization is close to 800% (all cores fully utilized). However, when I run the same graph with XLA compilation (jit_compile=True), the CPU utilization falls to approximately 200%. Is there a way to force XLA to fully utilize all cores?
Note: I have experimented with changing the inter_op_parallelism and intra_op_parallelism thread settings. Setting both to 1 reduces the CPU utilization from 200% to 100%, but increasing them to 8 does not raise the utilization above 200%.
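
A minimal sketch of the kind of comparison described above, not the actual model. The names `cpu_utilization` and `build_fns`, the matrix size, the step count, and the matmul loop body are all assumptions chosen for illustration. Utilization is estimated as process CPU time divided by wall-clock time, which is roughly the figure a tool like top reports as 100% per busy core.

```python
import time


def cpu_utilization(fn, repeats=5):
    """Estimate CPU utilization (percent) of calling fn() `repeats` times.

    Uses process CPU time (summed over all threads) divided by wall time,
    so a value near 800 would mean about eight busy cores.
    """
    fn()  # warm-up call, so tracing / XLA compilation is not timed
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(repeats):
        fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return 100.0 * cpu / wall


def build_fns(n=512, n_steps=200, threads=8):
    """Return (graph_fn, xla_fn) for a hypothetical stand-in workload.

    TensorFlow is imported here so that the timing helper above can be
    used without it. The thread settings must be applied before
    TensorFlow executes its first op; otherwise they raise a RuntimeError.
    """
    import tensorflow as tf

    tf.config.threading.set_intra_op_parallelism_threads(threads)
    tf.config.threading.set_inter_op_parallelism_threads(threads)

    # Column sums are close to 1, so repeated products stay in a sane range.
    a = tf.random.uniform([n, n], maxval=2.0 / n)

    def simulate():
        def cond(i, x):
            return i < n_steps

        def body(i, x):
            # Assumed loop body: one matrix multiplication per step.
            return i + 1, tf.matmul(x, a)

        _, x = tf.while_loop(cond, body, (tf.constant(0), tf.ones([n, n])))
        return tf.reduce_sum(x)

    graph_fn = tf.function(simulate)                   # graph mode, no XLA
    xla_fn = tf.function(simulate, jit_compile=True)   # graph mode, XLA-compiled
    return graph_fn, xla_fn
```

With TensorFlow installed, calling `graph_fn, xla_fn = build_fns()` and then `cpu_utilization(graph_fn)` and `cpu_utilization(xla_fn)` in a fresh process should give two numbers that can be compared directly. Whether this stand-in shows the same 800% versus 200% gap as the real model is not guaranteed.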