TensorBoard unable to see Debugger V2 data

Hi,

I’m using tf-nightly-gpu and tb-nightly and am trying to debug a numerically unstable model. I’d like to see where NaNs occur in the computation graph, so I’ve enabled debug dumping:

tf.debugging.experimental.enable_dump_debug_info("debug", tensor_debug_mode="FULL_HEALTH", circular_buffer_size=-1)

I run one epoch of training and then start TensorBoard:

tensorboard --logdir=debug

I enable the Debugger V2 plugin but am always greeted with:

Debugger V2 is inactive because no data is available.

Here are the files written to the debug/ directory:

root@ce9fb22e47b0:/projects/FasterRCNN/tf2# ls debug -alh
total 195M
drwxr-xr-x 2 root root 4.0K Jan  4 18:37 .
drwxrwxr-x 7 1000 1000 4.0K Jan  4 18:40 ..
-rw-r--r-- 1 root root 5.6M Jan  4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.execution
-rw-r--r-- 1 root root 109M Jan  4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.graph_execution_traces
-rw-r--r-- 1 root root  18M Jan  4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.graphs
-rw-r--r-- 1 root root   71 Jan  4 18:26 tfdbg_events.1641320809.ce9fb22e47b0.metadata
-rw-r--r-- 1 root root 7.1M Jan  4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.source_files
-rw-r--r-- 1 root root  84K Jan  4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.stack_frames
-rw-r--r-- 1 root root 3.1M Jan  4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.execution
-rw-r--r-- 1 root root  18M Jan  4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.graph_execution_traces
-rw-r--r-- 1 root root  29M Jan  4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.graphs
-rw-r--r-- 1 root root   71 Jan  4 18:37 tfdbg_events.1641321476.ce9fb22e47b0.metadata
-rw-r--r-- 1 root root 7.1M Jan  4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.source_files
-rw-r--r-- 1 root root  85K Jan  4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.stack_frames
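
As a quick sanity check that each run in the dump directory is complete, something like the following could be used. This is only a sketch: it assumes the file names follow the tfdbg_events.<timestamp>.<hostname>.<suffix> pattern seen in the listing above, and that the six suffixes in that listing are the full set to expect. The names group_dump_files and missing_suffixes are made up for illustration and are not part of TensorFlow or TensorBoard.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: the six suffixes visible in the directory listing are the
# complete set written per run.
EXPECTED_SUFFIXES = {
    "execution",
    "graph_execution_traces",
    "graphs",
    "metadata",
    "source_files",
    "stack_frames",
}


def group_dump_files(dump_dir):
    """Map each run timestamp to the set of file suffixes found for it."""
    runs = defaultdict(set)
    for path in Path(dump_dir).glob("tfdbg_events.*"):
        parts = path.name.split(".")
        # Expected shape: tfdbg_events.<timestamp>.<hostname>.<suffix>
        if len(parts) < 4:
            continue
        timestamp, suffix = parts[1], parts[-1]
        runs[timestamp].add(suffix)
    return dict(runs)


def missing_suffixes(dump_dir):
    """Return {timestamp: missing suffixes} for runs that look incomplete."""
    missing = {}
    for timestamp, found in group_dump_files(dump_dir).items():
        gap = EXPECTED_SUFFIXES - found
        if gap:
            missing[timestamp] = gap
    return missing


if __name__ == "__main__":
    # "debug" is the dump directory used in this post.
    print(missing_suffixes("debug"))
```

For the listing above, both timestamps have all six files, so this check would come back empty, which suggests the files themselves are not the problem.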

There is no indication of any other error from either TensorBoard or TensorFlow during training.

Thank you,

Bart

I get the same error and can’t find any useful docs on the internet. I hope someone can help with this!

I eventually solved it. The cause seemed to be an incompatibility between the TensorBoard and TensorFlow versions. I can’t remember exactly what I did, but I ended up either matching the two (e.g., both nightly or both release) or using the latest nightly build of TensorFlow with the latest stable release of TensorBoard. I believe I was initially running an older nightly build of each, and those two builds did not work together.
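
For anyone hitting the same thing, a first step might be to print which TensorFlow and TensorBoard packages are actually installed. Below is a minimal sketch using only the standard library. The list of distribution names is an assumption (nightly and release builds are published under different names), and comparing only the major.minor part of the version is a rough heuristic of mine, not an official compatibility rule.

```python
from importlib import metadata

# Assumption: these are the distribution names worth checking.
PACKAGES = (
    "tensorflow",
    "tf-nightly",
    "tf-nightly-gpu",
    "tensorboard",
    "tb-nightly",
)


def installed_versions(names=PACKAGES):
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            pass
    return found


def major_minor(version):
    """Return the first two version components, e.g. '2.8.0.dev1' -> ('2', '8')."""
    return tuple(version.split(".")[:2])


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}=={version}")
    # Rough heuristic: flag differing major.minor across the installed packages.
    if len({major_minor(v) for v in versions.values()}) > 1:
        print("Warning: installed packages differ in major.minor version.")
```

If the output shows a mix of an old nightly of one package and a newer build of the other, upgrading both together is probably the simplest thing to try.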
