TPU Issue: "ResourceExhaustedError: received trailing metadata size exceeds limit"

TofuC · May 26, 2022, 3:15am

Hi! This is my first time training with a TPU in Colab and I am facing an error I have never seen before. The error in question:

I can’t seem to figure out where the error is coming from. I thought it might be a generic ResourceExhaustedError, so I tried shrinking my model drastically (<400k total params), but I still get the same error. It occurs during the training loop for a U-Net style CNN. Any ideas on what I could look into? I can’t find others who have had this error before, and I am not sure what a ResourceExhaustedError has to do with trailing metadata. I will try to put together a simple example notebook to share later if we can’t figure it out here. Thanks!

TofuC · May 26, 2022, 2:13pm

Nothing yet. Everyone I have found with this issue had it in a completely different context (Google Ads API), and it seems to be a general error that often obscures the underlying cause. Why I would get this error in the context of TPU usage is beyond me. It may be a Google Colab or Node JS server-side error, but I really have no clue (and am not sure who to ask).

TofuC · May 30, 2022, 6:27pm

Just updating on this: I ended up finding a solution that involved downscaling my input stream. Interestingly, a 3D UNet with (80,160,160)-sized patches gave me this error, but with (40,80,80) patches I was able to make it work (although I have lots of bugs and many of my metrics are NaN). The funny thing is that I can use huge batch sizes with (40,80,80) patches but can’t even run a minimum-size batch (1 per core) of (80,160,160). It must be something about how the TPU works that I don’t understand. There seem to be lots of bugs that need sorting out, but I hope this can help someone.
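
For a sense of scale, here is a rough sketch of how much data each patch size carries. It only does the arithmetic on the two shapes mentioned above and does not explain why the TPU raises this particular error. The helper name `patch_bytes`, the single channel, and the float32 (4-byte) element size are assumptions made for illustration, not details from the actual model.

```python
from math import prod

def patch_bytes(patch_shape, channels=1, bytes_per_value=4):
    """Size in bytes of one dense patch (assumed float32, assumed 1 channel)."""
    return prod(patch_shape) * channels * bytes_per_value

large = (80, 160, 160)  # patch size that triggered the error
small = (40, 80, 80)    # patch size that worked

# Halving each of the three spatial dimensions divides the voxel count by 2**3.
ratio = prod(large) // prod(small)

print("voxels per large patch:", prod(large))
print("voxels per small patch:", prod(small))
print("large / small ratio:", ratio)
print("bytes per large patch:", patch_bytes(large))
```

So one (80,160,160) patch holds as many voxels as eight (40,80,80) patches, and every per-voxel tensor the network produces for it scales the same way.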

a1Hassan · May 31, 2022, 12:36pm

Same issue with the large version of EfficientNet (B7), which requires (600,600) input resolution.

happycube · July 4, 2022, 3:32am

I just ran into this issue on Colab while using train_on_batch. Looking at the stack trace, I found that it happened in that function while decoding the logs. An alternate version that I copied out of it, which doesn’t do anything with the logs, throws ignored Exceptions on cleanup, but it seems to keep running.

paujin_track · May 16, 2023, 7:37am

Thanks for this kind of info.