Minimal TF Example floods GPU RAM on Manjaro Linux

Hi TF community,

I am new to Transformers and TF, so I am still trying to get a grasp of them. Ultimately I would like to fine-tune a pretrained MLM and run similarity metrics on the embeddings.

TL;DR: training runs out of GPU memory. How should I run the MLM script to fine-tune a pretrained model on a single text file?

My minimal example of MLM fine-tuning on a single .txt file runs out of GPU RAM. Here is what I did:

  1. Cloned the Hugging Face transformers git repo to my local drive
  2. Installed the pip requirements
  3. Ran the example command line from the module's README, shown below:
python run_mlm.py --model_name_or_path="bert-base-german-cased" --output_dir="tf-out" --train="tf-in/plenar.txt"

Both path parameters are relative to my working directory. The input is a small text file of 800 lines.

The output is:

(tensorflow-mlm) ➜  language-modeling git:(main) ✗ python run_mlm.py --model_name_or_path="bert-base-german-cased" --output_dir="tf-out" --train="tf-in/plenar.txt" 
2022-06-21 01:13:36.460315: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.498503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.498691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.499042: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-21 01:13:36.499936: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.500100: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.500222: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.853138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.853329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.853486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-21 01:13:36.853573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2115 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
Using custom data configuration default-99b4571e1b72576f
Reusing dataset text (/home/gnom/.cache/huggingface/datasets/text/default-99b4571e1b72576f/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1342.18it/s]
loading configuration file https://huggingface.co/bert-base-german-cased/resolve/main/config.json from cache at /home/gnom/.cache/huggingface/transformers/98877e98ee76b3977d326fe4f54bc29f10b486c317a70b6445ac19a0603b00f0.1f2afedb22f9784795ae3a26fe20713637c93f50e2c99101d952ea6476087e5e
Model config BertConfig {
  "_name_or_path": "bert-base-german-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30000
}

loading configuration file https://huggingface.co/bert-base-german-cased/resolve/main/config.json from cache (same path as above)
(Model config BertConfig repeated, identical to the block above)

loading file https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt from cache at /home/gnom/.cache/huggingface/transformers/0c57cb5172c1ac6c957d00597dc43c1b8b2a2cb44729a590fd0112612221f746.9a4f439638381be22bb9f116542bdaa5e1d8bb7a09a5f8ef32d9662deaf655a1
loading file https://huggingface.co/bert-base-german-cased/resolve/main/tokenizer.json from cache at /home/gnom/.cache/huggingface/transformers/a60c7a72be0cad1606096bd88aa22980c826a10b2482a850cfd50db5ceb3f01f.a1d3fa1580dc5318a8ad0477d679498575453bbe1ef5751aaca7fec558055f77
loading file https://huggingface.co/bert-base-german-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-german-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-german-cased/resolve/main/tokenizer_config.json from cache at /home/gnom/.cache/huggingface/transformers/2529d64cc99a539f2103ad09cea0d6459e181d8dc168fe06b32d25ddc68e6d3b.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading configuration file https://huggingface.co/bert-base-german-cased/resolve/main/config.json from cache (same path as above)
(Model config BertConfig repeated, identical to the block above)

Loading cached processed dataset at /home/gnom/.cache/huggingface/datasets/text/default-99b4571e1b72576f/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8/cache-0641023e369c7b29.arrow
Loading cached processed dataset at /home/gnom/.cache/huggingface/datasets/text/default-99b4571e1b72576f/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8/cache-a67168bf3c43cfc2.arrow
loading weights file https://huggingface.co/bert-base-german-cased/resolve/main/tf_model.h5 from cache at /home/gnom/.cache/huggingface/transformers/d59684a4900c4a328fb7083782550d4e554d3bb3e7aac2998cfa97815398e9b2.b7063093ea41ccd755fa8ced83a58867b5f5890f6bba5538ffae78a7b12e58f0.h5
2022-06-21 01:13:43.643185: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-german-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
Epoch 1/3
2022-06-21 01:14:04.047178: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 96.00MiB (rounded to 100663296)requested by op tf_bert_for_masked_lm/bert/encoder/layer_._1/attention/self/MatMul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2022-06-21 01:14:04.047311: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2022-06-21 01:14:04.047357: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256):  Total Chunks: 100, Chunks in use: 98. 25.0KiB allocated for chunks. 24.5KiB in use in bin. 505B client-requested in use in bin.
2022-06-21 01:14:04.047387: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512):  Total Chunks: 1, Chunks in use: 1. 512B allocated for chunks. 512B in use in bin. 296B client-requested in use in bin.
2022-06-21 01:14:04.047418: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1024):         Total Chunks: 2, Chunks in use: 1. 2.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2022-06-21 01:14:04.047450: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2048):         Total Chunks: 339, Chunks in use: 339. 1016.2KiB allocated for chunks. 1016.2KiB in use in bin. 1015.0KiB client-requested in use in bin.
2022-06-21 01:14:04.047482: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4096):         Total Chunks: 6, Chunks in use: 6. 32.2KiB allocated for chunks. 32.2KiB in use in bin. 26.7KiB client-requested in use in bin.
2022-06-21 01:14:04.047513: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8192):         Total Chunks: 36, Chunks in use: 35. 436.0KiB allocated for chunks. 423.0KiB in use in bin. 414.0KiB client-requested in use in bin.
2022-06-21 01:14:04.047543: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16384):        Total Chunks: 11, Chunks in use: 10. 194.8KiB allocated for chunks. 178.8KiB in use in bin. 152.0KiB client-requested in use in bin.
2022-06-21 01:14:04.047572: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (32768):        Total Chunks: 6, Chunks in use: 5. 210.8KiB allocated for chunks. 162.8KiB in use in bin. 160.0KiB client-requested in use in bin.
2022-06-21 01:14:04.047599: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (65536):        Total Chunks: 3, Chunks in use: 3. 351.8KiB allocated for chunks. 351.8KiB in use in bin. 351.6KiB client-requested in use in bin.
2022-06-21 01:14:04.047621: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (131072):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-21 01:14:04.047648: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (262144):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-21 01:14:04.047673: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (524288):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-21 01:14:04.047702: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1048576):      Total Chunks: 3, Chunks in use: 2. 4.13MiB allocated for chunks. 3.00MiB in use in bin. 3.00MiB client-requested in use in bin.
2022-06-21 01:14:04.047735: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2097152):      Total Chunks: 147, Chunks in use: 146. 330.60MiB allocated for chunks. 328.50MiB in use in bin. 327.75MiB client-requested in use in bin.
2022-06-21 01:14:04.047766: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4194304):      Total Chunks: 2, Chunks in use: 2. 8.41MiB allocated for chunks. 8.41MiB in use in bin. 4.50MiB client-requested in use in bin.
2022-06-21 01:14:04.047797: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8388608):      Total Chunks: 91, Chunks in use: 90. 876.31MiB allocated for chunks. 864.31MiB in use in bin. 864.00MiB client-requested in use in bin.
2022-06-21 01:14:04.047828: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16777216):     Total Chunks: 1, Chunks in use: 1. 18.88MiB allocated for chunks. 18.88MiB in use in bin. 12.00MiB client-requested in use in bin.
2022-06-21 01:14:04.047859: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (33554432):     Total Chunks: 4, Chunks in use: 4. 192.00MiB allocated for chunks. 192.00MiB in use in bin. 192.00MiB client-requested in use in bin.
2022-06-21 01:14:04.047891: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (67108864):     Total Chunks: 6, Chunks in use: 6. 551.67MiB allocated for chunks. 551.67MiB in use in bin. 551.67MiB client-requested in use in bin.
2022-06-21 01:14:04.047922: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (134217728):    Total Chunks: 1, Chunks in use: 1. 130.78MiB allocated for chunks. 130.78MiB in use in bin. 87.89MiB client-requested in use in bin.
2022-06-21 01:14:04.047948: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (268435456):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-21 01:14:04.047977: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] Bin for 96.00MiB was 64.00MiB, Chunk State: 
2022-06-21 01:14:04.047999: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Next region of size 2217738240
2022-06-21 01:14:04.048026: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at
(THIS GETS REPEATED A LOT)
2022-06-21 01:14:04.056680: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Free  at 7f3fd441d900 of size 12582912 next 759
2022-06-21 01:14:04.056684: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f3fd501d900 of size 19801856 next 18446744073709551615
2022-06-21 01:14:04.056688: I tensorflow/core/common_runtime/bfc_allocator.cc:1071]      Summary of in-use Chunks by size: 
2022-06-21 01:14:04.056696: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 98 Chunks of size 256 totalling 24.5KiB
2022-06-21 01:14:04.056701: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 512 totalling 512B
2022-06-21 01:14:04.056705: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 1280 totalling 1.2KiB
2022-06-21 01:14:04.056710: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 2048 totalling 4.0KiB
2022-06-21 01:14:04.056891: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 335 Chunks of size 3072 totalling 1005.0KiB
2022-06-21 01:14:04.056896: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 3584 totalling 3.5KiB
2022-06-21 01:14:04.056900: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 3840 totalling 3.8KiB
2022-06-21 01:14:04.056904: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 4608 totalling 4.5KiB
2022-06-21 01:14:04.056909: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 5120 totalling 5.0KiB
2022-06-21 01:14:04.056913: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 5376 totalling 5.2KiB
2022-06-21 01:14:04.056917: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 5632 totalling 5.5KiB
2022-06-21 01:14:04.056922: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 6144 totalling 12.0KiB
2022-06-21 01:14:04.056926: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 9216 totalling 9.0KiB
2022-06-21 01:14:04.056931: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 32 Chunks of size 12288 totalling 384.0KiB
2022-06-21 01:14:04.056935: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 15360 totalling 30.0KiB
2022-06-21 01:14:04.056940: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 7 Chunks of size 16384 totalling 112.0KiB
2022-06-21 01:14:04.056944: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 21504 totalling 42.0KiB
2022-06-21 01:14:04.056949: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 25344 totalling 24.8KiB
2022-06-21 01:14:04.056954: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 4 Chunks of size 32768 totalling 128.0KiB
2022-06-21 01:14:04.056958: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 35584 totalling 34.8KiB
2022-06-21 01:14:04.056963: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 3 Chunks of size 120064 totalling 351.8KiB
2022-06-21 01:14:04.056967: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 1572864 totalling 3.00MiB
2022-06-21 01:14:04.056972: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 146 Chunks of size 2359296 totalling 328.50MiB
2022-06-21 01:14:04.056976: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 4306944 totalling 4.11MiB
2022-06-21 01:14:04.056981: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 4506624 totalling 4.30MiB
2022-06-21 01:14:04.056985: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 71 Chunks of size 9437184 totalling 639.00MiB
2022-06-21 01:14:04.056990: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 9765888 totalling 9.31MiB
2022-06-21 01:14:04.056994: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 18 Chunks of size 12582912 totalling 216.00MiB
2022-06-21 01:14:04.056999: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 19801856 totalling 18.88MiB
2022-06-21 01:14:04.057004: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 4 Chunks of size 50331648 totalling 192.00MiB
2022-06-21 01:14:04.057008: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 3 Chunks of size 92160000 totalling 263.67MiB
2022-06-21 01:14:04.057013: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 3 Chunks of size 100663296 totalling 288.00MiB
2022-06-21 01:14:04.057017: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 137134080 totalling 130.78MiB
2022-06-21 01:14:04.057022: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 2.05GiB
2022-06-21 01:14:04.057026: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 2217738240 memory_limit_: 2217738240 available bytes: 0 curr_region_allocation_bytes_: 4435476480
2022-06-21 01:14:04.057033: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats: 
Limit:                      2217738240
InUse:                      2201690880
MaxInUse:                   2214273792
NumAllocs:                        2677
MaxAllocSize:                144340992
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-06-21 01:14:04.057235: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************x***********************************************************************
2022-06-21 01:14:04.057258: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at matmul_op_impl.h:681 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[8,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/gnom/PycharmProjects/bachelor/transformers-mlm/examples/tensorflow/language-modeling/run_mlm.py", line 588, in <module>
    main()
  File "/home/gnom/PycharmProjects/bachelor/transformers-mlm/examples/tensorflow/language-modeling/run_mlm.py", line 558, in main
    history = model.fit(
  File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

Detected at node 'tf_bert_for_masked_lm/bert/encoder/layer_._1/attention/self/MatMul' defined at (most recent call last):
    File "/home/gnom/PycharmProjects/bachelor/transformers-mlm/examples/tensorflow/language-modeling/run_mlm.py", line 588, in <module>
      main()
    File "/home/gnom/PycharmProjects/bachelor/transformers-mlm/examples/tensorflow/language-modeling/run_mlm.py", line 558, in main
      history = model.fit(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/training.py", line 1384, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/training.py", line 1021, in train_function
      return step_function(self, iterator)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/training.py", line 1010, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 996, in train_step
      y_pred = self(x, training=True)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1314, in run_call_with_unpacked_inputs
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 1330, in call
      outputs = self.bert(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1314, in run_call_with_unpacked_inputs
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 869, in call
      encoder_outputs = self.encoder(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 554, in call
      for i, layer_module in enumerate(self.layer):
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 560, in call
      layer_outputs = layer_module(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 470, in call
      self_attention_outputs = self.attention(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 386, in call
      self_outputs = self.self_attention(
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gnom/builds/anaconda3/envs/tensorflow-mlm/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 316, in call
      attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
Node: 'tf_bert_for_masked_lm/bert/encoder/layer_._1/attention/self/MatMul'
OOM when allocating tensor with shape[8,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node tf_bert_for_masked_lm/bert/encoder/layer_._1/attention/self/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_23722]

We can see that training runs on the GPU:0 device but quickly runs out of memory. The failing allocation has shape [8,12,512,512], which I read as the default batch size of 8, the model's 12 attention heads, and a sequence length of 512 (matching the config above).
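As a sanity check (my own back-of-the-envelope arithmetic, not from the log), that shape in float32 accounts exactly for the requested 96 MiB, and the allocator limit matches the 2115 MB the device reported at startup:

    # back-of-the-envelope check of the numbers in the log above
    print(8 * 12 * 512 * 512 * 4)  # 100663296 bytes = 96.00 MiB, the failed request
    print(2217738240 / 2**20)      # 2115.0 MiB, the BFC allocator limit

The hint at the end of the traceback tells me: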

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn’t available when running in Eager mode.
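If I understand correctly, report_tensor_allocations_upon_oom is a field of the TF1-style RunOptions proto and is passed per session.run() call, so (as a sketch, assuming I read the TF API right) it would look like this, with no obvious equivalent for Keras model.fit:

    import tensorflow as tf

    # TF1-style sketch: only usable with a v1 Session, not with model.fit
    run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)
    # sess.run(fetches, options=run_options)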

So apparently there is a debugging option, but as far as I can tell it cannot be passed through model.fit (see the sketch above), and I am not sure it would help anyway. No other processes occupy my GPU RAM at the time. This is my second attempt to train the model on a single file: PyTorch yielded similar results, which is why I turned to TensorFlow, but since the errors are comparable, the problem seems to be on my side. How should I run the MLM script to fine-tune a pretrained model on a single text file? One idea I had is sketched below.
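Unless I am missing something, my next step would be to shrink the batch size and the sequence length. A sketch of what I would try, assuming the flags --per_device_train_batch_size and --max_seq_length from the script's argument parser apply here:

    # (the log also suggests TF_GPU_ALLOCATOR=cuda_malloc_async for fragmentation,
    #  but this looks like a genuine out-of-memory rather than fragmentation)
    python run_mlm.py \
      --model_name_or_path="bert-base-german-cased" \
      --output_dir="tf-out" \
      --train="tf-in/plenar.txt" \
      --per_device_train_batch_size=2 \
      --max_seq_length=128

I am not sure whether this addresses the root cause or merely hides it, so any pointers are appreciated.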