tf.data.Dataset.cache(path) still uses memory even though a cache file path is given (TF 2.5-2.8)

Yusen_Zhan · March 10, 2022, 3:09am

Hi,
I ran into a strange problem when reading TFRecord files from S3 through tf.data and caching them to a local path. Here is my reading code:

    filenames = ['s3s:path1', 's3s:path2']
    dataset = tf.data.TFRecordDataset(filenames, compression_type="GZIP")
    parsed_dataset = (
        dataset.batch(batch_size, num_parallel_calls=tf.data.AUTOTUNE)
        .map(decode, num_parallel_calls=tf.data.AUTOTUNE)
        .cache(cache_file_path)
        .prefetch(tf.data.AUTOTUNE)
    )

Strangely, cache() still consumes host memory even though a cache file path is given, and the job eventually runs out of memory (OOM). Here is the memory usage I printed from a callback during training:

2022-03-08T22:19:40.154191003Z ...Training: end of batch 15700; got log keys: ['loss', 'copc', 'auc']
2022-03-08T22:19:40.159188560Z totalmemor: 59.958843GB
2022-03-08T22:19:40.159223737Z availablememory: 8.418320GB
2022-03-08T22:19:40.159250296Z usedmemory: 50.959393GB
2022-03-08T22:19:40.159257814Z percentof used memory: 86.000000
2022-03-08T22:19:40.159263710Z freememory:1.072124GB
2022-03-08T22:19:47.752077011Z Tue Mar  8 22:19:47 UTC 2022	job-submitter:	job run error: signal: killed
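For anyone who wants to collect similar figures, here is a minimal sketch of a memory report. It is not the callback used for the log above (that code was not posted). It assumes a Linux host and reads `/proc/meminfo`; the names `parse_meminfo` and `memory_report` are made up for illustration, and "used" is defined here simply as total minus available, which may differ from what other tools report.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: bytes}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue  # skip blank or malformed lines
        value = int(parts[0])
        # Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def memory_report(text=None):
    """Return a one-line summary of system memory, in GB (1 GB = 1024**3 bytes)."""
    if text is None:
        # Assumption: Linux host exposing /proc/meminfo.
        with open("/proc/meminfo") as f:
            text = f.read()
    info = parse_meminfo(text)
    gb = 1024 ** 3
    total = info["MemTotal"]
    available = info["MemAvailable"]
    used = total - available  # illustrative definition of "used"
    return "total: %.2f GB, available: %.2f GB, used: %.2f GB (%.0f%%)" % (
        total / gb, available / gb, used / gb, 100.0 * used / total
    )
```

During training, `print(memory_report())` could be called every N batches, for example from the `on_batch_end` hook of a `tf.keras.callbacks.LambdaCallback`.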

I tested the same code on TF 2.3, which does not have this issue, but TF 2.5 and later run out of memory. I am not sure whether this is a bug or a configuration problem. Could anyone help answer this or give some clues?
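To compare TensorFlow versions, a small self-contained script may be easier to share than the full training job. The sketch below is not the original pipeline: it swaps the S3 TFRecord source for synthetic data (`tf.data.Dataset.range`), the helper names `peak_rss_mb` and `run_cached_pipeline` are hypothetical, and it reports peak RSS via the standard-library `resource` module, which is available on Linux and macOS only. It only measures memory with a file-backed cache; it does not fix anything.

```python
import os
import resource
import tempfile

import tensorflow as tf

# tf.data.experimental.AUTOTUNE is used so the script also runs on TF 2.3.
AUTOTUNE = tf.data.experimental.AUTOTUNE


def peak_rss_mb():
    # ru_maxrss is the peak resident set size: kB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_cached_pipeline(cache_path, num_elements=1000, batch_size=10):
    # Synthetic stand-in for the TFRecord source: one float vector per element.
    dataset = (
        tf.data.Dataset.range(num_elements)
        .map(lambda i: tf.fill([256], tf.cast(i, tf.float32)),
             num_parallel_calls=AUTOTUNE)
        .batch(batch_size)
        .cache(cache_path)  # file-backed cache, as in the original post
        .prefetch(AUTOTUNE)
    )
    batches = 0
    for _ in dataset:  # one full pass writes the cache files
        batches += 1
    return batches


with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "cache")
    before = peak_rss_mb()
    batches = run_cached_pipeline(cache_path)
    after = peak_rss_mb()
    cache_files = sorted(os.listdir(tmpdir))
    print(f"TF {tf.__version__}: {batches} batches, "
          f"peak RSS {before:.0f} -> {after:.0f} (MB on Linux), "
          f"cache files: {cache_files}")
```

Running this under each TensorFlow version with a larger `num_elements` should show whether peak RSS grows with the dataset size even though the cache is written to disk.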

Roshan · June 9, 2022, 2:23am

Did you find the cause? The same thing happens on 2.9.

ctargon · January 5, 2023, 7:11pm

I wonder if this is related:

github.com/tensorflow/tensorflow

Memory Leak with Tensorflow Experimental Save

opened 04:03AM - 20 May 22 UTC

Jesse-Kerr

stat:awaiting tensorflower comp:data type:performance TF 2.8

Issue Type: Performance
Source: source
Tensorflow Version: 2.8
Custom Code: Yes
OS Platform and Distribution: No response
Mobile device: No response
Python version: 3.9
Bazel version: No response
GCC/Compiler version: No response
CUDA/cuDNN version: No response
GPU model and memory: No response

Current Behaviour?

I have a loop where I am creating tensorflow datasets and then saving to directories for later use using tf.data.experimental.save. I found that as the loop progresses, the memory being used greatly increases, until the process eventually crashes.

In this first example below, I do not save the file, and the memory stays at the same level throughout. However, if I save each dataset, the memory used increases each time.

Adding tf.keras.backend.clear_session() after each save appears to slow down the memory growth but doesn't fully stop it.

Thank you in advance for any help.

Standalone code to reproduce the issue

```python
import tensorflow as tf
import numpy as np
from humanize import naturalsize
import psutil

# First example, no memory increase
for i in range(10000):
    if i in range(1, 10001, 100):
        print(naturalsize(psutil.Process().memory_info().rss))
    data = tf.data.Dataset.from_tensors(np.array([0, 1, 2]))

# Second example, memory increase
for i in range(10000):
    if not i % 100:
        print(naturalsize(psutil.Process().memory_info().rss))
    data = tf.data.Dataset.from_tensors(np.array([0, 1, 2]))
    tf.data.experimental.save(data, path='~/Desktop/1')
# Grows by about 5 MB each 100 runs.
```

Relevant log output: No response