tf.data.Dataset.cache(path) still uses memory even though a cache file path is given, in TF 2.5-2.8

Hi,
I came across a weird problem when reading TFRecord files from S3 through tf.data and caching them to a local path. Here is my reading code:

    import tensorflow as tf

    # batch_size, decode, and cache_file_path are defined elsewhere in my script
    filenames = ['s3://path1', 's3://path2']
    dataset = tf.data.TFRecordDataset(filenames, compression_type="GZIP")
    parsed_dataset = (
        dataset.batch(batch_size, num_parallel_calls=tf.data.AUTOTUNE)
        .map(decode, num_parallel_calls=tf.data.AUTOTUNE)
        .cache(cache_file_path)  # cache to a local file, not to memory
        .prefetch(tf.data.AUTOTUNE)
    )
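As a sanity check (just a suggestion on my side, the exact file names the cache writes are an assumption), one can look for cache artifacts on disk under the given prefix after the first epoch to confirm the file-based cache is actually being used rather than the in-memory one:

    # Hypothetical sanity check, not part of the training code: a file-based
    # cache should materialize files under the given path prefix after the
    # first full pass over the data.
    import glob
    print(glob.glob(cache_file_path + "*"))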

It’s very strange that cache() still consumes host memory, which eventually results in an OOM kill. Here is the memory usage I printed via a callback during training:

    2022-03-08T22:19:40.154191003Z ...Training: end of batch 15700; got log keys: ['loss', 'copc', 'auc']
    2022-03-08T22:19:40.159188560Z totalmemor: 59.958843GB
    2022-03-08T22:19:40.159223737Z availablememory: 8.418320GB
    2022-03-08T22:19:40.159250296Z usedmemory: 50.959393GB
    2022-03-08T22:19:40.159257814Z percentof used memory: 86.000000
    2022-03-08T22:19:40.159263710Z freememory:1.072124GB
    2022-03-08T22:19:47.752077011Z Tue Mar  8 22:19:47 UTC 2022	job-submitter:	job run error: signal: killed
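For context, the numbers above come from a memory-logging callback roughly like the sketch below (a minimal, psutil-based version; the class name and log format are placeholders, not my exact code). It is passed via the callbacks argument of model.fit.

    # Minimal sketch of a memory-logging Keras callback, assuming psutil is
    # installed. MemoryLogger and the printed labels are placeholders.
    import psutil
    import tensorflow as tf

    GB = 1024 ** 3

    class MemoryLogger(tf.keras.callbacks.Callback):
        def on_train_batch_end(self, batch, logs=None):
            mem = psutil.virtual_memory()
            print(f"total memory: {mem.total / GB:.6f}GB")
            print(f"available memory: {mem.available / GB:.6f}GB")
            print(f"used memory: {mem.used / GB:.6f}GB")
            print(f"percent of used memory: {mem.percent:.6f}")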

I have tested the code on TF 2.3, which has no such issue, but TF 2.5 and onwards hit this OOM. I am not sure whether this is a bug or a configuration problem. Could anyone help answer this or give some clues?
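In case someone wants to poke at this without access to my S3 data, here is a stripped-down sketch of the pipeline on synthetic in-memory data. The sizes, the trivial map, and the placeholder cache path are made up, and I have only hit the OOM with my real TFRecords, so treat this as a starting point rather than a confirmed repro:

    # Stripped-down pipeline sketch for reproduction attempts. The synthetic
    # data, sizes, and cache path are placeholders; swap in the real TFRecord
    # reading code to match the original setup.
    import numpy as np
    import tensorflow as tf

    cache_file_path = "/tmp/tf_cache_test"  # placeholder local path

    def make_dataset():
        data = np.random.rand(100_000, 256).astype(np.float32)
        ds = tf.data.Dataset.from_tensor_slices(data)
        return (
            ds.batch(1024, num_parallel_calls=tf.data.AUTOTUNE)
            .map(lambda x: x * 2.0, num_parallel_calls=tf.data.AUTOTUNE)  # stand-in for decode
            .cache(cache_file_path)
            .prefetch(tf.data.AUTOTUNE)
        )

    ds = make_dataset()
    for epoch in range(3):  # iterate a few epochs and watch process memory between them
        for _ in ds:
            pass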


Did you find the issue? The same happens on 2.9.

I wonder if this is related: