TypeError("dataset length is unknown.") while reading tfrecord

I'm trying to read a TFRecord file and use it to train a model with a .fit call, but I'm getting this error:

TypeError("dataset length is unknown.")

Here’s my tfrecord code:

FEATURE_DESCRIPTION = {
    'lr': tf.io.FixedLenFeature([], tf.string),
    'hr': tf.io.FixedLenFeature([], tf.string),
}

def parser(example_proto):
    parsed_example = tf.io.parse_single_example(example_proto, FEATURE_DESCRIPTION)
    lr = tf.io.decode_jpeg(parsed_example['lr'])
    hr = tf.io.decode_jpeg(parsed_example['hr'])
    return lr, hr

train_data = tf.data.TFRecordDataset(TFRECORD_PATH)\
                    .map(parser)\
                    .batch(BATCH_SIZE, drop_remainder=True)\
                    .prefetch(tf.data.AUTOTUNE)

len(train_data) raises TypeError("dataset length is unknown.") because the dataset's cardinality is -2 (tf.data.experimental.UNKNOWN_CARDINALITY): since the source is a file, tf.data cannot determine the total number of samples without iterating over it.
Is there any way to tell train_data how many samples/batches there are?
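For context, the unknown cardinality can be reproduced without a file at all; a minimal sketch using a filtered dataset (which tf.data also cannot size statically) shows the same -2 value and the same TypeError:

```python
import tensorflow as tf

# Any dataset whose element count tf.data cannot infer statically
# (file sources, filter(), from_generator(), ...) reports
# UNKNOWN_CARDINALITY, and len() on it raises TypeError.
ds = tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)

card = int(tf.data.experimental.cardinality(ds))
print(card)  # -2 == tf.data.experimental.UNKNOWN_CARDINALITY

try:
    len(ds)
except TypeError as err:
    print("len() failed:", err)
```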

Check this thread:

https://tensorflow-prod.ospodiscourse.com/t/typeerror-dataset-length-is-unknown-tensorflow/


The solution is to set the cardinality manually with tf.data.experimental.assert_cardinality:

# print(len(train_data)) gives error
train_data = train_data.apply(tf.data.experimental.assert_cardinality(NUM_BATCHES))
print(len(train_data)) # NUM_BATCHES