Help with tfds.dataset_builders.store_as_tfds_dataset


I created an image encoder. I want to save the latent vectors it produces as a TFDS dataset so I can reuse them later. The basic idea is:

  1. image → latent vector → save on disk
  2. load latent vector → use vector in other models

I don’t know whether creating a dataset builder class will work, so I want to try tfds.dataset_builders.store_as_tfds_dataset first.

Here’s the code example:

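import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow_datasets import features

# preprocess_image outputs an image of shape (128, 128, 3)

(imagenet_ds,) = tfds.load("imagenette/160px-v2", split=["all"])
image_ds = imagenet_ds.map(lambda x: preprocess_image(x["image"])).batch(
    256, drop_remainder=False
)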
def ds_generator():
    for i in image_ds.as_numpy_iterator():
        x, *_ = encoder_apply(encoder_state, i, mask_ratio=0.0, rngs=rngs)
        yield tf.convert_to_tensor(x, dtype=tf.bfloat16)

image_lantent_ds = tf.data.Dataset.from_generator(
    ds_generator,
    output_signature=(tf.TensorSpec(shape=(None, 256, 128), dtype=tf.bfloat16)),
)

image_builder = tfds.dataset_builders.store_as_tfds_dataset(
    name="image lantent",
    features=features.Sequence(
        {"latent": features.Tensor(shape=(256, 128), dtype=tf.bfloat16)}, length=256
    ),
    description="imagenet/v2 MAE latent vectors",
    release_notes={"0.0.1": "Uses mae-imagenette-918_20240606-0236 MAE checkpoint"},
    split_datasets={"lantent": image_lantent_ds},
)

When I run this code, I get the following error:

TypeError: Failed to encode example:
[...a giant array...]
unhashable type: 'numpy.ndarray'

I think my features parameter is wrong, but I could not find many examples online. The documentation is very vague.
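
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the generator yields whole batches of shape (B, 256, 128) as bare tensors, while the feature spec names a "latent" key whose tensor has shape (256, 128). If TFDS expects each example to be a dict keyed by feature name, the data would have to be split into single examples and wrapped before it is stored. Below is a minimal sketch of that reshaping in plain Python. The names unbatch_to_examples and fake_batches are hypothetical and exist only for illustration, and the nested lists stand in for the real encoder outputs.

```python
from typing import Any, Dict, Iterable, Iterator


def unbatch_to_examples(
    batches: Iterable[Any], key: str = "latent"
) -> Iterator[Dict[str, Any]]:
    """Yield one {key: row} dict for every row of every batch.

    Iterating a NumPy array walks its first axis, so a batch of shape
    (B, 256, 128) would yield B rows of shape (256, 128).
    """
    for batch in batches:
        for row in batch:
            yield {key: row}


# Stand-in data (assumption): two "batches" holding 2 rows and 1 row.
# Nested lists replace the real (B, 256, 128) encoder outputs.
fake_batches = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]

examples = list(unbatch_to_examples(fake_batches))
print(len(examples))
print(examples[0])
```

If this turns out to be the issue, the same idea could be applied inside ds_generator by yielding {"latent": row} for each row of x, with the output_signature changed to a matching dict of per-example TensorSpecs. Whether that alone resolves the TypeError is untested here.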