TFDS Error - Normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization

I’m using tfds to create a custom text dataset for classification with the BigBird NLP package. I’m receiving the message "normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization" and I’m unable to find any information on how to resolve it. I have included the log output below. BigBird completes the classification, but the accuracy is 0.0, since no data from the custom tfds dataset appears to be flowing to the BigBird classifier in the correct format.

My GPUs are 0 (0 is the first GPU, so one is assigned).
INFO[build.py]: Loading dataset from path: /mmfs1/home/0156fieldsj/tensorflow_datasets/my_dataset/my_dataset.py
INFO[dataset_info.py]: Load dataset info from /mmfs1/home/0156fieldsj/tensorflow_datasets/my_dataset/1.0.0
INFO[build.py]: download_and_prepare for dataset my_dataset/1.0.0…
INFO[dataset_builder.py]: Reusing dataset my_dataset (/mmfs1/home/0156fieldsj/tensorflow_datasets/my_dataset/1.0.0)
INFO[build.py]: Dataset generation complete…

tfds.core.DatasetInfo(
name='my_dataset',
full_name='my_dataset/1.0.0',
description="""
Description is formatted as markdown.

It should also contain any processing which has been applied (if any),
(e.g. corrupted example skipped, images cropped,...):
""",
homepage='https://www.tensorflow.org/datasets/catalog/my_dataset',
data_path='/mmfs1/home/0156fieldsj/tensorflow_datasets/my_dataset/1.0.0',
download_size=Unknown size,
dataset_size=379.77 KiB,
features=FeaturesDict({
    'essay': Text(shape=(), dtype=tf.string),
    'status': tf.int32,
}),
supervised_keys=None,
disable_shuffling=False,
splits={
    'test': <SplitInfo num_examples=30, num_shards=1>,
    'train': <SplitInfo num_examples=69, num_shards=1>,
},
citation="""""",
)

normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.

0%| | 0/199 [00:00<?, ?it/s]
42%|████▏ | 84/199 [00:00<00:00, 832.21it/s]
100%|██████████| 199/199 [00:00<00:00, 1123.67it/s]

0%| | 0/2000 [00:00<?, ?it/s]
0%| | 0/2000 [00:00<?, ?it/s]

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loss = 0.0 Accuracy = 0.0
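
The zeroed progress bars (0it) together with Loss = 0.0 and Accuracy = 0.0 suggest that no examples are reaching the training loop at all. A quick sanity check, independent of BigBird (a minimal sketch; the dataset and split names come from the DatasetInfo above):

import tensorflow_datasets as tfds

# Load the train split and confirm that examples actually come out.
ds = tfds.load('my_dataset', split='train')
print(ds.cardinality())  # the DatasetInfo above says 69 train examples
for example in ds.take(1):
    print(example['essay'].numpy()[:80], example['status'].numpy())

One thing to note from the DatasetInfo: supervised_keys=None. If the BigBird input pipeline loads the dataset with as_supervised=True (which yields (text, label) tuples), that requires supervised_keys to be set, e.g. supervised_keys=('essay', 'status') in the builder's _info().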

I now believe that the issue is related to the text that I am importing from a CSV to create the custom tfds dataset. Should I be importing it as a text file instead? Do I need to encode it with tfds.features.Text or another method?
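
For what it's worth, a CSV source should be fine on its own: a tfds.features.Text feature only needs a plain Python str, and tfds handles the encoding when the example is serialized. A minimal sketch of what _generate_examples might look like (hypothetical; the column names 'essay' and 'status' are assumptions based on the FeaturesDict above):

import csv

# Hypothetical sketch: read rows straight from the CSV and yield plain
# Python values; tfds serializes them against the FeaturesDict.
def _generate_examples(self, csv_path):
    with open(csv_path, newline='', encoding='utf-8') as f:
        for i, row in enumerate(csv.DictReader(f)):
            yield i, {
                'essay': row['essay'],         # a str is enough for Text()
                'status': int(row['status']),  # int for tf.int32
            }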

I now believe this is related to the path issue described here → ModuleNotFoundError: No module named 'tensorflow_datasets' · Issue #1544 · tensorflow/datasets · GitHub

tfds was pointing to the my_dataset files in my home directory and not to the ones in my virtual environment env3. I haven’t updated create_new_datasets.py as suggested in issue #1544, because uninstalling and re-installing the tensorflow_datasets package in both my home and env3 environments resolved the issue to the point where I can now run BigBird.
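
To confirm which copy Python is actually importing (useful when the package is installed both in the home directory and in a virtualenv), a check like this works:

import tensorflow_datasets as tfds

# Prints the path of the module actually imported; if it points at the
# home-directory install instead of env3, the wrong copy is still winning.
print(tfds.__file__)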

Please close this issue. Thank you.