Can't get datasets.Dataset.to_tf_dataset() to produce tensors with the right shape?!

Richard_Belew · March 15, 2024, 12:34am

I’m following this doc:

and trying to convert it for use with this model:

inputs = keras.Input(shape=(), dtype="string")
x = SPtokenizer(inputs)
x = layers.Embedding(input_dim=SPtokenizer.vocabulary_size(), output_dim=embed_size,name="embed")(x)
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)
model = keras.Model(inputs,predictions)

I get a (Hugging Face) datasets.Dataset like this:

LH_dataset_HF = datasets.load_dataset("nguha/legalbench", 'learned_hands_torts')

I create a Keras/TF-friendly dataset like this:

trainDS = renameDS['train'].to_tf_dataset(
    columns=["text"],
    label_cols="answer",
    batch_size=batch_size,
    shuffle=False,
)

the resulting trainDS looks like this:

_PrefetchDataset: <_PrefetchDataset
element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None),
TensorSpec(shape=(None,), dtype=tf.string, name=None))>

When I pass it to model.fit(trainDS), I get this error:

ValueError: Arguments target and output must have the same rank (ndim). Received: target.shape=(None,), output.shape=(None, 32, 1)

I have also tried applying a map to make the labels integers:

def binaryLbl(txt,tlbl):
    if tlbl=='Yes':
        ilbl = 1
    else:
        ilbl = 0
    return txt,ilbl

trainDS2 = trainDS.map(binaryLbl)

it produces a trainDS2 that looks like this:

_MapDataset: <_MapDataset
element_spec=(TensorSpec(shape=(None,),dtype=tf.string, name=None),
TensorSpec(shape=(), dtype=tf.int32,name=None))>

and generates this error:

ValueError: Arguments `target` and `output` must have the same rank (ndim). Received: target.shape=(), output.shape=(None, 32, 1)

What can I do to make the dataset conform to the model’s expectations? Thanks for any help.

Kiran_Sai_Ramineni · March 28, 2024, 10:12am

Hi @Richard_Belew, I have gone through your model, and I can see that you are using SPtokenizer inside the model itself. I recommend tokenizing the text in a preprocessing function and converting the labels to integers (1 for "Yes", 0 otherwise) before passing the data to the model, like this:

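def preprocess_text(text, label):
    # Tokenize the raw strings into token ids before they reach the model.
    input_ids = SPtokenizer.tokenize(text)
    # Map "Yes" to 1 and every other answer to 0, element-wise over the batch.
    answer = tf.where(label == "Yes", 1, 0)
    return input_ids, answer

trainDS = trainDS.map(preprocess_text)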
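
To illustrate why tf.where is used for the labels here, below is a minimal sketch on a toy dataset. The strings and the names toy_ds and to_int_label are made up for illustration and are not from the original code. Because the comparison is element-wise, the labels keep their batch dimension instead of collapsing to a scalar.

```python
import tensorflow as tf

# Hypothetical stand-in for a batched dataset of (text, answer) string pairs.
texts = tf.constant(["first complaint", "second complaint", "third complaint"])
answers = tf.constant(["Yes", "No", "Yes"])
toy_ds = tf.data.Dataset.from_tensor_slices((texts, answers)).batch(2)

def to_int_label(text, label):
    # tf.equal compares element-wise, so the result has one entry per example.
    return text, tf.where(tf.equal(label, "Yes"), 1, 0)

int_ds = toy_ds.map(to_int_label)

# The label spec keeps the batch dimension: shape (None,), dtype int32.
label_spec = int_ds.element_spec[1]
first_labels = next(iter(int_ds))[1].numpy().tolist()
print(label_spec, first_labels)
```

The label spec stays at shape=(None,) with an integer dtype, whereas a plain Python if/else that returns a single int yields a scalar label with shape=().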

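This error occurs because the output shape of the final Dense layer does not match the shape of the labels: the model outputs (None, 32, 1), one prediction per token, while the labels have shape (None,).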
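
As a rough sketch of the mismatch, assume the model receives integer token ids of length 32 (matching the 32 in the error message). The values of vocab_size and embed_size are hypothetical. Pooling over the token axis with GlobalAveragePooling1D is one common way to get a single prediction per example; it is an assumption here and not necessarily the change made in the gist.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes, chosen only for illustration.
vocab_size = 100
embed_size = 16
seq_len = 32

inputs = keras.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(input_dim=vocab_size, output_dim=embed_size, name="embed")(inputs)

# Dense acts on the last axis only, so this gives one output per token.
per_token = layers.Dense(1, activation="sigmoid")(x)
print(per_token.shape)  # rank 3: (batch, tokens, 1)

# Averaging over the token axis first gives one output per example.
pooled = layers.GlobalAveragePooling1D()(x)
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(pooled)
print(predictions.shape)  # rank 2: (batch, 1)

model = keras.Model(inputs, predictions)
batch_out = model(tf.zeros((4, seq_len), dtype=tf.int32))
```

With the pooled output, each example produces a single value, which lines up with one integer label per example.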

I made a few changes to your code and was able to run it without any errors. Please refer to this gist for a working code example. Thank you.

Richard_Belew · March 28, 2024, 3:57pm

Fantastic @Kiran_Sai_Ramineni, thanks very much!
(I’ve responded further in the redundant post over here: Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset? - #7 by Richard_Belew)