Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?

I am trying to test a simple model using a SentencePieceTokenizer layer
over a (HuggingFace) dataset. But I seem unable to get the shape of
the dataset’s target to agree with the model’s output. All code
is available here: https://github.com/rbelew/rikHak/blob/master/tst_240311.py

First, I get the dataset from HF and convert it to the TensorFlow dataset
that keras.Model.fit() expects, using:

trainDS = LH_dataset_HF['train'].to_tf_dataset(
    columns=["text"],
    label_cols=["answer"],
    batch_size=batch_size,
    shuffle=False,
)
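
(To double-check what to_tf_dataset returns, the element spec can be printed; this check is not in the linked script, and the spec shown in the comment is my assumption based on the output later in this post.)

	print(trainDS.element_spec)
	# expecting something like:
	# (TensorSpec(shape=(None,), dtype=tf.string, name=None),
	#  TensorSpec(shape=(None,), dtype=tf.string, name=None))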

I can demonstrate the data is loaded and the SPTokenizer is working as
expected:

trainTF shape=(6, 3) answer shape=(6,)
all answers=[b'Yes' b'Yes' b'Yes' b'No' b'No' b'No']
echo1:  b'My roommate and I were feeling unwell in our basement apartment for a long ...

My model begins with a keras_nlp.tokenizers.SentencePieceTokenizer
layer, has one embedding layer, and then makes a prediction:

	Model: "functional_1"
	┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
	┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
	┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
	│ input_layer (InputLayer)        │ (None)                 │             0 │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ sentence_piece_tokenizer        │ (None, 32)             │             0 │
	│ (SentencePieceTokenizer)        │                        │               │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ embed (Embedding)               │ (None, 32, 100)        │     1,000,000 │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ predictions (Dense)             │ (None, 32, 1)          │           101 │
	└─────────────────────────────────┴────────────────────────┴───────────────┘
	Total params: 1,000,101 (3.82 MB)
	Trainable params: 1,000,101 (3.82 MB)
	Non-trainable params: 0 (0.00 B)
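
For reference, a model matching this summary can be built roughly like this; the SentencePiece proto path, the 10,000-entry vocabulary (which reproduces the 1,000,000 embedding parameters), and the sigmoid activation are placeholder assumptions, not necessarily the exact values in the linked script:

	import keras
	import keras_nlp
	from keras import layers

	# Sketch of a model matching the summary above.
	inputs = keras.Input(shape=(), dtype="string", name="input_layer")
	tokens = keras_nlp.tokenizers.SentencePieceTokenizer(
	    proto="spm.model",     # trained SentencePiece model file (assumed path)
	    sequence_length=32,    # pad/truncate to 32 tokens -> (None, 32)
	)(inputs)
	embedded = layers.Embedding(input_dim=10_000, output_dim=100, name="embed")(tokens)
	# Dense is applied to every token position, so the output is (None, 32, 1)
	outputs = layers.Dense(1, activation="sigmoid", name="predictions")(embedded)
	model = keras.Model(inputs, outputs, name="functional_1")

Note that the final Dense layer acts on each of the 32 token positions, which is why the output shape is (None, 32, 1) rather than one prediction per example.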

But when I call model.fit(trainDS), I get

ValueError: Arguments `target` and `output` must have the same rank (ndim). Received: target.shape=(None,), output.shape=(None, 32, 1)

Questions

  1. Why does target.shape=(None,)?

  2. Is the model lacking a layer mapping the predictions to the answer
    strings? And/or, should the answer column be mapped to integers
    instead of strings?

Package versions

torch=2.1.0.post100
torchtext=0.16.1
tensorflow=2.15.0
tensorflow_text=2.15.0
keras=3.0.5
keras_nlp=0.7.0

I did an experiment using map to transform the string labels to integers:

def binaryLbl(txt, tlbl):
    return txt, 1 if tlbl == 'Yes' else 0

trainDS2 = trainDS.map(binaryLbl)

but trainDS2.element_spec now shows the label with an empty (scalar) shape?!

(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))

Hi @Richard_Belew, this error occurs because the output shape of the last Dense layer and the shape of the labels do not match.

Here the labels have shape (1, 6) but the output layer has shape (None, 32, 1).

I have made a few changes to your code and was able to run it without any error. Please refer to this gist for a working code example. Thank you.


Fantastic @Kiran_Sai_Ramineni, thanks very much for your help on both of these!

The critical bits I’ve learned from your gist are:

  • using tf.where(label == "Yes", 1, 0) causes the label tensor to have “the shape of x, y, and condition broadcast together.”

  • then you do a second map:
    trainDS = trainDS.map(lambda text, label: (text, tf.expand_dims(label, -1)))
    to change the label tensor shape. Is there a reason this needs to be done separately from the first preprocessing map?

  • your model has extra layers, especially layers.GlobalAveragePooling1D() (sandwiched between two Dropout layers), which is critical to reducing the embedding output to a single prediction per example (see the sketch below).
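
Putting those pieces together, here is a minimal sketch of the label preprocessing plus the pooled model head, reusing `inputs` and `embedded` from the earlier sketch; the dropout rates, optimizer, and loss are assumptions, and as far as I can tell the two label maps could also be fused into one as shown:

	import tensorflow as tf
	import keras
	from keras import layers

	def encode_label(text, label):
	    # "Yes"/"No" -> 1/0, then add a trailing axis so the labels have
	    # shape (None, 1), matching the model's per-example output.
	    return text, tf.expand_dims(tf.where(label == "Yes", 1, 0), -1)

	trainDS = trainDS.map(encode_label)

	# Pool the (None, 32, 100) token embeddings down to (None, 100) so the
	# final Dense emits one prediction per example instead of one per token.
	x = layers.Dropout(0.2)(embedded)
	x = layers.GlobalAveragePooling1D()(x)
	x = layers.Dropout(0.2)(x)
	outputs = layers.Dense(1, activation="sigmoid", name="predictions")(x)

	model = keras.Model(inputs, outputs)   # `inputs`/`embedded` from the sketch above
	model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
	model.fit(trainDS)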