Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?

I am trying to test a simple model using a SentencePieceTokenizer layer
over a (HuggingFace) dataset. But I seem unable to get the shape of
the dataset’s target to agree with the model’s output. All code
is available here: https://github.com/rbelew/rikHak/blob/master/tst_240311.py

First, I get the dataset from HF and convert it to the TensorFlow dataset
that keras.Model.fit() expects, using:

trainDS = LH_dataset_HF['train'].to_tf_dataset(
    columns=["text"],
    label_cols=["answer"],
    batch_size=batch_size,
    shuffle=False,
)
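
(To double-check what to_tf_dataset returns, the element spec can be printed; this check is not in the linked script, and the spec shown in the comment is my assumption based on the output later in this post.)

	print(trainDS.element_spec)
	# expecting something like:
	# (TensorSpec(shape=(None,), dtype=tf.string, name=None),
	#  TensorSpec(shape=(None,), dtype=tf.string, name=None))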

I can demonstrate the data is loaded and the SPTokenizer is working as
expected:

trainTF shape=(6, 3) answer shape=(6,)
all answers=[b'Yes' b'Yes' b'Yes' b'No' b'No' b'No']
echo1:  b'My roommate and I were feeling unwell in our basement apartment for a long ...

My model begins with a keras_nlp.tokenizers.SentencePieceTokenizer
layer, has one embedding layer, and then makes a prediction:

	Model: "functional_1"
	┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
	┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
	┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
	│ input_layer (InputLayer)        │ (None)                 │             0 │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ sentence_piece_tokenizer        │ (None, 32)             │             0 │
	│ (SentencePieceTokenizer)        │                        │               │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ embed (Embedding)               │ (None, 32, 100)        │     1,000,000 │
	├─────────────────────────────────┼────────────────────────┼───────────────┤
	│ predictions (Dense)             │ (None, 32, 1)          │           101 │
	└─────────────────────────────────┴────────────────────────┴───────────────┘
	Total params: 1,000,101 (3.82 MB)
	Trainable params: 1,000,101 (3.82 MB)
	Non-trainable params: 0 (0.00 B)
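
For reference, a model matching this summary can be built roughly like this; the SentencePiece proto path, the 10,000-entry vocabulary (which reproduces the 1,000,000 embedding parameters), and the sigmoid activation are placeholder assumptions, not necessarily the exact values in the linked script:

	import keras
	import keras_nlp
	from keras import layers

	# Sketch of a model matching the summary above.
	inputs = keras.Input(shape=(), dtype="string", name="input_layer")
	tokens = keras_nlp.tokenizers.SentencePieceTokenizer(
	    proto="spm.model",     # trained SentencePiece model file (assumed path)
	    sequence_length=32,    # pad/truncate to 32 tokens -> (None, 32)
	)(inputs)
	embedded = layers.Embedding(input_dim=10_000, output_dim=100, name="embed")(tokens)
	# Dense is applied to every token position, so the output is (None, 32, 1)
	outputs = layers.Dense(1, activation="sigmoid", name="predictions")(embedded)
	model = keras.Model(inputs, outputs, name="functional_1")

Note that the final Dense layer acts on each of the 32 token positions, which is why the output shape is (None, 32, 1) rather than one prediction per example.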

But when I call model.fit(trainDS), I get

ValueError: Arguments `target` and `output` must have the same rank (ndim). Received: target.shape=(None,), output.shape=(None, 32, 1)

Questions

  1. Why does target.shape=(None,)?

  2. Is the model lacking a layer mapping the predictions to the answer
    strings? And/or, should the answer column be mapped to integers
    instead of strings?

Package versions

torch=2.1.0.post100
torchtext=0.16.1
tensorflow=2.15.0
tensorflow_text=2.15.0
keras=3.0.5
keras_nlp=0.7.0

I did an experiment using map to transform the string labels to integers:

def binaryLbl(txt, tlbl):
    return txt, 1 if tlbl == 'Yes' else 0

trainDS2 = trainDS.map(binaryLbl)

but trainDS2.element_spec now shows the label with an empty (scalar) shape?!

(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))

Hi @Richard_Belew, this error occurs because the output shape of the last Dense layer and the shape of the labels do not match.

Here the labels have shape (1, 6) but the output layer has shape (None, 32, 1).

I have made a few changes to your code and was able to run it without any error. Please refer to this gist for a working code example. Thank you.


Fantastic @Kiran_Sai_Ramineni, thanks very much for your help on both of these!

The critical bits I’ve learned from your gist are:

  • using tf.where(label == "Yes", 1, 0) causes the label tensor to have “the shape of x, y, and condition broadcast together.”

  • then you do a second map:
    trainDS = trainDS.map(lambda text, label: (text, tf.expand_dims(label, -1)))
    to change the label tensor shape. Is there a reason this needs to be done separately from the first preprocessing map?

  • your model has extra layers, especially layers.GlobalAveragePooling1D() (sandwiched between two Dropout layers), which is critical to reducing the embedding output to a single prediction per example (see the sketch below).
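
Putting those pieces together, here is a minimal sketch of the label preprocessing plus the pooled model head, reusing `inputs` and `embedded` from the earlier sketch; the dropout rates, optimizer, and loss are assumptions, and as far as I can tell the two label maps could also be fused into one as shown:

	import tensorflow as tf
	import keras
	from keras import layers

	def encode_label(text, label):
	    # "Yes"/"No" -> 1/0, then add a trailing axis so the labels have
	    # shape (None, 1), matching the model's per-example output.
	    return text, tf.expand_dims(tf.where(label == "Yes", 1, 0), -1)

	trainDS = trainDS.map(encode_label)

	# Pool the (None, 32, 100) token embeddings down to (None, 100) so the
	# final Dense emits one prediction per example instead of one per token.
	x = layers.Dropout(0.2)(embedded)
	x = layers.GlobalAveragePooling1D()(x)
	x = layers.Dropout(0.2)(x)
	outputs = layers.Dense(1, activation="sigmoid", name="predictions")(x)

	model = keras.Model(inputs, outputs)   # `inputs`/`embedded` from the sketch above
	model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
	model.fit(trainDS)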