Trouble using tfds load to vectorize text

Hello

I am trying to vectorize text (Wikipedia) using tfds.load. I am trying to do something like this

This NLP example uses IMDB review data, and I was able to follow it successfully. But I am not able to do the same for the Wikipedia dataset. Apparently there is some inherent difference between the two types of dataset.

I have tried the following

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# Load Wikipedia dataset from tfds
dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True, split=tfds.Split.TRAIN)

print(type(dataset))
for i in dataset:
    print(i['text'].numpy().decode('utf-8'))

# Create a TextVectorization layer to convert text to vectors
vectorize_layer = TextVectorization(
    max_tokens=100,
    output_mode='int',
    output_sequence_length=50
)

# Adapt the vectorization layer to the dataset
vectorize_layer.adapt(dataset.map(lambda x,y: x['text']))

model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

This much runs without a problem. But when I fit the model

model.fit(dataset, epochs=5)

Then I get the error

TypeError: Expected string passed to parameter 'input' of op 'StringLower', got {'text': <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string>, 'title': <tf.Tensor 'IteratorGetNext:1' shape=() dtype=string>} of type 'dict' instead. Error: Expected string, got <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string> of type 'Tensor' instead.

What can I do?

Thanks

Hi @srivassid, while reproducing the error with the given code, I hit an error at the above line. I have gone through the dataset and can see that it contains no labels; only dict_keys(['text', 'title']) are present. Could you please let us know how you have defined the labels?

If you look at an element of the dataset, it will be

{'text': <tf.Tensor: shape=(), dtype=string, numpy=b'\xd0\xbf\xd0\xb5\xd1\x80\xd0\xb5\xd0\xbd\xd0\xb0\xd0\xbf\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\x9b\xd0\xb0\xd0\xb1\xd0\xb0\xd1\x85\xd3\x99\xd1\x83\xd0\xb0, \xd0\x90\xd0\xbb\xd0\xb8\xd0\xb0\xd1\x81 \xd0\x9c\xd0\xb8\xd1\x85\xd0\xb0-\xd0\xb8\xd4\xa5\xd0\xb0'>, 

'title': <tf.Tensor: shape=(), dtype=string, numpy=b'\xd0\x90\xd0\xbb\xd0\xb8\xd0\xb0\xd1\x81 \xd0\x9b\xd0\xb0\xd0\xb1\xd0\xb0\xd1\x85\xd3\x99\xd0\xb0'>}

You are passing the dataset elements (dicts) directly to the model instead of strings. This is the cause of the error. Could you try passing only the string? Thank you.
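
To make the suggestion above concrete, here is a minimal sketch of one way to pass only the string. It assumes a small in-memory dataset standing in for the tfds Wikipedia split (same 'text'/'title' dict structure), and the all-zero labels are placeholders, since the dataset itself has none. The names text_only, to_features and train_ds are made up for illustration.

```python
import tensorflow as tf

# In-memory stand-in for the tfds Wikipedia split (assumption: each element
# is a dict with 'text' and 'title' string tensors, as shown above).
dataset = tf.data.Dataset.from_tensor_slices({
    "text": ["first article body", "second article body", "a third article"],
    "title": ["First", "Second", "Third"],
})

# Each element is a dict of string tensors, not a (text, label) pair.
print(dataset.element_spec)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    output_mode="int",
    output_sequence_length=50,
)

# Elements are single dicts, so the map function takes one argument.
text_only = dataset.map(lambda x: x["text"])
vectorize_layer.adapt(text_only.batch(2))


def to_features(batch):
    """Turn a batch of dicts into (token ids, placeholder labels)."""
    token_ids = vectorize_layer(batch["text"])
    # Placeholder labels only: the Wikipedia data has no real labels.
    labels = tf.zeros(tf.shape(batch["text"]), dtype=tf.float32)
    return token_ids, labels


# Batch first, then vectorize, so the layer sees a 1-D batch of strings.
train_ds = dataset.batch(2).map(to_features)
```

With this layout the vectorization happens in the input pipeline, so a model that starts at the Embedding layer should be able to consume train_ds directly. What the labels should actually be still depends on the task.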

Hi

I solved it using the following code

dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True)

# Prepare the text data
texts = [example['text'].numpy().decode('utf-8') for example in dataset['train']]
labels = [0] * len(texts)  # Dummy labels for illustration purposes

# Create a TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=50000,  # You can adjust this value based on your requirements
    output_mode='tf-idf',
)

# Adapt the layer to the text data
vectorize_layer.adapt(texts)

# Vectorize the text data
vectorized_texts = vectorize_layer(texts)
labels = tf.convert_to_tensor(labels, dtype=tf.float32)


# Build a simple neural network model
model = tf.keras.Sequential([
    # Use the full tf.keras.layers path, since `layers` is never imported above
    tf.keras.layers.Dense(64, activation='relu', input_shape=(vectorize_layer.vocabulary_size(),)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(vectorized_texts, labels, epochs=5)
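
For anyone wondering why the first Dense layer is sized with vocabulary_size(): in TF-IDF mode the layer returns one fixed-width row per document, with one column per vocabulary entry. Below is a small sketch on toy sentences (made-up data, not the Wikipedia text). It uses the "tf_idf" spelling from the current Keras docs, which is an assumption about the installed version.

```python
import tensorflow as tf

# Toy corpus standing in for the Wikipedia texts (illustrative data only).
texts = tf.constant(["the cat sat", "the dog sat", "a bird flew"])

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50000,
    output_mode="tf_idf",
)
vectorize_layer.adapt(texts)

# One row per document, one column per vocabulary entry.
vectorized = vectorize_layer(texts)
vocab_size = vectorize_layer.vocabulary_size()
print(vectorized.shape, vocab_size)
```

Because each row already has a fixed width, no Embedding or pooling layer is needed, and the rows can go straight into a Dense layer.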