Question about Preprocessing of Text Data

Rudolf_Debelak · July 27, 2023, 3:30am

I am currently working through the example “Text Classification with TF Hub” on the TensorFlow for R website (TensorFlow for R - Text Classification with TF Hub) and am trying to replicate this model with some of my own data.

In the example, the data are first loaded from external text files via text_dataset_from_directory and stored as a BatchDataset of shape (None,) named train_data. In my own data, the texts and labels are available as R vectors of length 80. When I try to convert these data, I obtain a BatchDataset of shape (None, 80), which is incompatible with the TensorFlow model defined later.

My question is how I should preprocess my data so that I can use them with the TensorFlow Hub model.

Below is a minimal example based on my R code. In the end, I obtain a ValueError because the model expects data with shape (None,) but gets data with shape (None, 80). I am grateful for any help.

# Loading the packages
library(keras) 
library(tensorflow)
library(tfdatasets)
library(tfhub)

# We transform the vectors into a tensorflow dataset:
# texts_train and labels_train are vectors of length 80

texts_train_tf <- as_tensor(texts_train) # Has shape (80)
labels_train_tf <- as_tensor(labels_train) # Has shape (80)

train_tf <- tensors_dataset(c(texts_train_tf,labels_train_tf)) # TensorDataset of shape (80,)
train_data_tf <- dataset_batch(train_tf, batch_size = 32) # BatchDataset of shape (None, 80)

# Exploring the data
batch_tf <- train_data_tf %>%
  reticulate::as_iterator() %>%
  reticulate::iter_next()

# Following the example "Text Classification with TF Hub"
embedding <- "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer <- tfhub::layer_hub(handle = embedding, trainable = TRUE)

model <- keras_model_sequential() %>%
  hub_layer() %>%
  layer_dense(32, activation = "relu") %>%
  layer_dense(1)

model %>% compile(
  optimizer = 'adam',
  loss = 'mean_squared_error',
  metrics = 'accuracy'
)

history <- model %>% fit(
  train_data_tf,
  epochs = 10,
  verbose = 1
)

Fitting this model leads to the following error message:
Error: ValueError: in user code: <… omitted …>
File "C:\Users\RDEBEL~1\AppData\Local\Temp_autograph_generated_filei_8nea4e.py", line 37, in if_body_3
result = ag__.converted_call(ag__.ld(f), (), None, fscope)

ValueError: Exception encountered when calling layer 'keras_layer_16' (type KerasLayer).

in user code:

File "C:\Users\RDEBEL~1\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow_hub\keras_layer.py", line 234, in call *
result = f()

ValueError: Python inputs incompatible with input_signature:
inputs: ( Tensor("IteratorGetNext:0", shape=(None, 80), dtype=string))
input_signature: ( TensorSpec(shape=(None,), dtype=tf.string, name=None)).

Call arguments received by layer 'keras_layer_16' (type KerasLayer):
• inputs=tf.Tensor(shape=(None, 80), dtype=string)
• training=True

See reticulate::py_last_error() for details
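
One possible direction, offered as an untested sketch and not a confirmed fix: the (None, 80) shape suggests that the whole vector of 80 texts ends up as a single dataset element, so batching stacks entire vectors instead of individual texts. Slicing along the first dimension with tensor_slices_dataset from tfdatasets should instead yield one (text, label) pair per element, so that each batch of texts has shape (None,). The texts_train and labels_train values below are hypothetical stand-ins for the real data, and wrapping them in reticulate::tuple() is an assumption about how the pair is best passed to TensorFlow.

```r
library(tensorflow)
library(tfdatasets)

# Hypothetical stand-ins for the 80 texts and labels from the question
texts_train  <- paste("example text", seq_len(80))
labels_train <- rep(c(0, 1), length.out = 80)

# Slice along the first dimension: 80 elements, each a (text, label) pair.
# The tuple marks the two vectors as separate components, not one tensor.
train_tf <- tensor_slices_dataset(reticulate::tuple(texts_train, labels_train))

# Batching now stacks scalar strings, so the text component has shape (None,)
train_data_tf <- dataset_batch(train_tf, batch_size = 32)

# Inspect the first batch, as in the original exploration step
batch <- reticulate::iter_next(reticulate::as_iterator(train_data_tf))
print(batch[[1]]$shape)  # shape of the text component
print(batch[[2]]$shape)  # shape of the label component
```

If the shapes look right, train_data_tf can be passed to fit() in place of the dataset built with tensors_dataset above.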