Hello everyone,
I want to reproduce this example, but I would like the input_data
to be a nested list of strings instead. In other words:
import tensorflow as tf

text_dataset = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz"])
max_features = 5000
max_len = 4

# -------- EXPERIMENTAL ADDENDUM -----------------
def split_on_comma(input_data):
    # Custom split: tokenize on ", " instead of the default whitespace
    return tf.strings.split(input_data, sep=", ")
# ------------------------------------------------

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len,
    split=split_on_comma,
)

with tf.device("/CPU:0"):
    vectorize_layer.adapt(text_dataset.batch(64))

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(inputs)
model = tf.keras.Model(inputs=inputs, outputs=x)

# input_data = [["foo qux bar"], ["qux baz"]]  # Original input, I DO NOT want this...
input_data = [["foo", "qux", "bar"], ["qux baz"]]  # ...I WANT THIS AS INPUT INSTEAD
input_data = [[", ".join(x)] for x in input_data]  # MY EXPERIMENTAL WORKAROUND
model.predict(input_data)
The thing is that I have a dataset where one of the features is a list of strings, with a different length for each sample, whose elements do not share any context with each other, and I want to use it as training input for a model built with the Functional API. Splitting it into separate columns (one-hot encoding) is out of the question, as this feature is highly sparse (~7K distinct classes).
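For context, here is a minimal sketch of how I would apply the same join workaround inside TensorFlow itself (the ragged constant just stands in for my real feature): each variable-length list of strings is collapsed into a single ", "-delimited string per sample, which is the format split_on_comma expects.

tags = tf.ragged.constant([["foo", "qux", "bar"], ["qux", "baz"]])  # stand-in for my real feature
joined = tf.strings.reduce_join(tags, axis=-1, separator=", ")
print(joined)
# tf.Tensor([b'foo, qux, bar' b'qux, baz'], shape=(2,), dtype=string)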
I am a bit concerned that my implementation may be neither optimal nor correct, because I am getting different results; for instance, this is the example’s original output:
array([[2, 1, 4, 0],
       [1, 3, 0, 0]])
And what I am getting is:
array([[1, 0, 0, 0],
       [1, 0, 0, 0]])
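For reference, one way to inspect what adapt() actually learned (just a sketch; with the default settings, index 0 is the padding token and index 1 is the OOV token, so rows of [1, 0, 0, 0] would suggest every token is mapping to OOV):

print(vectorize_layer.get_vocabulary())
# e.g. ['', '[UNK]', 'foo', 'baz', 'bar']  (order may differ)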
Finally, I am not quite sure how (or whether) ragged tensors would be of use with the TextVectorization layer (again, the goal is to build a model using the Functional API). I have found some examples, but none focused specifically on TextVectorization. Also, I understand that TextVectorization processes “one string per sample”, but if so, what alternatives would you suggest for what I am aiming at here?
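For what it is worth, the custom split function already returns a RaggedTensor when called directly, since each sample splits into a different number of tokens; a quick sketch:

tokens = split_on_comma(tf.constant(["foo, qux, bar", "qux baz"]))
print(tokens)
# <tf.RaggedTensor [[b'foo', b'qux', b'bar'], [b'qux baz']]>

I have also seen that recent TF versions add a ragged=True argument to TextVectorization (for 'int' output mode) that returns ragged integer output instead of a padded tensor, but I have not verified this on my version.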
Has anyone had this problem before? How did you tackle it?
Thank you very much.