Keras Preprocessing - adapt multiple layers in one go

I’m a huge fan of tf.data, mainly because of how it speeds up the preprocessing of large datasets that don’t fit in memory. I have been using the Keras preprocessing layers for a while now, but I’m still struggling with one main issue: adapting multiple layers at once.

In the example from the guide introducing preprocessing layers in Keras, the author shows this snippet:

text_vectorizer = tf.keras.layers.TextVectorization(
     output_mode='multi_hot', max_tokens=2500)
features = train_ds.map(lambda x, y: x)
text_vectorizer.adapt(features)

normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(features.map(lambda x: tf.strings.length(x)))

def preprocess(x):
  multi_hot_terms = text_vectorizer(x)
  normalized_length = normalizer(tf.strings.length(x))
  # Combine the multi-hot encoding with review length.
  return tf.keras.layers.concatenate((multi_hot_terms, normalized_length))

def forward_pass(x):
  return tf.keras.layers.Dense(1)(x)  # Linear model.

inputs = tf.keras.Input(shape=(1,), dtype='string')
outputs = forward_pass(preprocess(inputs))
model = tf.keras.Model(inputs, outputs)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(train_ds, epochs=5)

Now, because we’re calling adapt twice, this code will iterate over the dataset two times. My question is whether there is a way to change this code so that both layers are adapted in one pass over the data. Something like a Model class, but for preprocessing:

class Preprocessor(tf.made_up_class.Preprocess):
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.text_vectorizer = tf.keras.layers.TextVectorization(
        output_mode='multi_hot', max_tokens=2500)
    self.normalizer = tf.keras.layers.Normalization(axis=None)

  def adapt(self, x):
    # In this imagined API, calling the layers here would let the base class
    # fit all of them during a single pass over the dataset.
    vectorized_text = self.text_vectorizer(x)
    out = self.normalizer(vectorized_text)
    return out


preprocessor = Preprocessor()
preprocessor.adapt(features)

Maybe this specific example is tricky, but many times one ends up fitting many StringLookup layers for different columns when using structured data, which can take hours if your data is big.
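
To make the structured-data case concrete, the pattern I mean looks roughly like this (assuming train_ds yields (features_dict, label) pairs; the column names are made up), where every adapt() call triggers its own full pass over the data:

import tensorflow as tf

string_columns = ["country", "device", "browser"]  # hypothetical columns

lookups = {}
for col in string_columns:
  lookup = tf.keras.layers.StringLookup()
  # Each adapt() iterates over the whole dataset, so N columns = N passes.
  lookup.adapt(train_ds.map(lambda x, y, c=col: x[c]))
  lookups[col] = lookup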

I saw this post about a new package, but I’m not sure it preprocesses the features in one pass.


Hi @Geraud

Welcome to the TensorFlow Forum!

You can create a callable function for input standardization for this task, then pass that callable to the TextVectorization() Keras layer via the standardize argument. The resulting vectorization layer can then be adapted on the input dataset for model training.
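
A minimal sketch of that idea (the standardization logic here is just an illustration):

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
  # Example standardization: lowercase the text and strip punctuation.
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(
      lowercase, '[%s]' % re.escape(string.punctuation), '')

text_vectorizer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    output_mode='multi_hot',
    max_tokens=2500)
text_vectorizer.adapt(train_ds.map(lambda x, y: x))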

Please have a look at this example of Dataset preprocessing using TextVectorization for reference. Thank you.


Thank you @Renu_Patel, this is nice if you want to extend the TextVectorization layer, but it doesn’t solve the problem of having to call adapt for each preprocessing layer individually. I think what I was looking for was a way to build a Preprocessing class the way you would subclass keras.Model, but instead of defining your forward step in call, you would define a new adapt method, and all preprocessors would be fitted over a single iteration of the dataset. I hope this makes sense?


I have the same question as @Geraud. In my case I have a few-TB dataset and hundreds of stateful preprocessing layers.

I see we can do this by calling update_state on each batch, but maybe the TF team would suggest something better:

sl = tf.keras.layers.StringLookup()

batches = [["a", "b"], ["c", "d"]]
for batch in batches:
  sl.update_state(batch)  # accumulate vocabulary statistics batch by batch

sl.finalize_state()  # build the final lookup table from the accumulated state
sl._is_adapted = True  # mark the layer as adapted (private attribute)

sl.get_vocabulary()
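
Extending that to several layers, a single pass over the dataset could look roughly like this (the dict-structured dataset and column names are assumptions, and depending on the TF version the layers may need to be built before the first update):

lookups = {col: tf.keras.layers.StringLookup() for col in ["country", "device"]}

# One pass over the dataset, updating every layer's state on each batch.
for features, _ in train_ds:
  for col, lookup in lookups.items():
    lookup.update_state(features[col])

for lookup in lookups.values():
  lookup.finalize_state()
  lookup._is_adapted = True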

Thanks @Sergii_Makarevych, and yes, this is what I ended up doing. I also had to call _adapt_maybe_build before the first update_state for some reason.
Hopefully there is (or will be) a cleaner/easier way to do this kind of thing, as preprocessing the data is a big part of every workflow.
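
For reference, the pattern with that extra call looks roughly like this (it relies on private APIs, so it may break across TF versions):

sl = tf.keras.layers.StringLookup()

batches = [["a", "b"], ["c", "d"]]
sl._adapt_maybe_build(tf.constant(batches[0]))  # build the layer before the first update
for batch in batches:
  sl.update_state(batch)

sl.finalize_state()
sl._is_adapted = True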
