Preprocessing in TensorFlow

Good evening,

I am working on a paper comparing Python libraries for machine learning and deep learning.

While evaluating Keras and TensorFlow separately, I am looking for information about TensorFlow methods or functions that can be used to preprocess datasets, comparable to those included in scikit-learn (sklearn.preprocessing) or the Keras preprocessing layers, but I can't find anything beyond one-hot encoding for labels…
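(For context, by one-hot encoding for labels I mean what `tf.one_hot(labels, depth)` produces; a plain numpy sketch of the same idea:)

```python
import numpy as np

def one_hot(labels, depth):
    """Numpy sketch of what tf.one_hot(labels, depth) returns for integer labels."""
    out = np.zeros((len(labels), depth))
    out[np.arange(len(labels)), labels] = 1.0  # set a single 1 per row at the label index
    return out

print(one_hot([0, 2, 1], 3))
```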

Does anyone know if what I am looking for exists?

Thank you very much!

Hi Dani,

Here are some links I have found to be helpful in this regard:

TensorFlow Transform
Getting started notebook
tf.Transform API

TensorFlow Transform can actually even take in Keras preprocessing layers, with certain caveats. It uses Apache Beam to scale the pipeline and does a lot to help with pipeline reproducibility.
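To make the "full pass over the data" idea concrete, here is a plain-numpy sketch of the semantics of an analyzer like `tft.scale_to_0_1` (only the math, not the real API; actual tf.Transform runs the analysis phase as a Beam job over the whole dataset):

```python
import numpy as np

def scale_to_0_1(column):
    """Numpy sketch of tft.scale_to_0_1's semantics: a full-pass analyzer
    computes the dataset-wide min and max, then each row is rescaled."""
    lo, hi = np.min(column), np.max(column)  # "analyze" phase: full pass over the data
    return (column - lo) / (hi - lo)         # "transform" phase: applied per row

print(scale_to_0_1(np.array([10.0, 20.0, 30.0])))
```

This split between a full-pass analyze step and a per-row transform step is exactly where the streaming design differs from scikit-learn's in-memory `fit`/`transform`.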

As far as Keras goes, here are some useful links:
Good starting point

Specific examples:
Normalization
Discretization
Category Encoding
Hashed crossing
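All of those layers share the adapt() pattern; as a rough illustration, here is a numpy sketch of what `keras.layers.Normalization` does (the layer itself, of course, runs inside the TF graph): adapt() makes a pass over the data to collect mean and variance, and the layer then applies the same statistics at training and inference time.

```python
import numpy as np

class NormalizationSketch:
    """Numpy sketch of keras.layers.Normalization: adapt() learns statistics
    from the data, calling the layer applies them."""
    def adapt(self, data):
        self.mean = np.mean(data, axis=0)
        self.var = np.var(data, axis=0)
    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-7)  # epsilon guards zero variance

norm = NormalizationSketch()
norm.adapt(np.array([[1.0], [2.0], [3.0]]))
print(norm(np.array([[2.0]])))  # mean of the adapted data is 2, so this maps to ~0
```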

The scikit-learn comparison is especially interesting, as the design choices of an all-in-memory approach vs. a streaming approach become quite apparent. They do have a lot in common, such as the goal of using the same pipeline at training time as at prediction time. The term "pipeline" itself is quite overloaded in the TensorFlow ecosystem: how a TFX pipeline and a TFT pipeline differ, and how they relate to each other, is an interesting point in itself. For example, if I remember correctly, make_column_selector in scikit-learn can be plugged directly into a scikit-learn pipeline, whereas in TFX, TensorFlow Data Validation handles inferring the schema, and TFT consumes that schema and enriches it, along with other artifacts, for use downstream. As such, TFX feels much more decoupled, and more powerful, but also necessarily more complex, with a steeper learning curve.
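For the scikit-learn side of the comparison, the in-memory version of that idea fits in a few lines: `make_column_selector` picks columns by dtype directly inside the pipeline, which is roughly the role TFDV's inferred schema plays in TFX (the column names below are invented for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; column names are made up for illustration.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "SF", "NY"]})

# Select columns by dtype inside the pipeline itself -- the sklearn
# analogue of letting TFDV infer a schema that TFT then consumes.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
    ("cat", OneHotEncoder(), make_column_selector(dtype_include=object)),
])

pipe = Pipeline([("preprocess", preprocess)])
print(pipe.fit_transform(df).shape)  # 3 rows: 1 scaled column + 2 one-hot columns
```

The same fitted `pipe` is then reused at prediction time, which is the shared goal both ecosystems pursue; the difference is that everything here happens in one process, in memory.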

Hopefully this is enough to get you started, let me know if you need any further information.

Other layers are available at: