Filtering examples using TFX

imayachita · April 28, 2022, 12:29pm

Hello all,
In the dataset I am working on, there are a lot of data points that I want to filter out, e.g. those that contain NaN values, out-of-bound values, etc. I also want to apply the same filtering to data points at inference time. Can I do this with TFX? Currently I am filtering them before the TFX stages, similar to the examples here. The caveat of this approach is that the filtering can’t be automatically replicated at inference time. I have implemented some TFX transformations, and I love that they can be automatically replicated by calling the TFX transform graph layer, so I am wondering whether I can do the same thing to filter out invalid data points. I think the blocker I faced was that TFX needs to know the expected tensor shape (because of the TF graph computation), and with filtering we wouldn’t know the expected output tensor shape in advance.

Thank you!
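
For readers following along, the kind of filtering described above could be sketched in plain Python as a single validity check that is reused both before training and at inference time. This is only an illustration and not a TFX API: the feature names, the bounds, and the helper names `is_valid_example` and `filter_examples` are all hypothetical.

```python
import math

# Hypothetical per-feature bounds; names and ranges are illustrative only.
BOUNDS = {"feat_1": (0.0, 100.0), "feat_2": (-1.0, 1.0)}


def is_valid_example(example, bounds=BOUNDS):
    """Return True if every bounded feature is present, not NaN, and in range."""
    for name, (low, high) in bounds.items():
        value = example.get(name)
        if value is None:
            return False  # missing value
        if isinstance(value, float) and math.isnan(value):
            return False  # NaN value
        if not (low <= value <= high):
            return False  # out-of-bound value
    return True


def filter_examples(examples, bounds=BOUNDS):
    """Keep only the valid examples; the output length depends on the data."""
    return [ex for ex in examples if is_valid_example(ex, bounds)]


# The same predicate can be called on a training batch or on a single
# inference request, which keeps the two code paths consistent.
training_batch = [
    {"feat_1": 5.0, "feat_2": 0.5},
    {"feat_1": float("nan"), "feat_2": 0.1},
    {"feat_1": 250.0, "feat_2": 0.0},
    {"feat_1": 10.0, "feat_2": None},
]
kept = filter_examples(training_batch)
```

Keeping the check in one shared function means the training-time and inference-time filters cannot drift apart, but it still runs outside the TF graph, which is the limitation the post is asking about.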

Robert_Crowe · April 29, 2022, 5:46pm

I’m not quite sure that I understand the problem. Are you filtering to remove features from individual examples, or filtering out whole examples? If you’re filtering out whole examples, how does that change the output tensor shape?

You can do that filtering in the TFX Transform component, and there are also related projects being developed in TFX-Addons that you’re welcome to contribute to.

imayachita · May 3, 2022, 9:27am

I want to filter out whole examples, which will change the output tensor shape. For example, suppose there are 2 features in the dataset, called “feat_1” and “feat_2”, and 1000 examples in total. If 200 examples have None values in either “feat_1” or “feat_2” and should be filtered out, then we should end up with only 800 examples after filtering (which are later fed to the model).
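
The arithmetic in this example can be sketched with synthetic data. This is a minimal pure-Python illustration, not TFX code: the way the None values are placed is an assumption made only so that exactly 200 of the 1000 examples are invalid.

```python
# Synthetic dataset of 1000 examples with two features.
# Assumption for illustration: "feat_1" is None when i % 10 == 0 and
# "feat_2" is None when i % 10 == 5, giving 100 + 100 = 200 invalid examples.
examples = [
    {
        "feat_1": None if i % 10 == 0 else float(i),
        "feat_2": None if i % 10 == 5 else float(i),
    }
    for i in range(1000)
]

# Boolean mask over the batch dimension: True means "keep this example".
mask = [
    ex["feat_1"] is not None and ex["feat_2"] is not None
    for ex in examples
]

# Applying the mask shrinks the batch dimension from 1000 to the number of
# True entries, which is only known once the data has been inspected.
filtered = [ex for ex, keep in zip(examples, mask) if keep]

num_before = len(examples)
num_after = len(filtered)
num_dropped = num_before - num_after
```

The number of features per example stays at 2; only the number of examples changes, and that count depends on the data rather than being fixed ahead of time.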