Great updates from the TFX team at I/O. I am also a newbie in this area, and after completing a few lessons from the Coursera MLOps Specialization, I instantly got hooked on TFX. Such an amazing framework! Data validation is my favorite component so far. @Robert_Crowe thank you for teaching it so beautifully.
In this video, I learned about TFX for NLP. As someone heavily into Computer Vision, a question arises:
Thanks, @Robert_Crowe. I think it might be good to have support for more involved vision metrics like mAP, IoU, etc. since they are common in object detection and segmentation, respectively. Both of them are quite extensively used.
Thanks Sayak, and I agree that for object detection and segmentation those would be great metrics to have in TFMA. We also need validation measures in TFDV, and I know that some folks are working on that.
This is fantastic feedback, Sayak. One of the first things I think the TFMA team will ask for when presented with an FR (feature request) for a new metric implementation is a set of large benchmarks (or at least one) they can use to test/finetune the metric implementation.
So if you (or anyone else) are aware of popular tasks defined on large datasets where modelers are interested in computing such accuracy metrics, please do add them.
Sure. Could you shed some more light on the large benchmark you mentioned? Something like state-of-the-art accuracy on the ImageNet-1k dataset? If so, I think you could use the TensorFlow Model Garden to see what performance their image classification implementations achieve.
Similarly, you will find implementations of the other most commonly used vision tasks – object detection and semantic segmentation. Both have well-established baselines on large datasets like MS-COCO, Places365, etc. For detection, mAP is the single most important metric we care about, and for segmentation, IoU should do just fine.
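To make the segmentation side concrete, IoU for binary masks is just the intersection of the foreground pixels over their union. A minimal NumPy sketch (an illustration only, not TFMA's implementation; the function name is mine):

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-Union between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0
```

mAP for detection is considerably more involved (per-class precision-recall curves swept over IoU thresholds), which is exactly why a shared benchmark to validate an implementation against matters.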
As I am writing, I think the suite of vision APIs from TFX could also enable the following things:
Validator components to help developers understand the distribution of a large image dataset. I personally think this kind of tooling is not yet available for enterprise-grade ML.
Utilities for running fast and distributed image similarity searches. There are already well established techniques for compressing embeddings and using them for fast retrieval. But having them integrated inside TFX would be a fun ride I think.
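At its core, the retrieval step is nearest-neighbor search over normalized embeddings. A brute-force sketch of the contract (a real pipeline would swap in a compressed or approximate index such as ScaNN or FAISS; the function name is mine):

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Return indices of the k embeddings most cosine-similar to query.
    Brute force: fine for small corpora, replaced by an ANN index at scale."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q  # cosine similarity against every stored embedding
    return np.argsort(scores)[::-1][:k]
```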
I am starting as an MLE at a startup today, in fact. So expect more TFX-related stuff from me in the coming days.
This is great! To answer your question specifically, it’s extremely useful for the team not just to know which evaluation metrics (e.g., mAP) are useful to the community, but also which tasks (or task types) the community wants to evaluate their models on that use those metrics.
This allows the team to
have a concrete benchmark to test the metric computation implementation and finetune performance (parallelization of compute, resource usage, etc.)
easily develop an example showcasing the new metric’s availability in TFMA/TFDV
Many teams using TFX will have their own proprietary datasets that you can’t share with us. But it would be really useful to know which public tasks map most closely to the ones the community is interested in.
Could you expand a bit more on this? I agree with the rest of the points mentioned for validation. Checking for format corruption might be another good addition.
Anomaly detection for vision datasets is way more nuanced than for many other modalities, I believe. For text-based problems, it’s easier to compute descriptive statistics and compare examples against them, but just getting the mean and std of a couple million images can be an expensive operation (given we are operating on higher-res images, say 224x224x3).
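To make the cost point concrete: per-channel statistics over millions of images have to be accumulated in a streaming/batched fashion rather than materializing the whole dataset in memory. A rough sketch (the function name and batching scheme are my own, not TFDV's):

```python
import numpy as np

def streaming_channel_stats(batches):
    """Accumulate per-channel mean/std over a stream of image batches
    without ever holding the full dataset in memory.
    Each batch has shape (B, H, W, 3)."""
    n = 0
    total = np.zeros(3, dtype=np.float64)     # running sum per channel
    total_sq = np.zeros(3, dtype=np.float64)  # running sum of squares
    for batch in batches:
        pixels = batch.reshape(-1, 3).astype(np.float64)
        n += pixels.shape[0]
        total += pixels.sum(axis=0)
        total_sq += (pixels ** 2).sum(axis=0)
    mean = total / n
    std = np.sqrt(np.maximum(total_sq / n - mean ** 2, 0.0))
    return mean, std
```

In a real pipeline, the per-batch partial sums would be computed in parallel (e.g., as a Beam combiner) and merged, which is the shape of computation TFDV is built around.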
Another post-training technique to discard some of the anomalies inside the training set is to plot the samples that cause a model to incur high loss values. Here’s a PoC of this technique. I learned this at one of the fast.ai classes.
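The ranking step of that technique is framework-agnostic: compute the per-example loss and surface the worst offenders for manual inspection. A hypothetical sketch with plain NumPy (names are illustrative):

```python
import numpy as np

def top_loss_indices(probs, labels, k=5):
    """Rank examples by per-sample cross-entropy so the highest-loss
    ones can be eyeballed for label noise or genuine anomalies.
    probs: (N, C) predicted class probabilities; labels: (N,) int classes."""
    eps = 1e-12  # avoid log(0)
    per_example_loss = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.argsort(per_example_loss)[::-1][:k]
```

Plotting the images at those indices is then a one-liner with matplotlib.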
Computing the distribution overlap is also not very straightforward for large vision datasets. In my experience, simple recipes like computing histogram frequencies often fail to capture the important trends.
One cheap but often effective technique I have seen on Kaggle (for tabular datasets in most cases) is to train a simple model to classify whether a given data point comes from the training distribution. How does this work in practice? We pool the training set with the validation (or test) set, discard the original labels, and relabel the points: 1 for the training examples, 0 for the rest. We then fit a simple binary classifier on this mix. If it can barely beat chance at telling the two apart, the distributions overlap well; if it separates them easily, there is a shift worth investigating.
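A minimal sketch of that idea (often called adversarial validation) with scikit-learn; out-of-fold predictions would be more rigorous than the in-sample fit shown here, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def adversarial_validation_auc(train_X, other_X, seed=0):
    """AUC of a classifier trained to separate training rows (label 1)
    from validation/test rows (label 0). ~0.5 means the distributions
    overlap well; close to 1.0 signals a shift worth investigating."""
    X = np.vstack([train_X, other_X])
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(other_X))])
    clf = LogisticRegression(max_iter=1000, random_state=seed).fit(X, y)
    return roc_auc_score(y, clf.predict_proba(X)[:, 1])
```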
All of this probably calls for a well-structured paper
For feature engineering, I think in most DL-related vision workflows users would want to incorporate their own data augmentation pipelines.
What I meant was that a bad image can have very little hue and value variance. That is the case if the lens cap is on, if I take a badly out-of-focus image (say, when things are way too close), an image without nearly enough light, an image of the sun, an image of the inside of my pocket, etc.
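Those degenerate captures could be flagged with a cheap HSV-variance check before training. A rough sketch (the thresholds are made up and would need tuning per dataset; the names are mine):

```python
import numpy as np

def hue_value_variance(img):
    """Variance of hue and value (HSV) for an RGB image in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    v = img.max(axis=-1)          # value = max channel
    c = v - img.min(axis=-1)      # chroma
    hue = np.zeros_like(v)
    mask = c > 1e-8               # hue is undefined where chroma is ~0
    rm, gm, bm, cm, vm = r[mask], g[mask], b[mask], c[mask], v[mask]
    h = np.where(vm == rm, ((gm - bm) / cm) % 6,
        np.where(vm == gm, (bm - rm) / cm + 2, (rm - gm) / cm + 4))
    hue[mask] = h / 6.0           # scale to [0, 1)
    return hue.var(), v.var()

def looks_degenerate(img, hue_thresh=1e-4, value_thresh=1e-4):
    """Flag near-constant images: lens cap on, pitch dark, blown out, etc."""
    hv, vv = hue_value_variance(img)
    return hv < hue_thresh and vv < value_thresh
```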