Multilabel categorization and TensorFlow Decision Forests

Mog · August 10, 2022, 9:44am

Hello everybody

I am trying to learn how to use TFDF. I keep hearing that decision-tree-based models are fast to train and query, so I wanted to see how much I could use them at work.

I have a multilabel classification task, so I used StringLookup to map the labels into a multi-hot vector. The GradientBoostedTreesModel is complaining: "ValueError: Can not squeeze dim[1], expected a dimension of 1, got 1184". I have 1184 unique labels, so the transform is fine, but apparently that’s not an acceptable format for the target. Is this out of scope for these models? Or do I just need to pass some parameter to the model object when I create it? I think I saw that multiclass is handled out of the box, so do I “simply” need to split the vector into one tensor for each class? That feels a bit inefficient.
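
To make the shape problem concrete, here is a minimal pure-Python sketch of multi-hot encoding. It only mimics the idea behind the multi-hot output described above and is not the Keras implementation; the example labels and the helper names build_vocabulary and multi_hot are made up for illustration. Each example becomes one row with one entry per distinct label, so with 1184 labels the target is 1184 wide, which matches the dimension reported in the error message.

```python
from typing import Dict, Iterable, List


def build_vocabulary(label_sets: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct label a column index, in sorted order."""
    vocab = sorted({label for labels in label_sets for label in labels})
    return {label: index for index, label in enumerate(vocab)}


def multi_hot(labels: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Encode one example's labels as a 0/1 vector of length len(vocab)."""
    row = [0] * len(vocab)
    for label in labels:
        row[vocab[label]] = 1
    return row


# Hypothetical toy data: three examples, three distinct labels.
examples = [["sports", "news"], ["news"], ["finance", "sports"]]
vocab = build_vocabulary(examples)
targets = [multi_hot(labels, vocab) for labels in examples]

# One row per example, one column per label.
print(len(targets), len(targets[0]))
```

The width of each row grows with the number of labels, whereas the error suggests the model wants a single label value per example.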

Thanks

Edit: PS is there a better place/forum to ask this question?

lgusm · August 11, 2022, 2:45pm

Hi Mog,

I think @Mathieu can help answer the question here

Mathieu · August 15, 2022, 7:38am

Hello Mog,

TF-DF / YDF does not support multi-label classification. While research [1] exists on this topic, supporting multi-label classification with decision forests is more complex than with other models (such as neural networks), so we don’t offer a solution for this yet.

As you mentioned, predicting each label independently is not suited to all applications: in your case, 1184 models would need to be trained (in parallel) and run (also in parallel).
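
A rough sketch of the one-model-per-label approach, under stated assumptions: the features are in a pandas DataFrame, the temporary column name "target" does not clash with a feature name, and the helper names split_multi_hot and train_one_model_per_label are hypothetical. This is not a built-in TF-DF feature, just the manual decomposition discussed here, and it trains the models one after another rather than in parallel.

```python
from typing import Dict, List, Sequence


def split_multi_hot(
    rows: Sequence[Sequence[int]], label_names: Sequence[str]
) -> Dict[str, List[int]]:
    """Turn N multi-hot rows of width K into K binary columns of length N."""
    for row in rows:
        if len(row) != len(label_names):
            raise ValueError("each row must have one entry per label")
    return {
        name: [int(row[k]) for row in rows]
        for k, name in enumerate(label_names)
    }


def train_one_model_per_label(features_df, targets: Dict[str, List[int]]):
    """Train one binary classifier per label (one-vs-rest), sequentially."""
    # Local import so split_multi_hot works without TF-DF installed.
    import tensorflow_decision_forests as tfdf

    models = {}
    for name, column in targets.items():
        if len(set(column)) < 2:
            continue  # label is constant in this sample, nothing to learn
        frame = features_df.copy()
        frame["target"] = column  # assumed not to clash with a feature name
        dataset = tfdf.keras.pd_dataframe_to_tf_dataset(frame, label="target")
        model = tfdf.keras.GradientBoostedTreesModel(verbose=0)
        model.fit(dataset)
        models[name] = model
    return models
```

Prediction would then mean querying every model and collecting the labels whose predicted probability passes a threshold you choose, which is where the cost of 1184 separate models shows up.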

Mog · August 15, 2022, 9:34am

Even with only 400-500 primary labels, the trees seemed very slow to run and sometimes crashed. Is that expected, or was it simply because I hadn’t found the progress indicator? It worked with only the two most common labels, but that isn’t useful.

Mathieu · August 15, 2022, 9:58am

If you see a crash, can you share the logs? Maybe we can figure it out.

The training time depends on the size of the dataset and the hyper-parameters. During pipeline development, it is a good idea to work with a small version of the dataset (e.g. 10k examples).
To increase the amount of logs printed during training, create your model with the verbose=2 constructor argument.
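
As a sketch of both suggestions: the helper names development_subset and build_verbose_model are hypothetical, and the fixed seed is an assumption made only to keep the sample reproducible. The first function caps the dataset at 10k examples; the second creates a model with the verbose=2 constructor argument mentioned above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def development_subset(
    examples: Sequence[T], max_examples: int = 10_000, seed: int = 0
) -> List[T]:
    """Return at most max_examples items, sampled without replacement."""
    if len(examples) <= max_examples:
        return list(examples)
    return random.Random(seed).sample(examples, max_examples)


def build_verbose_model():
    """Create a gradient boosted trees model with detailed training logs."""
    # Local import so development_subset works without TF-DF installed.
    import tensorflow_decision_forests as tfdf

    return tfdf.keras.GradientBoostedTreesModel(verbose=2)
```

With the extra logging it should be easier to tell a slow training run from one that is actually stuck.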

If you train a model for each label, each model is trained independently (and possibly in parallel). The number of labels should not influence individual model training. If you use a different setup, can you share details here?

Mog · August 15, 2022, 10:48am

I used 1k examples for development. But I only have 250k data points for English anyway. No crash logs, it just hangs when I use many targets. And I can kind of imagine that trees might not handle many possible output labels well…

But it’s fine. I need multilabel in the final model so I am not going to pursue this.