Multilabel categorization and Tensorflow Decision Forests

Hello everybody

I am trying to learn how to use TFDF. I keep hearing that decision-tree-based models are fast to train and query, so I wanted to see how much I could use them at work.

I have a multilabel classification task, so I used StringLookup to map the labels into a multi-hot vector. GradientBoostedTreesModel complains: `ValueError: Can not squeeze dim[1], expected a dimension of 1, got 1184`. I have 1184 unique labels, so the transform itself is fine, but apparently that's not an acceptable format for the target. Is this out of scope for these models? Or do I just need to pass some parameter when I create the model object? I think I saw that multiclass is handled out of the box, so do I "simply" need to split the vector into one tensor per class? That feels a bit inefficient.
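For context, this is the kind of encoding I mean. Below is a minimal plain-Python sketch of the multi-hot target (what `StringLookup` with `output_mode="multi_hot"` produces) rather than my actual pipeline; the vocabulary and labels are made up for illustration:

```python
def multi_hot(labels, vocab):
    """Encode a set of string labels as a multi-hot vector over a vocabulary."""
    index = {name: i for i, name in enumerate(vocab)}
    vec = [0] * len(vocab)
    for name in labels:
        vec[index[name]] = 1
    return vec

vocab = ["sports", "politics", "tech"]  # imagine 1184 entries in my real data
print(multi_hot(["tech", "sports"], vocab))  # [1, 0, 1]
```

So each target row has shape `(num_labels,)` instead of a single class index, which seems to be what the squeeze in the error message trips over.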


Edit: PS is there a better place/forum to ask this question?

Hi Mog,

I think @Mathieu can help answer the question here.

Hello Mog,

TF-DF / YDF does not support multi-label classification. While research work [1] exists on this topic, supporting multi-label classification with decision forests is more complex than with other model families (such as neural networks), so we don't offer a solution for this yet.

As you mentioned, predicting each label independently is not suited for all applications: in your case, 1184 models would need to be trained (possibly in parallel) and run (also possibly in parallel).
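To make the per-label setup concrete, here is a small sketch of the "binary relevance" decomposition: the multi-hot target matrix is split into one binary target column per label, and each column would feed its own independently trained model. The helper name and the TFDF usage in the comment are illustrative, not an official API:

```python
def binary_relevance_targets(multi_hot_rows, num_labels):
    """Split a multi-hot target matrix into one binary target list per label."""
    return [[row[j] for row in multi_hot_rows] for j in range(num_labels)]

rows = [[1, 0, 1],   # example 0 has labels 0 and 2
        [0, 1, 0]]   # example 1 has label 1
per_label = binary_relevance_targets(rows, 3)
print(per_label)  # [[1, 0], [0, 1], [1, 0]]

# Each inner list would then be the binary label column for one
# independently trained model, e.g. (hypothetical usage):
#   model_j = tfdf.keras.GradientBoostedTreesModel()
#   model_j.fit(features_ds_with_label_column_j)
```

With 1184 labels this means 1184 such columns and models, which is why it can be inefficient for large label spaces.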


Even with only 400-500 primary labels, the trees seemed very slow to run and sometimes crashed. Is that expected, or was it simply that I hadn't found the progress indicator? It worked with only the two most common labels, but that isn't useful.

If you see a crash, can you share the logs? Maybe we can figure it out.

The training time depends on the size of the dataset and the hyper-parameters. During pipeline development, it is a good idea to work with a small version of the dataset (e.g. 10k examples).
To increase the amount of logging printed during training, create your model with the `verbose=2` constructor argument.

If you train a model for each label, each model is trained independently (and possibly in parallel). The number of labels should not influence individual model training. If you use a different setup, can you share details here?


I used 1k examples for development, but I only have 250k data points for English anyway. There are no crash logs; it just hangs when I use many targets. And I can imagine that trees might not handle many possible output labels well…

But it’s fine. I need multilabel in the final model so I am not going to pursue this.