How to prepare data for multi-label classification with BERT?

Martin · June 26, 2021, 8:42am

If I want to do a multi-label text classification task, not multi-class classification, and my data is in this format:
1 this is a test. 0,0,1,0
2 this is another test 0,1,1,1
3 one more test 1,0,0,1

How should I prepare my data so that the Keras preprocessing API can easily create a tf.data.Dataset from it? For single-label classification, I can use the format from the Keras/TF tutorial shown below (one directory per class). But for multi-label classification, how should I organize my data so that tf.keras.preprocessing.text_dataset_from_directory still works with it?

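import tensorflow as tf

# Example values; adjust to your setup.
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

# Each subdirectory of 'aclImdb/train' is treated as one class.
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)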

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

Bhack · June 26, 2021, 4:31pm

You cannot use text_dataset_from_directory directly for this kind of multi-label data, because it infers a single class per file from the directory name.
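
One possible workaround is to skip the directory layout and build the dataset from in-memory lists. The sketch below is only an illustration: it assumes each line looks like the sample in the question (an id, the text, then comma-separated 0/1 flags), and the helper names parse_line and make_dataset are made up for this example.

```python
import re


def parse_line(line):
    """Split '<id> <text> <comma-separated 0/1 flags>' into (text, multi-hot list)."""
    match = re.match(r"^\s*\d+\s+(.*?)\s+([01](?:,[01])*)\s*$", line)
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    text, flags = match.groups()
    return text, [float(flag) for flag in flags.split(",")]


def make_dataset(lines, batch_size=32):
    """Build a batched tf.data.Dataset of (text, multi-hot label) pairs."""
    import tensorflow as tf  # imported here so parsing can be tried without TF

    texts, labels = zip(*(parse_line(line) for line in lines))
    ds = tf.data.Dataset.from_tensor_slices((list(texts), list(labels)))
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


# Sample rows copied from the question.
rows = [
    "1 this is a test. 0,0,1,0",
    "2 this is another test 0,1,1,1",
    "3 one more test 1,0,0,1",
]
parsed = [parse_line(row) for row in rows]
```

Calling make_dataset(rows) would then yield batches of string tensors paired with float label vectors. With labels in this multi-hot form, a model would typically end in one sigmoid unit per label and be trained with binary cross-entropy rather than softmax.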

See this example; it is for images, but the approach is much the same: