Accuracy metrics in multi-class multi-label classification

I use “accuracy” as a metric in the model.compile method for a multi-class multi-label classification problem. It yields poor accuracy numbers in the model.fit and model.evaluate methods. However, when I use “y_hat = model.predict(X_val)” and compare the results to Y_val, the accuracy is close to 99%. Can someone please advise where I went wrong?

The model is very simple:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, BatchNormalization, Conv1D, Dropout, Flatten, Dense

model = Sequential()
model.add(Input(shape=(Num_Feats, 1)))
model.add(BatchNormalization())
model.add(Conv1D(64, 8, strides=1, activation='relu'))
model.add(Dropout(0.4))
model.add(Flatten())
model.add(BatchNormalization())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(Num_Chems, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=1000, batch_size=128, verbose=1)

X_train has shape (11050, 403) and Y_train has shape (11050, 5). The evaluate method yields an accuracy of 0.336. Then I try the following:

from sklearn.metrics import accuracy_score

y_hat = model.predict(X_val)
accuracy = accuracy_score(Y_val, y_hat.round())

It yields 0.989 accuracy. Please review and advise. Thank you.

In multi-class multi-label classification problems, the “accuracy” metric as defined in Keras is not appropriate because it expects that only one class is the correct prediction for each sample, which is the scenario for single-label classification problems. Since you have a multi-label problem, where each sample can belong to multiple classes simultaneously, you need a different way to measure accuracy.

The accuracy_score from scikit-learn that you’re using after calling model.predict() and rounding the results is likely giving you a different measure of accuracy which is more suited for multi-label classification. This function computes the subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in Y_val. However, this metric can be too strict because it requires an all-or-nothing perfect match of all labels for each sample.
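
To make the “all-or-nothing” behaviour concrete, here is a small sketch (toy NumPy arrays standing in for the label matrices, not your actual data) that contrasts subset accuracy with a more lenient per-label accuracy:

import numpy as np

# toy ground truth and rounded predictions: 4 samples, 5 labels
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],   # one label wrong in this sample
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])

# per-label accuracy: fraction of individual label decisions that are correct
print((y_true == y_pred).mean())              # 19/20 = 0.95

# subset accuracy (what accuracy_score computes for multi-label input):
# a sample only counts if every one of its labels is correct
print((y_true == y_pred).all(axis=1).mean())  # 3/4 = 0.75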

Instead of using ‘accuracy’, you might want to use other metrics that are better suited for multi-label classification, such as:

Hamming Loss: This measures the fraction of labels that are predicted incorrectly, out of the total number of labels. It is a more relaxed metric than subset accuracy because it doesn’t require all labels for a sample to be correct.
F1 Score: This is the harmonic mean of precision and recall, and it can be calculated for each label and then averaged across all labels, which is known as macro averaging (see the sketch after this list).
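
For example, a minimal sketch with scikit-learn, reusing the Y_val and y_hat names from the original post (assumed, not tested against the actual data):

from sklearn.metrics import hamming_loss, f1_score

y_pred = y_hat.round()   # threshold the sigmoid outputs at 0.5

# fraction of individual label decisions that are wrong (lower is better)
print('Hamming loss:', hamming_loss(Y_val, y_pred))

# F1 computed per label, then averaged with equal weight per label (macro averaging)
print('Macro F1:', f1_score(Y_val, y_pred, average='macro'))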

Very good point, Ajay. To verify the concept, I changed the metric from “accuracy” to “mae”. It works nicely!

@Ajay_Krishna What is the right way to use f1_score as the metric? I replaced “metrics=['mae']” in the model.compile statement with “metrics=['f1_score']”, but it generated errors in model.fit().

First, calculate the confusion matrix. From it, calculate accuracy, precision and recall (sensitivity). From these it is easy to calculate the F1 score.

The F1 score is calculated as follows:

F1 score = 2 * (Precision * Recall) / (Precision + Recall)
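
A minimal sketch of that recipe for a single label, assuming y_true and y_pred are 0/1 NumPy vectors for that label:

import numpy as np

def f1_from_confusion_counts(y_true, y_pred):
    # confusion-matrix counts for one binary (one-vs-rest) label
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # recall is the same as sensitivity
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)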

Other metrics I have used are the Jaccard coefficient and ROC.

Jaccard Similarity: Measures the similarity between the predicted labels and the true labels for each instance. A higher Jaccard similarity indicates the model is accurately predicting the presence or absence of labels for each class. I remember defining it for my use case, though it’s been a while.

Keep in mind that a confusion matrix is defined for binary classification, so for a multi-label problem you need to compute one per label, comparing each label against all the others.
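
For reference, both ideas are available in scikit-learn; a sketch, again assuming the Y_val and y_hat names from the original post:

from sklearn.metrics import jaccard_score, multilabel_confusion_matrix

y_pred = y_hat.round()

# Jaccard similarity per sample (intersection over union of the predicted
# and true label sets), averaged over all samples
print(jaccard_score(Y_val, y_pred, average='samples'))

# one 2x2 confusion matrix per label, computed one-vs-rest; shape (num_labels, 2, 2)
print(multilabel_confusion_matrix(Y_val, y_pred))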

Sorry, I was not clear with my question. Let me try again. I used the following statement, but received error messages when I called model.fit():

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['f1_score'])

Is there something special about how to use f1_score in the compile statement?

Try instead
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.F1Score()])

More here.
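
If I recall the API correctly, in recent releases the metric also accepts averaging and threshold arguments, so for a multi-label model something like the following sketch should work (parameter values are illustrative):

import tensorflow as tf

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    # macro-average the per-label F1 scores, thresholding the sigmoid outputs at 0.5
    metrics=[tf.keras.metrics.F1Score(average='macro', threshold=0.5)],
)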

I tried that, but received the following error message:

AttributeError: module 'keras.api._v2.keras.metrics' has no attribute 'F1Score'

I am using TF V2.8.3.

Yes, tf.keras.metrics.F1Score first appeared in the 2.13.0 release.

You may implement @Ajay_Krishna’s formula in your own code, given that it’s not available in your version of TensorFlow. Here is one implementation I found in a Stack Overflow post:

import tensorflow as tf
from tensorflow.keras.metrics import Precision, Recall

class F1_Score(tf.keras.metrics.Metric):

    def __init__(self, name='f1_score', **kwargs):
        super().__init__(name=name, **kwargs)
        self.f1 = self.add_weight(name='f1', initializer='zeros')
        self.precision_fn = Precision(thresholds=0.5)
        self.recall_fn = Recall(thresholds=0.5)

    def update_state(self, y_true, y_pred, sample_weight=None):
        p = self.precision_fn(y_true, y_pred)
        r = self.recall_fn(y_true, y_pred)
        # since f1 is a variable, we use assign
        self.f1.assign(2 * ((p * r) / (p + r + 1e-6)))

    def result(self):
        return self.f1

    def reset_states(self):
        # we also need to reset the state of the precision and recall objects
        self.precision_fn.reset_states()
        self.recall_fn.reset_states()
        self.f1.assign(0)
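
If you use this class, it is passed to compile like any built-in metric, e.g.:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[F1_Score()])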

One more question on this subject: how do I tell whether a metric can be applied to multi-label problems, or whether it is only for single-label?

That’s a good question. In my opinion it always comes down to a yes/no decision per class: the class you are trying to detect is treated as 1 and all other classes as 0, and this is repeated until every class has been covered. So you can use all of these metrics, but some are more effective, and easier to use, than others.
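
Building on that idea, each column of a multi-label target matrix is already a one-vs-rest binary problem, so any binary metric can be applied label by label; a sketch with scikit-learn, reusing the names from earlier in the thread:

from sklearn.metrics import classification_report

# precision, recall and F1 for each label, treated one-vs-rest, plus averaged summaries
print(classification_report(Y_val, y_hat.round(), zero_division=0))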