The “Image Captioning with Visual Attention” code already works fine, but I’m looking for a way to measure how well it performs on my dataset. Is there any chance of implementing evaluation metrics such as BLEU, ROUGE, or METEOR?
I check for updates every week, and I hope an implementation of the evaluation metrics becomes available as soon as possible.
Thanks a lot!
Bhack
March 30, 2022, 5:33pm
#3
We are trying to collect some reusable contributions on these metrics at:
opened 04:03AM - 09 Mar 22 UTC
@mattdangerw and the keras-nlp team:
For standard classification metrics (AUC… , F1, Precision, Recall, Accuracy, etc.), [keras.metrics](https://keras.io/api/metrics/) can be used. But there are several NLP-specific metrics which can be implemented here, i.e., we can expose native APIs for these metrics.
I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!
I've listed a few metrics (this list is, by no means, comprehensive):
- Perplexity
- ROUGE
[paper](https://aclanthology.org/W04-1013)
Pretty standard metric for text generation. We can implement all variations: ROUGE-N, ROUGE-L, ROUGE-W, etc.
- BLEU
[paper](https://aclanthology.org/P02-1040/)
Another standard text generation metric.
Note: We can also implement [SacreBleu](https://github.com/mjpost/sacrebleu).
- BertScore
[paper](https://arxiv.org/abs/1904.09675), [code](https://github.com/Tiiiger/bert_score)
- Bleurt
[paper](https://arxiv.org/abs/2004.04696), [code](https://github.com/google-research/bleurt)
- chrF and chrF++ (character n-gram F-score)
[paper](https://aclanthology.org/W15-3049/), [code](https://github.com/m-popovic/chrF)
- COMET
[paper](https://aclanthology.org/2020.emnlp-main.213/), [code](https://github.com/Unbabel/COMET)
- Character Error Rate, Word Error Rate, etc.
[paper](https://www.semanticscholar.org/paper/From-WER-and-RIL-to-MER-and-WIL%3A-improved-measures-Morris-Maier/8516531ff3bd874b66b811f0bd4e21a2d6b10e54)
- Pearson Coefficient and Spearman Coefficient
Looks like `keras.metrics` does not have these two metrics. They are not NLP-specific, so it may be better to implement them in Keras than here.
Thank you!
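Until native implementations are available, a caption model can be scored offline with a few lines of plain Python. The following is only a minimal sketch of two of the metrics listed above, sentence-level BLEU (clipped n-gram precision with a brevity penalty, uniform weights, no smoothing) and ROUGE-L F1 (based on the longest common subsequence). It is not the keras-nlp API. The function names `sentence_bleu` and `rouge_l_f1` are hypothetical, and it assumes captions are already tokenized, here by a simple whitespace split. METEOR is not covered. For numbers that will be reported or compared with other work, an established implementation such as SacreBleu (linked above) is the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing.

    candidate: list of tokens; references: list of token lists.
    Returns 0.0 if any n-gram order has no match (hypothetical helper).
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c_len = len(candidate)
    r_len = min((len(r) for r in references), key=lambda l: (abs(l - c_len), l))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of two token lists."""
    m, n = len(candidate), len(reference)
    if m == 0 or n == 0:
        return 0.0
    # Dynamic-programming table: table[i][j] is the LCS length of the prefixes.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up captions, for illustration only.
    predicted = "a man rides a horse on the beach".split()
    ground_truth = ["a man is riding a horse on the beach".split()]
    print("BLEU-4:", sentence_bleu(predicted, ground_truth))
    print("ROUGE-L F1:", rouge_l_f1(predicted, ground_truth[0]))
```

To evaluate a whole dataset, one option is to generate a caption per image, tokenize it the same way as the ground-truth captions, and average the per-caption scores. Note that corpus-level BLEU as defined in the original paper pools n-gram counts over all sentences rather than averaging sentence scores, so the two numbers will generally differ.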