Learning multimodal entailment

Sentence 1: Sourav Ganguly is the greatest captain in BCCI.
Sentence 2: Ricky Ponting is the greatest captain in Cricket Australia.

Do these two sentences contradict or entail each other, or are they neutral? In NLP, this problem is known as textual entailment, and it is part of the GLUE benchmark for language understanding.
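To make the task concrete, here is a minimal sketch of how an entailment example is typically structured as a three-way sentence-pair classification problem. The `make_example` helper and the "neutral" label for this particular pair are my own illustration, not taken from the dataset itself:

```python
# Textual entailment is three-way classification over sentence pairs.
LABELS = ["entailment", "contradiction", "neutral"]

def make_example(premise, hypothesis, label):
    """Package a sentence pair with its entailment label (illustrative helper)."""
    assert label in LABELS
    return {"premise": premise, "hypothesis": hypothesis, "label": label}

example = make_example(
    "Sourav Ganguly is the greatest captain in BCCI.",
    "Ricky Ponting is the greatest captain in Cricket Australia.",
    # The two claims concern different people and different boards, so
    # plausibly neither entails nor contradicts the other.
    "neutral",
)
```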

On social media platforms, better curation and moderation of content often requires combining multiple sources of data (for example, text and images) to understand a post’s semantics. This is where multimodal entailment can be useful. In my latest post, I introduce the basics of this topic and present a set of baseline models for the Multimodal Entailment dataset recently introduced by Google. The recipes include “modality dropout”, cross-attention, and class-imbalance mitigation.
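As a rough illustration of one of those recipes, here is a minimal NumPy sketch of what “modality dropout” can look like: during training, occasionally zero out the feature vector of one modality so the model cannot over-rely on either the image or the text. The `modality_dropout` helper and its parameters are my own simplified assumptions, not the exact implementation from the post:

```python
import numpy as np

def modality_dropout(image_feats, text_feats, p=0.25, rng=None):
    """With probability p, zero out one randomly chosen modality.

    Illustrative sketch only; real implementations usually apply this
    per-example inside the training loop.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < p:
        # Drop exactly one of the two modalities, chosen uniformly.
        if rng.random() < 0.5:
            image_feats = np.zeros_like(image_feats)
        else:
            text_feats = np.zeros_like(text_feats)
    return image_feats, text_feats
```

Because the dropped modality is replaced with zeros rather than removed, the downstream fusion layers keep a fixed input shape.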

Fun fact: This marks the 100th example on keras.io.


Thanks to @markdaoust, @lgusm, and @jbgordon for their amazing tutorial.

My post on multimodal entailment uses a fair bit of code from that tutorial (with due citation, of course). With that, I wanted to take the opportunity to thank you folks for it, since it DEFINITELY makes solving GLUE tasks more accessible and approachable.


Thanks Sayak!!

I’m very glad that it was helpful!