Visual Entailment Task for Visually-Grounded Language Learning
Publication Date: 12/7/2018
Event: NeurIPS 2018 workshop on Visually Grounded Interaction and Language (ViGIL)
Reference: pp. 1-7, 2018
Authors: Ning Xie, Wright State University; Farley Lai, NEC Laboratories America, Inc.; Derek Doran, Wright State University; Asim Kadav, NEC Laboratories America, Inc.
Abstract: We introduce a new inference task – Visual Entailment (VE) – which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence. A novel dataset, SNLI-VE, is proposed for the VE task, built from the Stanford Natural Language Inference corpus and Flickr30K. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.
Publication Link: https://nips2018vigil.github.io/static/papers/accepted/5.pdf
Additional Publication Link: https://arxiv.org/pdf/1811.10582.pdf