Reasoning is the cognitive process of forming conclusions, judgments, or inferences from facts, evidence, or information. It involves the ability to think logically, analyze situations, and draw valid conclusions based on available data or knowledge. Reasoning is a fundamental aspect of human intelligence and is crucial for problem-solving, decision-making, and understanding complex relationships.


Peek-a-boo: Occlusion Reasoning in Indoor Scenes with Plane Representations

We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts of the plane. We infer these planes from a single input image with a novel neural network architecture. It consists of a two-branch category-specific module that aims to predict layout and objects of the scene separately so that different types of planes can be handled better. We also introduce a novel loss function based on plane warping that can leverage multiple views at training time for improved occlusion-aware reasoning. In order to train and evaluate our occlusion-reasoning model, we use the ScanNet dataset and propose (i) a strategy to automatically extract ground truth for both visible and hidden regions and (ii) a new evaluation metric that specifically focuses on the prediction in hidden regions. We empirically demonstrate that our proposed approach can achieve higher accuracy for occlusion reasoning compared to competitive baselines on the ScanNet dataset, e.g. 42.65% relative improvement on hidden regions.

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

Existing visual reasoning datasets, such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning, but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) – consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations.

Visual Entailment Task for Visually-Grounded Language Learning

We introduce a new inference task – Visual Entailment (VE) – which differs from traditional Textual Entailment (TE) tasks whereby a premise is defined by an image, rather than a natural language sentence as in TE tasks. A novel dataset SNLI-VE is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30K. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA based models perform.

Baseline Needs More Love: On SimpleWord-Embedding-Based Models and Associated Pooling Mechanisms

Many deep learning architectures have been proposed to model the compositionality in text sequences, requiring substantial number of parameters and expensive computations. However, there has not been a rigorous evaluation regarding the added value of sophisticated compositional functions. In this paper, we conduct a point-by-point comparative study between Simple Word-Embedding-based Models (SWEMs), consisting of parameter-free pooling operations, relative to word-embedding-based RNN/CNN models. Surprisingly, SWEMs exhibit comparable or even superior performance in the majority of cases considered. Based upon this understanding, we propose two additional pooling strategies over learned word embeddings: (i) a max-pooling operation for improved interpretability; and (ii) a hierarchical pooling operation, which preserves spatial (n-gram) information within text sequences. We present experiments on 17 datasets encompassing three tasks: (i) (long) document classification; (ii) text sequence matching; and (iii) short text tasks, including classification and tagging.

Adaptive Memory Networks

Adaptive Memory Networks We present Adaptive Memory Networks (AMN) that processes input-question pairs to dynamically construct a network architecture optimized for lower inference times for Question Answering (QA) tasks. AMN processes the input story to extract entities and stores them in memory banks. Starting from a single bank, as the number of input entities increases, AMN learns to create new banks as the entropy in a single bank becomes too high. Hence, after processing an input-question(s) pair, the resulting network represents a hierarchical structure where entities are stored in different banks, distanced by question relevance. At inference, one or few banks are used, creating a tradeoff between accuracy and performance. AMN is enabled by dynamic networks that allow input dependent network creation and efficiency in dynamic mini-batching as well as our novel bank controller that allows learning discrete decision making with high accuracy. In our results, we demonstrate that AMN learns to create variable depth networks depending on task complexity and reduces inference times for QA tasks.