
Visual Question Answering with Explicit Reasoning


Recent encouraging advances in computer vision and natural language understanding have shed light on a very interesting yet challenging task: asking and answering questions about a given image (VQA). The study of this research problem is still in its infancy. Most existing VQA methods are neural network-based (NN-based) solutions that pursue mappings from visual and language inputs to answers without much explicit inference or reasoning. While achieving good performance, these methods are heavily data-dependent and difficult to explain and generalize. In recent years, a number of methods have been proposed, such as the Neural Module Network and the Neural State Machine, which incorporate a reasoning process into the neural network architecture and achieve state-of-the-art performance. The basic idea is to first parse the question into a set of logical modules and then perform sequential reasoning over the image to obtain the answer. Although explicit reasoning is implemented for the question-understanding subtask, the whole model is still end-to-end with neural networks as the backbone.

Based on these observations, we propose a novel graph matching-based framework that enables better understanding of visual and textual inputs, as well as their interactions, to answer questions in a highly explainable manner. Specifically, for an input image and question pair, a scene graph $G_s$ and a query graph $G_q$ are constructed, respectively, followed by a graph matching module that finds the best match of $G_q$ in $G_s$. The matching result is either used as the answer itself (for closed-vocabulary datasets such as CLEVR) or jointly trained with a prediction module to infer the final answer (for open-vocabulary datasets such as GQA).

In this thesis, we first describe the graph matching-based framework for VQA in detail. We start our study with closed-vocabulary VQA datasets, where the number of object and relation categories is limited; this ensures that high-quality scene graphs $G_s$ can be generated. In addition, when $G_s$ does not contain sufficient information to answer the question, we develop techniques to infer the missing information using an inference graph, and we show that the inference graph greatly enhances the quality of the generated scene graphs. For question understanding, a template-based strategy is implemented to generate the query graph $G_q$. Given the generated $G_s$ and $G_q$, we apply subgraph isomorphism to obtain the answer. Our approach achieves superior performance on two closed-vocabulary datasets, Soccer-VQA and CLEVR, which demonstrates its efficacy.
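To make the matching step concrete, the following is a minimal sketch rather than the thesis implementation: it assumes attributed directed graphs in networkx and a toy CLEVR-style scene, and the attribute names (`label`, `color`, `relation`) and graph contents are illustrative assumptions.

```python
# Minimal sketch (not the thesis code): locate a query graph G_q inside a
# scene graph G_s via subgraph isomorphism, using networkx's VF2 matcher.
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Toy scene graph G_s: nodes carry object attributes, edges carry relations.
G_s = nx.DiGraph()
G_s.add_node("obj1", label="cube", color="red")
G_s.add_node("obj2", label="sphere", color="blue")
G_s.add_edge("obj1", "obj2", relation="left_of")

# Toy query graph G_q, e.g. parsed from
# "What color is the sphere to the right of the red cube?"
G_q = nx.DiGraph()
G_q.add_node("q1", label="cube", color="red")
G_q.add_node("q2", label="sphere")          # attribute to be answered
G_q.add_edge("q1", "q2", relation="left_of")

# Match predicates: every attribute specified in G_q must hold in G_s.
def node_match(s_attrs, q_attrs):
    return all(s_attrs.get(k) == v for k, v in q_attrs.items())

def edge_match(s_attrs, q_attrs):
    return s_attrs.get("relation") == q_attrs.get("relation")

matcher = DiGraphMatcher(G_s, G_q, node_match=node_match, edge_match=edge_match)
for mapping in matcher.subgraph_isomorphisms_iter():
    # mapping sends scene-graph nodes to query-graph nodes; invert it to read
    # the queried attribute off the matched scene object.
    inv = {q: s for s, q in mapping.items()}
    print("answer:", G_s.nodes[inv["q2"]]["color"])   # -> "blue"
```

In the closed-vocabulary setting, exact attribute equality is enough for this step because the object and relation categories come from a small, fixed set.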
We then present a novel hybrid framework for visual relation detection (VRD). As a crucial part of the scene graph generation model, VRD plays an important role in high-level image understanding tasks such as VQA, which require capturing the interactions between detected objects. Most previous work on VRD focuses only on local context or simple semantic information. Aiming to gather richer information, we develop a VRD framework that combines global features with traditional local features from both vision and semantics. Specifically, we propose a dual attention model (DAM) that performs visual and semantic attention in different modalities to gather the necessary information from the global context. At the reasoning stage, the global and local visual and semantic features are fused to predict pairwise relations between objects in the image. By combining local features with global ones, our model outperforms state-of-the-art methods on two widely used datasets: Visual Genome (VG) and the Visual Relationship Dataset (VRD).

Finally, we move to open-vocabulary VQA datasets to evaluate the generalizability of our approach. A template-based strategy is difficult to apply to question parsing here, because the sheer size of the semantic space makes it intractable. We instead deploy a dependency parser followed by a salience-measuring module to capture the dependency relations of salient words and phrases in the question, which are then used to construct the query graph $G_q$. At the matching stage, we replace strict string matching with embedding-based soft matching to handle the large lexical variation. We evaluate our model on the open-vocabulary GQA dataset with ground-truth scene graphs and achieve perfect performance. When the graphs are noisily detected, our model still achieves good accuracy while offering a higher degree of interpretability and generalizability.
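The embedding-based soft matching used in the open-vocabulary setting can be sketched as follows. This is an assumption-laden illustration rather than the thesis code: the three-dimensional embedding table is a hypothetical stand-in for pretrained word vectors, and the similarity threshold is arbitrary.

```python
# Minimal sketch (assumptions, not the thesis code) of soft matching:
# two labels match when the cosine similarity of their embeddings exceeds
# a threshold, instead of requiring identical strings.
import numpy as np

TOY_EMBEDDINGS = {            # hypothetical 3-d vectors, for illustration only
    "man":      np.array([0.9, 0.1, 0.0]),
    "person":   np.array([0.8, 0.2, 0.1]),
    "kid":      np.array([0.1, 0.9, 0.2]),
    "holding":  np.array([0.0, 0.3, 0.9]),
    "carrying": np.array([0.1, 0.2, 0.8]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def soft_match(query_label, scene_label, threshold=0.85):
    """Return True when the two labels are close enough in embedding space."""
    if query_label == scene_label:          # exact matches always pass
        return True
    q, s = TOY_EMBEDDINGS.get(query_label), TOY_EMBEDDINGS.get(scene_label)
    if q is None or s is None:
        return False
    return cosine(q, s) >= threshold

print(soft_match("person", "man"))        # True  -> treated as the same node
print(soft_match("person", "kid"))        # False -> different objects
print(soft_match("holding", "carrying"))  # True  -> relation labels match too
```

Swapping this predicate in for the strict equality test of the closed-vocabulary setting is what lets the query graph tolerate paraphrases and synonyms in open-vocabulary questions.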
