Publication Date: 5/16/2019
Event: CVPR 2019
Reference: pp. 1-7, 2019
Authors: Meera Hahn, Georgia Tech, NEC Laboratories America, Inc.; Asim Kadav, NEC Laboratories America, Inc.; James M. Rehg, Georgia Tech; Hans Peter Graf, NEC Laboratories America, Inc.
Abstract: Localizing moments in untrimmed videos using language queries is a new task that requires the ability to accurately ground language in video. Existing approaches process the video, often more than once, to localize the activities, which makes them inefficient. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to skip around the video, saving feature extraction and processing time. In our evaluation on the Charades-STA and ActivityNet Captions datasets, we find that TripNet achieves high accuracy while processing only 32-41% of the entire video.