Efficiency is the extent to which a system, process, or effort produces results with minimal waste of time, energy, or other resources. It measures how well a task or objective is accomplished relative to the resources expended. The concept applies to many areas, including energy usage, time management, resource allocation, and overall productivity.

Posts

Improving the Efficiency-Accuracy Trade-off of DETR-Style Models in Practice

This report aims to provide a comprehensive view of the inference efficiency of DETR-style detection models. We report the effect of basic efficiency techniques and identify factors that are easy to apply yet effectively improve the efficiency-accuracy trade-off. Specifically, we explore the effect of input resolution, multi-scale feature enhancement, and backbone pre-training. Our experiments support that 1) improving the detection accuracy for smaller objects while minimizing the increase in inference cost is a good strategy for achieving a better trade-off between accuracy and efficiency, 2) multi-scale feature enhancement can be lightened with marginal accuracy loss, and 3) improved backbone pre-training can further enhance the trade-off.
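One way to make an efficiency-accuracy trade-off concrete is to keep only the configurations that no other configuration beats on both cost and accuracy (a Pareto front). The sketch below is an illustration under assumptions, not the report's method: the `Candidate` type, the `pareto_front` helper, the configuration names, and all numbers are hypothetical placeholders and are not results from the report.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str          # hypothetical configuration label
    latency_ms: float  # inference cost, lower is better
    accuracy: float    # detection accuracy (e.g. AP), higher is better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that are not dominated by any other one.

    A candidate is dominated when another one is at least as fast and
    at least as accurate. Exact duplicates are collapsed to one entry.
    """
    front = []
    best_accuracy = float("-inf")
    # Visit candidates from fastest to slowest; among equally fast ones,
    # the most accurate comes first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.accuracy))
    for c in ordered:
        # Keep a slower candidate only if it is strictly more accurate
        # than everything faster (or equally fast) seen so far.
        if c.accuracy > best_accuracy:
            front.append(c)
            best_accuracy = c.accuracy
    return front


# Placeholder values invented for illustration only.
candidates = [
    Candidate("a", 10.0, 0.40),
    Candidate("b", 20.0, 0.38),
    Candidate("c", 20.0, 0.45),
    Candidate("d", 30.0, 0.45),
    Candidate("e", 40.0, 0.50),
]
front_names = [c.name for c in pareto_front(candidates)]
print(front_names)
```

In this toy data, a configuration that costs more without gaining accuracy drops off the front, which is the sense in which a technique "improves the trade-off": it moves a configuration onto or beyond the current front.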

Tripping Through Time: Efficient Localization of Activities in Videos

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In real-world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for only a few frames to perform activity classification. In our evaluation on the Charades-STA [14], ActivityNet Captions [26], and TACoS [36] datasets, we find that TripNet achieves high accuracy and saves processing time by looking at only 32-41% of the entire video.

15 Keypoints Is All You Need

Pose-tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames in a video. However, existing pose-tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient multi-person pose-tracking method, KeyTrack, that relies only on keypoint information, without using any RGB or optical flow information, to locate and track human keypoints in real time. KeyTrack is a top-down approach that learns spatio-temporal pose relationships by modeling the multi-person pose-tracking problem as a novel Pose Entailment task using a Transformer-based architecture. Furthermore, KeyTrack uses a novel, parameter-free keypoint refinement technique that improves the keypoint estimates used by the Transformers. We achieve state-of-the-art results on the PoseTrack’17 and PoseTrack’18 benchmarks while using only a fraction of the computation used by most other methods for computing the tracking information.

Tripping Through Time: Efficient Temporal Localization of Activities in Videos

Localizing moments in untrimmed videos using language queries is a new task that requires the ability to accurately ground language into video. Existing approaches process the video, often more than once, to localize the activities, which is inefficient. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to skip around the video, saving feature extraction and processing time. In our evaluation on the Charades-STA and ActivityNet Captions datasets, we find that TripNet achieves high accuracy while processing only 32-41% of the entire video.