Video Understanding refers to the ability of a system, typically a computer or artificial intelligence model, to interpret and comprehend the content of videos. This involves extracting meaningful information from video data, recognizing patterns, and making sense of the visual and temporal aspects present in the video stream.

Posts

StreamingRAG: Real-time Contextual Retrieval and Generation Framework

Extracting real-time insights from multi-modal data streams from various domains such as healthcare, intelligent transportation, and satellite remote sensing remains a challenge. High computational demands and limited knowledge scope restrict the applicability of Multi-Modal Large Language Models (MM-LLMs) on these data streams. Traditional Retrieval-Augmented Generation (RAG) systems address knowledge limitations of these models, but suffer from slow preprocessing, making them unsuitable for real-time analysis. We propose StreamingRAG, a novel RAG framework designed for streaming data. StreamingRAG constructs evolving knowledge graphs capturing scene-object-entity relationships in real-time. The knowledge graph achieves temporal-aware scene representations using MM-LLMs and enables timely responses for specific events or user queries. StreamingRAG addresses limitations in existing methods, achieving significant improvements in real-time analysis (5-6x faster throughput), contextual accuracy (through a temporal knowledge graph), and reduced resource consumption (using lightweight models by 2-3x).

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model’s strength and interpretability on two widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality (Code is available at https://github.com/hongluzhou/composer.).

Learning Higher-order Object Interactions for Keypoint-based Video Understanding

Action recognition is an important problem that requires identifying actions in video by learning complex interactions across scene actors and objects. However, modern deep-learning based networks often require significant computation and may capture scene context using various modalities that further increases compute costs. Efficient methods such as those used for AR/VR often only use human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only the keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover any loss in context from using keypoint information over AVA action and Kinetics datasets.

Tripping Through Time: Efficient Localization of Activities in Videos

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In the real world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos, by learning how to intelligently skip around the video. It extracts visual features for few frames to perform activity classification. In our evaluation over Charades-STA [14], ActivityNet Captions [26] and the TACoS dataset [36], we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.

Tripping Through Time: Efficient Temporal Localization of Activities in Videos

Localizing moments in untrimmed videos using language queries is a new task that requires the ability to accurately ground language into video. Existing approaches process the video, often more than once, to localize the activities and are inefficient. In this paper, we present TripNet, an end-to-end system which uses a gated attention architecture to model fine grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos, by learning how to skip around the video saving feature extraction and processing time. In our evaluation over Charades-STA and ActivityNet Captions dataset, we find that TripNet achieves high accuracy and only processes 32-41% of the entire video.

Attend and Interact: Higher-Order Object Interactions for Video Understanding

Human actions often involve complex interactions across several inter-related objects in the scene. However, existing approaches to fine-grained video understanding or visual relationship detection often rely on single object representation or pairwise object relationships. Furthermore, learning interactions across multiple objects in hundreds of frames for video is computationally infeasible and performance may suffer since a large combinatorial space has to be modeled. In this paper, we propose to efficiently learn higher-order interactions between arbitrary subgroups of objects for fine-grained video understanding. We demonstrate that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while saving more than 3-times the computation over traditional pairwise relationships. The proposed method is validated on two large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and SINet-Caption achieve state-of-the-art performances on both datasets even though the videos are sampled at a maximum of 1 FPS. To the best of our knowledge, this is the first work modeling object interactions on open domain large-scale video datasets, and we additionally model higher-order object interactions which improves the performance with low computational costs.

Adaptive Feature Abstraction for Translating Video to Text

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in the video may make it more appropriate to adaptively select features from the multiple CNN layers. We propose a new approach to generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed, which adaptively and sequentially focuses on different layers of CNN features (levels of feature “abstraction”), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.