Offline to Online Streaming Distillation of Action Detection Models
Publication Date: 3/6/2026
Event: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026, Tucson, Arizona
Reference: pp. 6205-6214, 2026
Authors: Deep Patel, NEC Laboratories America, Inc.; Yasunori Babazaki, NEC Corporation; Yasuto Nagase, NEC Corporation; Iain Melvin, NEC Laboratories America, Inc.; Martin Renqiang Min, NEC Laboratories America, Inc.
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in offline video action detection, but their reliance on processing fixed-size clips with full spatio-temporal attention introduces massive redundant computation, making them expensive and ill-suited for real-time streaming applications. This paper introduces a novel framework that adapts these powerful offline models into efficient, online student models through knowledge distillation. We propose two causal, streaming-friendly attention architectures that replace the full self-attention mechanism: (1) a lightweight Temporal Shift Attention that integrates past context with minimal overhead, and (2) a Decomposed Spatial-Temporal Attention that combines intra-frame spatial attention with an LSTM for temporal modeling. Both architectures use caching to eliminate redundant operations on a frame-by-frame basis. To maximize knowledge transfer, we introduce an uncertainty-guided distillation process that formulates training as a multi-task learning problem. The resulting models demonstrate significant efficiency gains, achieving up to a 4x improvement in latency and throughput over the original offline teacher while maintaining state-of-the-art accuracy among online methods. Our work provides a practical and effective methodology for deploying high-accuracy transformer models in latency-sensitive, real-world video analysis systems.
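The abstract's uncertainty-guided distillation frames training as multi-task learning. One common way to realize such a scheme (a sketch only; the paper's exact formulation may differ) is homoscedastic-uncertainty weighting, where each task loss is scaled by a learned log-variance so that noisier objectives are automatically down-weighted. The function name and the example loss values below are illustrative, not taken from the paper:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with learned homoscedastic uncertainty.

    Each task i contributes exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)
    is a learnable scalar. High-uncertainty tasks are down-weighted, and the
    additive s_i term penalizes the trivial solution sigma -> infinity.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# Hypothetical example: a feature-distillation loss and a detection loss.
# In practice the log-variances are trained jointly with the student model;
# here they are fixed scalars for illustration.
losses = [0.8, 2.0]
log_vars = [0.0, 1.0]
combined = uncertainty_weighted_loss(losses, log_vars)
```

In a full training loop, `log_vars` would be registered as learnable parameters and optimized alongside the student network, letting the model balance the distillation and detection objectives without hand-tuned weights.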
Publication Link: https://openaccess.thecvf.com/content/WACV2026/papers/Patel_Distilling_Offline_Action_Detection_Models_into_Real-Time_Streaming_Models_WACV_2026_paper.pdf


