Making Video AI Fast Enough for the Real World
A preoccupied pedestrian steps off the curb at a busy intersection. A robot arm reaches for the wrong part. A drowning swimmer’s stroke pattern changes two minutes before they go under.
In each case, the AI model watching the scene has milliseconds to decide what is happening — not seconds, not minutes. This is the gap that most state-of-the-art video AI cannot bridge.
The models that perform best are also the slowest: they watch an entire clip of frames at once, compute relationships across every pixel in space and time, and only then produce an answer. By the time the answer arrives, the pedestrian has already crossed.
In Distilling Offline Action Detection Models into Real-Time Streaming Models, Deep Patel, Iain Melvin, and Martin Renqiang Min of NEC Laboratories America, together with Yasunori Babazaki and Yasuto Nagase of NEC Corporation, offer a principled way to close that gap without rebuilding the best models from scratch. Presented at WACV 2026, the work introduces a framework for transforming high-accuracy offline video models into efficient, real-time streaming systems.
The key takeaway: the distilled streaming models largely retain accuracy while cutting latency by up to 4x.
The Problem with Powerful Models
Vision Transformers have become the dominant approach for video action detection because their self-attention mechanism can capture rich, long-range relationships across both space and time. Feed a ViT a clip of frames, and it will find dependencies that simpler models miss.
But this strength is also the bottleneck. These models process a fixed window of frames all at once. In a real-time setting, as each new frame arrives, the model must reprocess the entire overlapping window from scratch. If your window is 8 frames long, 7 of those frames were already computed in the previous step. That redundancy compounds across layers, quickly making real-time deployment impractical.
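To make the redundancy concrete, the following sketch counts how many frame evaluations a sliding-window model performs on a stream, compared with a model that touches each frame once. The function names and the 100-frame stream are illustrative assumptions, not figures from the paper.

```python
def redundant_fraction(window: int, stride: int = 1) -> float:
    """Fraction of each window that was already seen in the previous window."""
    if not 1 <= stride <= window:
        raise ValueError("stride must be between 1 and window")
    return (window - stride) / window


def sliding_window_evaluations(num_frames: int, window: int, stride: int = 1) -> int:
    """Total frames pushed through the model when every window is recomputed."""
    if num_frames < window:
        return 0
    num_windows = (num_frames - window) // stride + 1
    return num_windows * window


if __name__ == "__main__":
    print(redundant_fraction(8))               # 0.875, i.e. 7 of 8 frames are repeats
    print(sliding_window_evaluations(100, 8))  # 744 evaluations for a 100-frame stream
```

A streaming model that processes each frame once would perform 100 evaluations on the same stream, before even counting the per-layer cost inside each evaluation.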
A natural fix might be caching: saving computations from previous frames and reusing them. But for standard transformer architectures, caching creates a subtle and damaging mismatch. The models are trained on clean, self-contained clips, but during inference they would rely on stale cached states from prior windows. Performance degrades in ways that are hard to predict and hard to fix.
The approach taken in the paper sidesteps this problem entirely. Rather than forcing an offline model to behave like a streaming one, the authors train a new model, a student, that is causal from the ground up.
The Core Idea: Teach, Don’t Retrofit
The framework is built on knowledge distillation. A powerful offline “teacher” model watches full video clips with full spatiotemporal attention and produces rich supervisory signals. A lightweight “student” model learns to replicate the teacher’s behavior but is designed to process one frame at a time using cached state from prior frames.
Crucially, the distillation is self-supervised. The student learns from the teacher’s outputs rather than ground-truth labels. This makes the trained student highly portable. Deploying it in a new environment does not require re-annotating data.
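As a rough illustration of label-free distillation, the sketch below scores a student by how closely its class distribution matches the teacher's, using a temperature-softened KL divergence. This is the textbook form of distillation on logits, assumed here for illustration; the paper's exact objective may differ, and the function names and temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    No ground-truth label appears anywhere: the teacher's output is the target.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for three action classes.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss(teacher, [1.0, 2.0, 3.0]))  # zero when the student matches
print(distillation_loss(teacher, [3.0, 2.0, 1.0]))  # positive when it disagrees
```

Because the target is the teacher's output, the same loss can be computed on unlabeled footage from a new deployment site.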
Two student architectures are introduced in the paper:
* Temporal Shift Attention (TSA) is the lightweight option. It shifts a fraction of feature channels from the previous frame into the current frame before computing attention. This creates a continuous temporal thread with minimal computational cost.
* Decomposed Spatial-Temporal Attention (DSTA) separates spatial and temporal processing. Spatial attention is computed per frame, while an LSTM carries temporal information forward using cached states.
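The channel-shifting idea behind TSA can be sketched in a few lines. The version below assumes per-frame features shaped as (tokens, channels), shifts the first quarter of the channels, and caches the unshifted features of the previous frame. The class name, the shift fraction, and the choice of which channels to shift are all illustrative assumptions rather than details from the paper. DSTA would carry an LSTM state between frames instead of raw features.

```python
import numpy as np


class TemporalShiftCache:
    """Hypothetical streaming helper: mixes cached channels into each new frame."""

    def __init__(self, shift_fraction: float = 0.25):
        self.shift_fraction = shift_fraction
        self.previous = None  # features of the last frame, or None at stream start

    def step(self, tokens: np.ndarray) -> np.ndarray:
        """tokens: array of shape (num_tokens, channels) for the current frame."""
        num_shifted = int(tokens.shape[1] * self.shift_fraction)
        mixed = tokens.copy()
        if self.previous is not None:
            # Overwrite the leading channels with those of the previous frame.
            mixed[:, :num_shifted] = self.previous[:, :num_shifted]
        # Cache the unshifted features (an assumption) for the next frame.
        self.previous = tokens.copy()
        return mixed  # this is what the attention layer would consume


cache = TemporalShiftCache(shift_fraction=0.25)
first = cache.step(np.zeros((2, 8)))   # first frame passes through unchanged
second = cache.step(np.ones((2, 8)))   # 2 of 8 channels now come from frame one
```

The only state kept between frames is one feature array, and the shift itself is a copy with no extra arithmetic, which is why this kind of temporal mixing is cheap.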
A Smarter Way to Balance What to Learn
Distillation typically aligns the student’s outputs with the teacher’s outputs. This work goes further by transferring knowledge from three sources: action logits, intermediate attention maps, and region-of-interest features.
Instead of fixed loss weights, the paper introduces uncertainty-guided weighting. Each loss term has a learnable uncertainty parameter. Less reliable signals are down-weighted automatically, while stable signals are emphasized.
This leads to a clear pattern: early transformer layers receive more weight, while later layers are deemphasized. This aligns with the idea that foundational features are more important to transfer.
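One common way to realize learnable uncertainty weights is to give each loss term a log-variance parameter s and minimize exp(-s) * loss + s. The sketch below assumes that form, which this post does not confirm as the paper's exact formulation, and holds the individual losses fixed so the effect on the weights is easy to see. All names and values are hypothetical.

```python
import math


def uncertainty_weighted_loss(losses, log_variances):
    """Sum of exp(-s) * loss + s over all terms (assumed formulation)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_variances))


def fit_log_variances(losses, steps=2000, learning_rate=0.05):
    """Gradient descent on the log-variances with the losses held fixed."""
    log_variances = [0.0] * len(losses)
    for _ in range(steps):
        # d/ds [exp(-s) * loss + s] = 1 - exp(-s) * loss
        grads = [1.0 - math.exp(-s) * loss for loss, s in zip(losses, log_variances)]
        log_variances = [s - learning_rate * g for s, g in zip(log_variances, grads)]
    return log_variances


# Hypothetical per-term losses: a stable signal and a noisier one.
losses = [0.5, 4.0]
weights = [math.exp(-s) for s in fit_log_variances(losses)]
print(weights)
```

With the losses fixed, each weight settles near 1 / loss, so the term with the larger loss ends up with the smaller weight. In real training the losses and the uncertainty parameters are updated jointly.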
Results: Fast Enough for the Real World
On the AVA v2.2 benchmark, the student models match the accuracy of much larger models while running more than 4x faster. The lightest version reaches 217 FPS, making it viable for edge deployment.
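To put the throughput figure in deployment terms, the sketch below converts frames per second into a per-frame time budget. It assumes the reported FPS reflects one frame at a time with no batching, ignores the person detector and I/O overhead, and uses a 30 FPS camera as a hypothetical input rate.

```python
def per_frame_latency_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds."""
    return 1000.0 / fps


def max_streams(model_fps: float, camera_fps: float) -> int:
    """Whole camera streams one model instance could keep up with."""
    return int(model_fps // camera_fps)


print(round(per_frame_latency_ms(217), 2))  # about 4.61 ms per frame
print(max_streams(217, 30))                 # 7 streams at an assumed 30 FPS each
```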
The models also generalize well to new domains without additional labeled data. On a pedestrian dataset, TSA comes within 3 points of the teacher’s accuracy with only 870 training clips.
What This Enables
This work narrows the gap between offline accuracy and real-time performance.
Applications include:
* Pedestrian intent prediction before actions occur
* Real-time robotics perception
* Live video monitoring on limited hardware
The framework itself is general. Any high-performance offline model can serve as a teacher, enabling deployment in streaming environments.
Limitations and Next Steps
Two limitations remain:
* Reliance on a separate person detector for bounding boxes
* Limited performance on very long temporal sequences
Future work may include end-to-end architectures and longer-range memory models.
The broader takeaway is that high-performance models can be adapted for real-time use through principled distillation.
About The Authors
Deep Patel is a Senior Associate Researcher in the Machine Learning Department at NEC Laboratories America in Princeton, NJ. He earned his Bachelor of Science (BS) in Computer Science from Towson University. At NEC, Deep contributes to platforms for intelligent visual analytics, visual search, and vision-language interaction, helping develop video-based reasoning models that operate in real time across multi-camera systems. His work includes optimizing neural architectures for embedded systems and designing scalable inference pipelines for video AI applications.
Iain Melvin is a Researcher in the Machine Learning Department of NEC Laboratories America, where he develops scalable and trustworthy AI systems for real-world applications. His work focuses on human–computer interaction, user interface development, and full-stack systems that enable effective interaction with complex AI models. He builds cloud-based platforms and data pipelines that support intelligent document analysis, collaborative language model agents, and video understanding across domains such as healthcare and manufacturing.
Martin Renqiang Min is the Department Head of the Machine Learning Department at NEC Laboratories America. He holds a Ph.D. in Computer Science from the University of Toronto and completed postdoctoral research at Yale University, where he also taught courses on deep learning. At NEC, Dr. Min directs a multidisciplinary research team at the forefront of foundational and applied artificial intelligence. His portfolio spans deep learning, natural language understanding, multimodal learning, visual reasoning, and the application of machine learning to biomedical and healthcare data.