Making Video AI Fast Enough for the Real World

A preoccupied pedestrian steps off the curb at a busy intersection. A robot arm reaches for the wrong part. A drowning swimmer’s stroke pattern changes two minutes before they go under.

In each case, the AI model watching the scene has milliseconds to decide what is happening, not seconds or minutes. This is the gap that most state-of-the-art video AI cannot bridge.

The models that perform best are also the slowest: they watch an entire clip of frames at once, compute relationships across every pixel in space and time, and only then produce an answer. By the time the answer arrives, the pedestrian has already crossed.

In Distilling Offline Action Detection Models into Real-Time Streaming Models, Deep Patel, Iain Melvin, and Martin Renqiang Min of NEC Laboratories America, together with Yasunori Babazaki and Yasuto Nagase of NEC Corporation, offer a principled way to close that gap without rebuilding the best models from scratch. Presented at WACV 2026, the work introduces a framework for transforming high-accuracy offline video models into efficient, real-time streaming systems.

The key takeaway: the distilled streaming models retain accuracy while cutting latency by up to 4x.

The Problem with Powerful Models

Vision Transformers have become the dominant approach for video action detection because their self-attention mechanism can capture rich, long-range relationships across both space and time. Feed a ViT a clip of frames, and it will find dependencies that simpler models miss.

But this strength is also the bottleneck. These models process a fixed window of frames all at once. In a real-time setting, as each new frame arrives, the model must reprocess the entire overlapping window from scratch. If your window is 8 frames long, 7 of those frames were already computed in the previous step. That redundancy compounds across layers, quickly making real-time deployment impractical.
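The redundancy described above can be made concrete with a little counting. The sketch below is illustrative only: it assumes a stride-1 sliding window and treats "encoding one frame" as the unit of work. The function name `sliding_window_cost` and the 64-frame stream length are hypothetical and do not come from the paper.

```python
def sliding_window_cost(num_frames: int, window: int) -> dict:
    """Count per-frame encodings for a stride-1 sliding window vs. a streaming model.

    Assumption: the offline model re-encodes every frame in every window,
    while a streaming model encodes each incoming frame exactly once.
    """
    if window < 1 or num_frames < window:
        raise ValueError("need 1 <= window <= num_frames")
    num_windows = num_frames - window + 1   # one prediction per window position
    offline = num_windows * window          # every window is recomputed in full
    streaming = num_frames                  # each frame is touched once
    return {
        "offline": offline,
        "streaming": streaming,
        "reused_per_step": window - 1,      # frames already seen in the previous window
        "redundancy": offline / streaming,
    }


cost = sliding_window_cost(num_frames=64, window=8)
print(cost)
```

With an 8-frame window, 7 of the 8 frames in each step were already processed in the previous one, so the offline count grows roughly with the window length while the streaming count stays at one encoding per frame.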

A natural fix might be caching: saving computations from previous frames and reusing them. But for standard transformer architectures, caching creates a subtle and damaging mismatch. The models are trained on clean, self-contained clips, but at inference time they must rely on stale cached states from prior windows. Performance degrades in ways that are hard to predict and hard to fix.

The approach taken in the paper sidesteps this problem entirely. Rather than forcing an offline model to behave like a streaming one, the authors train a new model, a student, that is causal from the ground up.

The Core Idea: Teach, Don’t Retrofit

The framework is built on knowledge distillation. A powerful offline “teacher” model watches full video clips with full spatiotemporal attention and produces rich supervisory signals. A lightweight “student” model learns to replicate the teacher’s behavior but is designed to process one frame at a time using cached state from prior frames.

Crucially, the distillation is self-supervised. The student learns from the teacher’s outputs rather than ground-truth labels. This makes the trained student highly portable. Deploying it in a new environment does not require re-annotating data.

Two student architectures are introduced in the paper:

  1. Temporal Shift Attention (TSA) is the lightweight option. It shifts a fraction of feature channels from the previous frame into the current frame before computing attention. This creates a continuous temporal thread with minimal computational cost.
  2. Decomposed Spatial-Temporal Attention (DSTA) separates spatial and temporal processing. Spatial attention is computed per frame, while an LSTM carries temporal information forward using cached states.
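To make the temporal-shift idea more tangible, here is a minimal NumPy sketch of a streaming layer in the spirit of TSA. It is not the authors' implementation. The shift fraction of 0.25, the choice to cache the pre-shift features, the projection-free single-head attention, and the names `temporal_shift` and `StreamingTSALayer` are all assumptions made for illustration.

```python
import numpy as np


def temporal_shift(current, previous, shift_fraction=0.25):
    """Overwrite the first fraction of channels with the previous frame's values.

    current, previous: arrays of shape (num_tokens, channels).
    On the first frame there is no history, so the input is returned unchanged.
    """
    num_shift = int(current.shape[-1] * shift_fraction)
    mixed = current.copy()
    if previous is not None and num_shift > 0:
        mixed[..., :num_shift] = previous[..., :num_shift]
    return mixed


def self_attention(tokens):
    """Single-head scaled dot-product attention without learned projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


class StreamingTSALayer:
    """Processes one frame at a time, caching only the previous frame's features."""

    def __init__(self, shift_fraction=0.25):
        self.shift_fraction = shift_fraction
        self.cache = None  # hypothetical cache: last frame's pre-shift tokens

    def step(self, tokens):
        mixed = temporal_shift(tokens, self.cache, self.shift_fraction)
        self.cache = tokens  # assumption: cache the unshifted features
        return self_attention(mixed)


rng = np.random.default_rng(0)
layer = StreamingTSALayer()
for frame_tokens in rng.normal(size=(3, 4, 8)):  # 3 frames, 4 tokens, 8 channels
    out = layer.step(frame_tokens)
print(out.shape)
```

The point of the sketch is the cost profile: each step attends over a single frame's tokens and carries only one cached tensor forward, so nothing from earlier windows is recomputed.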

A Smarter Way to Balance What to Learn

Distillation typically aligns the student’s outputs with the teacher’s outputs. This work goes further by transferring knowledge from three sources: action logits, intermediate attention maps, and region-of-interest features.

Instead of fixed loss weights, the paper introduces uncertainty-guided weighting. Each loss term has a learnable uncertainty parameter. Less reliable signals are down-weighted automatically, while stable signals are emphasized.
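The post does not give the exact formula, so the sketch below uses one common formulation of uncertainty-based loss weighting as a stand-in: each term is scaled by exp(-s) and regularized by +s, where s is a learnable log-variance. Treat this as an assumption about the general technique, not as the paper's loss. The three loss values and the plain gradient-descent loop are hypothetical.

```python
import numpy as np


def uncertainty_weighted_loss(losses, log_vars):
    """Combine loss terms as sum_i exp(-s_i) * L_i + s_i.

    s_i is a learnable log-variance: a large s_i shrinks the weight of term i,
    and the + s_i penalty stops every weight from collapsing to zero.
    """
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    weights = np.exp(-log_vars)
    total = float(np.sum(weights * losses + log_vars))
    return total, weights


def log_var_gradient(losses, log_vars):
    """Gradient of the combined loss with respect to each s_i."""
    return 1.0 - np.exp(-log_vars) * losses


# Hypothetical loss magnitudes for the three distillation signals.
losses = np.array([0.5, 2.0, 4.0])   # logits, attention maps, RoI features
log_vars = np.zeros(3)               # start with equal weights of 1.0

for _ in range(500):                 # plain gradient descent on s only
    log_vars -= 0.1 * log_var_gradient(losses, log_vars)

total, weights = uncertainty_weighted_loss(losses, log_vars)
print(weights)
```

With the losses held fixed, each weight settles near the reciprocal of its loss, so terms with persistently large loss are down-weighted. In real training the losses change as the student improves, and the weights track them.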

This leads to a clear pattern: early transformer layers receive more weight, while later layers are deemphasized. This aligns with the idea that foundational features are more important to transfer.

Results: Fast Enough for the Real World

On the AVA v2.2 benchmark, the student models match the accuracy of much larger models while running more than 4x faster. The lightest version reaches 217 FPS, making it viable for edge deployment.
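A throughput figure is easier to compare against a latency budget once it is converted to time per frame. The helper below is a simple conversion, assuming frames are processed one at a time so that throughput and per-frame latency are reciprocal. The 30 FPS camera rate is a hypothetical reference point, not a figure from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Convert frames per second into milliseconds available per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


model_ms = frame_budget_ms(217)   # lightest student, as reported in the post
camera_ms = frame_budget_ms(30)   # assumed camera frame rate
print(f"model: {model_ms:.2f} ms per frame, camera interval: {camera_ms:.2f} ms")
```

Under that assumption, the lightest student needs under 5 ms per frame, well inside the interval between frames from a typical camera.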

The models also generalize well to new domains without additional labeled data. On a pedestrian dataset, TSA comes within 3 points of the teacher’s accuracy with only 870 training clips.

What This Enables

This work narrows the gap between offline accuracy and real-time performance.

Applications include:

* Pedestrian intent prediction before actions occur
* Real-time robotics perception
* Live video monitoring on limited hardware

The framework itself is general. Any high-performance offline model can serve as a teacher, enabling deployment in streaming environments.

Limitations and Next Steps

Two limitations remain:

* Reliance on a separate person detector for bounding boxes
* Limited performance on very long temporal sequences

Future work may include end-to-end architectures and longer-range memory models.

The broader takeaway is that high-performance models can be adapted for real-time use through principled distillation.
