Deep Patel NEC Labs AmericaDeep Patel is a Senior Associate Researcher in the Machine Learning Department at NEC Laboratories America in Princeton, NJ. He earned his Bachelor of Science (BS) in Computer Science from Towson University.

At NEC, Deep contributes to platforms for intelligent visual analytics, visual search, and vision-language interaction, helping to develop video-based reasoning models that operate in real-time across multi-camera systems.

His work includes optimizing neural architectures for embedded systems and designing scalable inference pipelines for video AI applications. He plays a key role in bringing NEC’s media analytics solutions from lab prototypes to production-ready systems used in smart cities and enterprise monitoring.

Posts

NEC Labs America Attends CVPR 2026 in Denver, CO June 3-7, 2026

NEC Labs America is heading to Denver for CVPR 2026, one of the most prestigious gatherings in computer vision, machine learning, and pattern recognition. The IEEE/CVF Conference on Computer Vision and Pattern Recognition brings innovators from around the world to share breakthroughs.

Training Small AI Models Without Blindly Trusting Big Teacher Models

Machine learning is shifting from learning from data alone to learning from both data and teacher models. Beta-KD uses uncertainty-aware Bayesian weighting to train compact multimodal AI without blindly trusting every teacher signal.

Making Video AI Fast Enough for the Real World

State-of-the-art video models are accurate but too slow for live deployment. This work transfers their knowledge into causal streaming models that process video frames in real time, achieving 4x lower latency with competitive accuracy across action detection and pedestrian intent tasks.

Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods.

Distilling Offline Action Detection Models into Real-Time Streaming Models

Vision Transformers (ViTs) have achieved state-of-the-art performance in offline video action detection, but their reliance on processing fixed-size clips with full spatio-temporal attention makes them computationally expensive and ill-suited for real-time streaming applications due to massive computational redundancy. This paper introduces a novel framework to adapt these powerful offline models into efficient, online student models through knowledge distillation. We propose two causal, streaming-friendly attention architectures that replace the full self-attention mechanism: (1) a lightweight Temporal Shift Attention that integrates past context with minimal overhead, and (2) a Decomposed Spatial-Temporal Attention that combines intra-frame spatial attention with an LSTM for temporal modeling. Both architectures utilize caching to eliminate redundant operations on a frame-by-frame basis. To maximize knowledge transfer, we introduce an uncertainty-guided distillation process, which formulates the training as a multi-task learning problem. Our resulting models demonstrate significant efficiency gains, achieving up to a4x improvement in latency and throughput compared to the original offline teacher while ensuring state-of-the-art performance for online methods. Our work provides a practical and effective methodology for deploying high-accuracy transformer models in latency-sensitive, real-world video analysis systems.

Object-Aware 4D Human Motion Generation

Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.

DiscussLLM: Teaching Large Language Models When to Speak

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an “awareness gap,” limiting their potential as truly collaborative partners in dynamic human discussions. We introduce , a framework designed to bridge this gap by training models to proactively decide not just to say, but critically, to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.

Group Relative Augmentation for Data Efficient Action Detection

Adapting large Video-Language Models (VLMs) for action detection using only a few examples poses challenges like overfitting and the granularity mismatch between scene-level pre-training and required person-centric understanding. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation. Applied within the frozen VLM backbone using FiLM, these augmentations generate diverse feature variations directly relevant to the task. Additionally, we introduce a group-weighted loss function that dynamically modulates the training contribution of each augmented sample based on its prediction divergence relative to the group average. This promotes robust learning by prioritizing informative yet reasonable augmentations. We demonstrate our method’s effectiveness on complex multi-label, multi-person action detection datasets (AVA, MOMA), achieving strong mAP performance and showcasing significant data efficiency for adapting VLMs from limited examples.

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection (WACV)

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and temporal OpenMixer blocks (S-OMBand T-OMB), and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end to-end learning from DETR design. Moreover, we established OVAD benchmarks under various settings, and the experimental results show that the OpenMixer performs the best over baselines for detecting seen and unseen actions.

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and temporal OpenMixer blocks (S-OMB and T-OMB), and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from DETR design. Moreover, we established OVAD benchmarks under various settings, and the experimental results show that the OpenMixer performs the best over baselines for detecting seen and unseen actions.