NEC Labs America is heading to Denver for CVPR 2026, one of the most prestigious gatherings in artificial intelligence and computer science. The IEEE/CVF Conference on Computer Vision and Pattern Recognition brings together researchers, engineers, and innovators from around the world to share breakthroughs in computer vision, machine learning, and pattern recognition.
Running June 3 through June 7, CVPR 2026 is a premier destination for anyone working at the frontier of visual AI. The conference draws thousands of attendees across workshops, tutorials, demos, and an expansive expo floor, making it one of the most dynamic events in the field. For us, it represents a valuable opportunity to connect with the global research community, explore cutting-edge developments, and bring fresh insights back to the region. Stay tuned for updates, takeaways, and highlights from our time at CVPR 2026.
Presentations
AUTOPILOT Workshop
Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks
- Workshop in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026 – Denver)
- Manmohan Chandraker, Speaker
- Ali K. AlShami, Organizing Committee
- June 3, 2026
- Website: https://www.autopilot-cvpr.net/
AUTOPILOT is a workshop on safety-critical autonomous driving, bringing together academia and industry to explore robust perception, prediction, decision-making, and motion planning. The workshop highlights foundation and generative models for real-world deployment, with a focus on open-world hazards, ethics, and reproducible evaluation.
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
- Poster Presentation
- Authors: Jingchen Sun (Presenter), Shaobo Han (Pictured, Presenter), Deep Patel, Wataru Kohno, Can Jin, Changyou Chen
- Code is available at https://github.com/Jingchensun/beta-kd
Abstract: Knowledge distillation establishes a learning paradigm that learns from both data supervision and teacher guidance. However, the optimal weighting between learning from data and learning from the teacher is hard to determine, as some samples are data-noisy while others are teacher-uncertain. This raises a pressing need to adaptively balance data and teacher supervision. We propose Beta-weighted Knowledge Distillation (β-KD), an adaptive, uncertainty-aware knowledge distillation framework that supports arbitrary distillation objectives under a unified Bayesian formulation. Specifically, we model teacher signals as a Gibbs prior over student activations and use amortized optimization to jointly infer activations and weighting parameters, leading to a closed-form, uncertainty-aware weighting. Extensive experiments distilling a 1.7B-parameter student from MobileVLM-7B demonstrate that β-KD consistently outperforms existing methods under different loss combination settings. Moreover, large-scale distillation and evaluations on six multimodal benchmarks further confirm the effectiveness of the proposed approach.
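To make the idea of balancing data and teacher supervision concrete, here is a minimal Python sketch. It is not the paper's method: the abstract describes a closed-form Bayesian weighting, whereas this example substitutes a simple entropy-based heuristic. The function names (`teacher_weight`, `distillation_loss`) and the heuristic itself are assumptions made purely for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Data-supervision term: negative log-likelihood of the true label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """Teacher-guidance term: KL(p || q), with p the teacher distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def teacher_weight(teacher_probs):
    """Hypothetical heuristic (NOT the paper's weighting): trust the teacher
    less when its prediction is close to uniform, i.e. high entropy."""
    max_entropy = math.log(len(teacher_probs))
    return 1.0 - entropy(teacher_probs) / max_entropy


def distillation_loss(student_logits, teacher_logits, label):
    """Per-sample convex combination of the data term and the teacher term."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    w = teacher_weight(teacher)
    data_term = cross_entropy(student, label)
    teacher_term = kl_divergence(teacher, student)
    return (1.0 - w) * data_term + w * teacher_term, w
```

Under this heuristic, a sample where the teacher is nearly uniform is trained almost entirely from the label, while a sample where the teacher is confident leans on the teacher term. The paper instead infers the weighting jointly with the student activations.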
Object-Aware 4D Human Motion Generation
- Virtual Presentation
- Deep Patel
- June 4, 2026
- Paper: https://www.nec-labs.com/blog/object-aware-4d-human-motion-generation/
Abstract: Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
Anomaly Detection with Foundation Models (ADFM) Workshop
- Workshop in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026 – Denver)
- Abhishek Aich, Organizing Committee
- June 4, 2026
- Website: https://adfmw.github.io/cvpr26/
Abhishek Aich of our Media Analytics department is serving as an organizer of the Anomaly Detection with Foundation Models (ADFM) Workshop at CVPR 2026 in Denver on June 4th. Foundation models have rapidly transformed fields ranging from healthcare and cybersecurity to finance and industrial systems. Yet one critical capability remains underexplored: the use of these powerful models for anomaly detection. As organizations increasingly rely on AI in high-stakes environments, the ability to identify unusual patterns, out-of-distribution inputs, and edge cases becomes essential to safety and reliability. The ADFM 2026 workshop addresses this gap directly by providing a dedicated forum for researchers and practitioners to share recent breakthroughs, examine technical and ethical implications, and explore paths toward more robust and explainable anomaly detection systems.
AV Simulation Team
The NEC Laboratories America AV simulation team, led by Manmohan Chandraker, will present multiple papers at CVPR on agentic simulation. Hit us up to chat about training and validation on the long tail: world models, 3D scene editing, diffusion models, and embodied and physical AI.
Team: Zaid Tasneem, Ziyu Jiang, Aniket Roy, and Francesco Pittaluga.
LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
- Project Leads: Yun He (intern), Zaid Tasneem
- Thursday, June 4th from 3:15 to 4:15 PM, SAD Workshop, Room 102/104
- Project: https://yunhe24.github.io/langdrivectrl/
Abstract: LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It represents each video as an explicit 3D scene graph, decomposing the scene into a static background and dynamic object nodes. To enable fine-grained editing and realism, it introduces a feedback-driven agentic pipeline. An Orchestrator converts user instructions into executable graphs that coordinate specialized multi-modal agents and tools. An Object Grounding Agent aligns free-form text with target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and harmonized using a video diffusion tool, and then further refined by a Video Reviewer Agent to ensure photorealism and appearance alignment. LangDriveCTRL supports both object node editing (removal, insertion, and replacement) and multi-object behavior editing from natural-language instructions. Quantitatively, it achieves higher instruction alignment than the previous SoTA, with superior photorealism, structural preservation, and traffic realism.
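The scene-graph representation described above, a static background plus dynamic object nodes that can be removed, inserted, or replaced, can be pictured with a small Python sketch. This is only an illustrative data structure, not the LangDriveCTRL implementation: the class names `ObjectNode` and `SceneGraph`, their fields, and the 2D trajectory format are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """A dynamic object in the scene (hypothetical fields)."""
    node_id: str
    category: str
    trajectory: list  # assumed format: one (x, y) position per frame


@dataclass
class SceneGraph:
    """Static background plus dynamic object nodes keyed by id."""
    background: str
    objects: dict = field(default_factory=dict)

    def insert(self, node):
        """Object insertion: add a new dynamic node."""
        if node.node_id in self.objects:
            raise ValueError(f"node {node.node_id!r} already exists")
        self.objects[node.node_id] = node

    def remove(self, node_id):
        """Object removal: drop a node and return it."""
        return self.objects.pop(node_id)

    def replace(self, node_id, category):
        """Object replacement: swap the category, keep the trajectory."""
        old = self.objects[node_id]
        self.objects[node_id] = ObjectNode(node_id, category, list(old.trajectory))


graph = SceneGraph(background="static_street")
graph.insert(ObjectNode("car_1", "sedan", [(0.0, 0.0), (1.0, 0.0)]))
graph.replace("car_1", "truck")  # same path, different vehicle
```

Keeping objects as separate nodes is what lets an edit touch one vehicle without disturbing the background. In the described pipeline, agents would decide which node an instruction refers to and what trajectory to assign, and a video diffusion tool would render the result.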
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
- Project Leads: Mauricio Soroco (intern), Ziyu Jiang
- Friday, June 5 from 7:00 to 8:30 AM, Findings Posters, ExHall A
- Project: https://msoroco.github.io/horizonweaver/
- Paper: https://www.nec-labs.com/blog/horizonweaver-generalizable-multi-level-semantic-editing-for-driving-scenes/
Abstract: Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%.
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
- Project Leads: Yifan Wang (intern), Ziyu Jiang
- Saturday, June 6th from 4:45 to 6:45 PM, Poster Session 4 (Main), ExHall A
- Project: https://horizonforge.github.io/
- Paper: https://www.nec-labs.com/blog/horizonforge-driving-scene-editing-with-any-trajectories-and-any-vehicles/
- Collaborators: Matthias Zwicker, Chenyu You, Wuyang Chen, Abhishek Aich, Bingbing Zhuang.
Abstract: Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second-best state-of-the-art method.