vision language models Archives

Vision-Language Models (VLMs) are models that integrate both visual and textual data, enabling them to convert visual inputs, such as traffic camera footage, into textual descriptions. VLMs are crucial for interpreting video data in a form that can be processed by text-based systems like Large Language Models (LLMs). In traffic video analysis, VLMs are applied to convert large amounts of video footage into meaningful text, which can then be used for tasks such as traffic management and incident investigation.

Posts

Open SAT: How We Taught AI to Search Satellite Images Like a Search Engine

June 3, 2026/in News/by NEC Labs America

Satellite imagery is vast, high-resolution, and rich with information, but finding specific objects within it using natural language has remained a stubborn challenge. Open-SAT, developed by researchers at NEC Laboratories America and North South University, tackles this problem without retraining any models.

Training Small AI Models Without Blindly Trusting Big Teacher Models

May 27, 2026/in News/by NEC Labs America

Machine learning is shifting from learning from data alone to learning from both data and teacher models. Beta-KD uses uncertainty-aware Bayesian weighting to train compact multimodal AI without blindly trusting every teacher signal.

Celebrating the Women of NEC Laboratories America

March 5, 2026/in News/by NEC Labs America

NEC Laboratories America celebrates Women’s History Month and International Women’s Day by recognizing the women researchers, engineers, and interns whose work in artificial intelligence, optical networking, cybersecurity, and data science is helping shape the future of technology.

TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

September 24, 2024/in Publications/by NEC Labs America

Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy.

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

June 17, 2024/in Publications/by NEC Labs America

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method’s superior performance at a reduced cost.

Posts

Open SAT: How We Taught AI to Search Satellite Images Like a Search Engine

Training Small AI Models Without Blindly Trusting Big Teacher Models

Celebrating the Women of NEC Laboratories America

TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Contact Us

About Us

Our Pages

Recent Publications

Events

News

Tag Archive for: vision language models

Posts

Contact Us

About Us

Our Pages

Recent Publications

Events

News