Overview: We develop an agentic LLM that solves complex workflows by orchestrating a combination of computer vision, logic, and compute modules. Given a natural language task specification, our LLM generates a plan that accomplishes the task using the available tools.
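The plan-then-execute loop described above can be sketched as follows. This is a minimal illustration, not the actual system: the tool names, the stand-in tool implementations, and the hard-coded plan are all hypothetical, and a real agent would obtain the plan from the LLM given the task specification.

```python
# Minimal sketch of tool-based plan execution for an agentic workflow.
# All tool names and the example plan are hypothetical stand-ins; a
# real system would have the LLM produce the plan from the task spec.

def detect_objects(image_id):
    """Stand-in for a computer-vision module."""
    return ["cat", "laptop"]

def count_items(items):
    """Stand-in for a compute module."""
    return len(items)

TOOLS = {"detect_objects": detect_objects, "count_items": count_items}

def execute_plan(plan):
    """Run a plan given as a list of (tool_name, args) steps.

    A step may reference the previous step's result via the
    placeholder string "$prev".
    """
    result = None
    for tool_name, args in plan:
        args = [result if a == "$prev" else a for a in args]
        result = TOOLS[tool_name](*args)
    return result

# Example task: "How many objects are in image 42?"
plan = [("detect_objects", ["img_42"]), ("count_items", ["$prev"])]
print(execute_plan(plan))  # prints 2 with the stand-in tools above
```

Keeping the tool registry as plain functions makes the executor independent of which vision, logic, or compute modules are plugged in.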
Overview: We develop embodied agents for robotics applications that require exploration, navigation, and transport in complex scenes. Our modular hierarchical transport policy builds a topological graph of the scene during exploration, then combines motion planning algorithms, which reach point goals within explored locations, with object navigation policies that move toward semantic targets at unknown locations.
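The hierarchical dispatch above can be sketched as follows, under stated assumptions: the topological graph, the room and object names, and the one-line object-navigation stand-in are illustrative, not the actual learned policy.

```python
from collections import deque

# Sketch of hierarchical navigation: plan a point-goal path over the
# topological graph if the target was observed during exploration,
# otherwise fall back to a semantic object-navigation policy.
# Graph, object positions, and the object-nav stand-in are toy data.

def shortest_path(graph, start, goal):
    """BFS point-goal planning over the topological graph."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def object_nav(target):
    """Stand-in for a learned semantic object-navigation policy."""
    return ["<explore toward '%s'>" % target]

def navigate(graph, positions, start, target):
    """Dispatch: point-goal planning for known targets, object nav otherwise."""
    if target in positions:
        return shortest_path(graph, start, positions[target])
    return object_nav(target)

graph = {"hall": ["kitchen", "office"], "kitchen": ["hall"], "office": ["hall"]}
positions = {"mug": "kitchen"}  # objects seen while exploring
print(navigate(graph, positions, "office", "mug"))   # known target: graph path
print(navigate(graph, positions, "office", "keys"))  # unknown target: object nav
```

The topological graph keeps planning cheap for places the agent has already seen, while the semantic policy handles everything outside the explored map.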
Overview: Our foundation models make computer vision broadly usable across scenarios, applications, and user preferences.
Overview: We develop open vocabulary perception methods that combine the power of vision and language to provide rich descriptions of objects in scenes, including their attributes, behaviors, relations and interactions.
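One common way to realize open-vocabulary perception is to score image regions against free-form text in a shared vision-language embedding space (CLIP-style). The sketch below uses toy embedding vectors in place of trained encoders; the labels and numbers are illustrative assumptions, not outputs of the actual models.

```python
import numpy as np

# Sketch of open-vocabulary recognition via a shared vision-language
# embedding space. The vectors here are toy stand-ins; a real system
# would embed regions and phrases with trained image/text encoders.

def normalize(v):
    """Unit-normalize vectors so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical text embeddings for free-form descriptions; any phrase
# can serve as a "class", which is what makes the vocabulary open.
text_labels = ["a red mug", "a sleeping cat", "a person riding a bike"]
text_emb = normalize(np.array([[0.9, 0.1, 0.0],
                               [0.0, 1.0, 0.2],
                               [0.1, 0.0, 1.0]]))

# Hypothetical embedding of one detected object region.
region_emb = normalize(np.array([0.05, 0.95, 0.25]))

# Cosine similarity of the region against every description.
scores = text_emb @ region_emb
best = text_labels[int(np.argmax(scores))]
print(best)  # "a sleeping cat" for these toy vectors
```

The same scoring extends beyond object names to attributes, relations, and interactions simply by embedding richer phrases as the text side.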