Foundational Vision-Language Models

Our foundational models bring computer vision to a wide range of scenarios, applications and user preferences. By combining very large-scale computer vision and natural language datasets with innovations in visual instruction following, they yield deeper domain-specific insights at lower data center cost and with fewer hallucinations. They power applications in road accident analysis, insurance, safety and law enforcement through rich descriptions of scenes and humans, question-answering abilities and localizable insights.

Team Member: Vijay Kumar BG

Publication Tag: Lucy

Foundational Vision-Language Models (Lucy)

Featured Publications

Taming Self-Training for Open-Vocabulary Object Detection

June 17, 2024/CVPR 2024

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD.

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

June 17, 2024/CVPR 2024

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive ...

Generating Enhanced Negatives for Training Language-Based Object Detectors

June 16, 2024/CVPR 2024

The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative ...

Exploring Question Decomposition for Zero-Shot VQA

December 11, 2023/NeurIPS 2023

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently ...

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

June 18, 2023/CVPR 2023

Finetuning a large vision-language model (VLM) on a target dataset after large-scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non-natural-image domains are orders of magnitude smaller than those for general-purpose ...

Split to Learn: Gradient Split for Multi-Task Human Image Analysis

January 3, 2023/WACV 2023

This paper presents an approach to train a unified deep network that simultaneously solves multiple human-related tasks. A multi-task framework is favorable for sharing information across tasks under restricted computational resources. However, tasks not only share information but may also compete for ...

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

October 24, 2022/ECCV 2022

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available ...

Single-Stream Multi-level Alignment for Vision-Language Pretraining

October 24, 2022/ECCV 2022

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable ...
