Exploiting Unlabeled Data with Vision and Language Models for Object Detection

We propose a simple but effective way to mine unlabeled images with recently proposed vision and language (V&L) models, generating pseudo labels for both known and novel categories. The approach suits two tasks: semi-supervised object detection (SSOD) and open-vocabulary detection (OVD). Our contributions are as follows: 1) We leverage V&L models to improve object detection frameworks by generating pseudo labels on unlabeled data. 2) We introduce a simple but effective strategy to improve the localization quality of pseudo labels scored with the V&L model CLIP. 3) We achieve state-of-the-art results for novel categories in the COCO open-vocabulary detection setting. 4) We showcase the benefits of our method, VL-PLM, in a semi-supervised object detection setting.

Collaborators: Shiyu Zhao, Zhixing Zhang, Long Zhao, Vijay Kumar B.G., Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas

Exploiting Unlabeled Data with Vision and Language Models for Object Detection Paper

Shiyu Zhao¹ Zhixing Zhang¹ Samuel Schulter² Long Zhao³ Vijay Kumar B.G.² Anastasis Stathopoulos¹ Manmohan Chandraker²,⁴ Dimitris Metaxas¹

¹ Rutgers University ² NEC Laboratories America ³ Google Research ⁴ UC San Diego

ECCV 2022

[ Paper ] [ Code & models ]

Abstract

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks: open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a new state of the art for open-vocabulary object detection.

Overview

Overview of the proposed VL-PLM, which mines unlabeled images with vision and language (V&L) models to generate pseudo labels for object detection. The top part illustrates our class-agnostic proposal generator, which improves pseudo label localization by using the class-agnostic proposal score and by repeatedly applying the RoI head. The bottom part illustrates how cropped regions are scored with the V&L model against the target category names. The category names can be chosen to suit the desired downstream task. After thresholding and non-maximum suppression (NMS), we obtain the final pseudo labels. For some tasks, such as SSOD, we merge external pseudo labels from a teacher model with ours before thresholding and NMS.
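The scoring, thresholding, and NMS steps described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes that region and category-name embeddings have already been computed by a V&L model such as CLIP, and the function name `make_pseudo_labels`, the simple averaging of the two scores, the temperature, and both thresholds are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def iou_one_to_many(box, boxes):
    # IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_thr):
    # Greedy non-maximum suppression; returns indices of kept boxes.
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thr]
    return keep


def make_pseudo_labels(boxes, proposal_scores, region_emb, text_emb,
                       score_thr=0.8, iou_thr=0.5, temperature=0.01):
    # Cosine similarity between each cropped region and each category name.
    region_emb = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    probs = softmax(region_emb @ text_emb.T / temperature, axis=1)
    labels = probs.argmax(axis=1)
    vl_scores = probs.max(axis=1)
    # Assumption: fuse the V&L score and the class-agnostic proposal
    # score by a plain average; the actual fusion rule may differ.
    fused = 0.5 * (vl_scores + proposal_scores)
    confident = fused >= score_thr
    results = []
    for c in np.unique(labels[confident]):
        idx = np.where(confident & (labels == c))[0]
        for k in nms(boxes[idx], fused[idx], iou_thr):
            results.append((boxes[idx[k]].tolist(), int(c), float(fused[idx[k]])))
    return results
```

Each returned tuple holds a box, a category index into the list of target names, and the fused score. Changing the rows of `text_emb` is all that is needed to switch the target categories, which mirrors how the category names are adjusted per downstream task.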

Main Results

Visualizations of the pseudo labels (PLs) from VL-PLM. Only boxes for target categories in the scene are shown. (a) Good cases. All target objects are located with appropriate boxes. (b) The most common types of failure cases in our PLs, i.e., part domination, redundant boxes, missing instances, and grouped instances.

Quantitative evaluation of our VL-PLM approach on the open-vocabulary object detection task on the COCO 2017 dataset.

Quantitative evaluation of our VL-PLM approach on the semi-supervised object detection task on the COCO 2017 dataset.

Visualization of the final detection results. Only boxes for target categories in the scene are shown. (a) Novel categories as the target. (b) Base categories as the target. The major failure cases fall into three types: missing instances, redundant boxes, and grouped instances.

Acknowledgements

All images were taken from the public MS COCO dataset. Please see the dataset for more information on the individual image sources. This webpage template is inspired by Colorful Image Colorization.
