Publication Date: 8/2/2023
Authors: Shiyu Zhao, Rutgers University, NEC Laboratories America, Inc.; Samuel Schulter, NEC Laboratories America, Inc.; Long Zhao, Google Research; Zhixing Zhang, Rutgers University, NEC Laboratories America, Inc.; Vijay Kumar BG, NEC Laboratories America, Inc.; Yumin Suh, NEC Laboratories America, Inc.; Manmohan Chandraker, NEC Laboratories America, Inc., UC San Diego; Dimitris Metaxas, Rutgers University
Abstract: Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances on PLs. In this paper, we aim to reduce the noise in PLs and propose a method called online Self-training And a Split-and-fusion head for OVD (SAS-Det). First, the self-training finetunes VLMs to generate high-quality PLs while preventing forgetting of the knowledge learned during pretraining. Second, a split-and-fusion (SAF) head is designed to remove the noise in the localization of PLs, which is usually ignored in existing methods. It also fuses complementary knowledge learned from both precise ground truth and noisy pseudo labels to boost performance. Extensive experiments demonstrate that SAS-Det is both efficient and effective. Our pseudo labeling is 3 times faster than prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin and achieves 37.4 AP50 and 27.3 APr on novel categories of the COCO and LVIS benchmarks, respectively.
Publication Link: https://arxiv.org/pdf/2308.06412.pdf