Improving Language-Based Object Detection by Explicit Generation of Negative Examples

Publication Date: 12/21/2023

Event: https://arxiv.org

Reference: https://arxiv.org/abs/2308.06412

Authors: Shiyu Zhao, NEC Laboratories America, Inc.; Long Zhao, Google; Vijay Kumar B.G., NEC Laboratories America, Inc.; Yumin Suh, NEC Laboratories America, Inc.; Dimitris N. Metaxas, Rutgers University; Manmohan Chandraker, NEC Laboratories America, Inc.; Samuel Schulter, NEC Laboratories America, Inc.

Abstract: The recent progress in language-based object detection with an open vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enables models to reason over an open vocabulary and understand object descriptions in free-form text. In this work, we investigate the role of negative captions in training such language-based object detectors. While the fixed label space in standard object detection datasets clearly defines the set of negative classes, the free-form text used for language-based detection makes the space of potential negatives virtually infinite. We propose to leverage external knowledge bases and large language models to automatically generate contradictions for each caption in the training dataset. Furthermore, we use image-generation tools to create negative images corresponding to the contradicting captions. The automatically generated data constitute hard negative examples for language-based detection and improve the model when used for training. Our experiments demonstrate the benefits of this automatically generated training data on two complex benchmarks.
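To make the idea of contradicting captions concrete, here is a minimal toy sketch. The paper itself relies on external knowledge bases and large language models to produce contradictions; the hand-written attribute table and the `negative_captions` helper below are simplified stand-ins for illustration only, not the authors' pipeline.

```python
# Toy illustration: generate hard-negative captions by contradicting
# one attribute at a time. A tiny hand-written antonym table stands in
# for the knowledge bases / LLMs used in the actual paper.

ATTRIBUTE_SWAPS = {
    "black": "white",
    "white": "black",
    "large": "small",
    "small": "large",
    "wooden": "metal",
}

def negative_captions(caption: str) -> list[str]:
    """Return captions that each contradict exactly one attribute."""
    words = caption.split()
    negatives = []
    for i, word in enumerate(words):
        swap = ATTRIBUTE_SWAPS.get(word.lower())
        if swap:
            # Replace a single attribute, keeping the rest of the caption.
            negatives.append(" ".join(words[:i] + [swap] + words[i + 1:]))
    return negatives

print(negative_captions("a black dog on a large sofa"))
# each output caption contradicts exactly one attribute of the original
```

Each generated caption is a hard negative: it is nearly identical to the positive caption, so the detector must attend to the contradicted attribute rather than to surface-level text similarity.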