Long Zhao works at Google Research .


Improving Language-Based Object Detection by Explicit Generation of Negative Examples

The recent progress in language-based object detection with an open-vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enable the models to reason over an open-vocabulary and understand object descriptions in free-form text. In this work, we investigate the role of negative captions for training such language-based object detectors. While the fixed label space in standard object detection datasets clearly defines the set of negative classes, the free-form text used for language-based detection makes the space of potential negatives virtually infinite in size. We propose to leverage external knowledge bases and large-language-models to automatically generate contradictions for each caption in the training dataset. Furthermore, we leverage image-generate tools to create corresponding negative images to the contradicting caption. Such automatically generated data constitute hard negative examples for language-based detection and improve the model when trained from. Our experiments demonstrate the benefits of the automatically generated training data on two complex benchmarks.

Improving Pseudo Labels for Open-Vocabulary Object Detection

Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances on PLs. In this paper, we aim to reduce the noise in PLs and propose a method called online Self-training And a Split-and-fusion head for OVD (SAS-Det). First, the self-training finetunes VLMs to generate high quality PLs while prevents forgetting the knowledge learned in the pretraining. Second, a split-and-fusion (SAF) head is designed to remove the noise in localization of PLs, which is usually ignored in existing methods. It also fuses complementary knowledge learned from both precise ground truth and noisy pseudo labels to boost the performance. Extensive experiments demonstrate SAS-Det is both efficient and effective. Our pseudo labeling is 3 times faster than prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin and achieves 37.4 AP50 and 27.3 APr on novel categories of the COCO and LVIS benchmarks, respectively.

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model’s strength and interpretability on two widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality (Code is available at https://github.com/hongluzhou/composer.).