Vision and Language refer to the integration and joint processing of visual and textual information. This interdisciplinary field focuses on developing models and algorithms that can effectively understand, interpret, and generate content that involves both visual data (such as images or videos) and language data (such as natural language text).

The convergence of vision and language leverages deep neural networks, including convolutional neural networks (CNNs) for visual processing and recurrent neural networks (RNNs) or transformer models for language understanding. This integration facilitates a more comprehensive understanding of the content in multimodal data and opens up opportunities for various applications, including image understanding, content generation, and human-machine interaction. Advances in this field have led to the development of powerful multimodal models capable of handling diverse tasks involving both vision and language.
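As a rough illustration of this kind of architecture (not tied to any particular model discussed below), the following PyTorch sketch pairs a small CNN image encoder with a Transformer text encoder and scores image-caption similarity in a shared embedding space. All module sizes, the vocabulary size, and the class name are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

class TinyVisionLanguageModel(nn.Module):
    """Toy dual-encoder: a small CNN for images, token embeddings plus a
    Transformer encoder for text, projected into a shared space for matching."""

    def __init__(self, vocab_size=1000, embed_dim=128):
        super().__init__()
        # Visual branch: a small CNN followed by global average pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.img_proj = nn.Linear(64, embed_dim)
        # Language branch: token embeddings + one Transformer encoder layer.
        self.tok_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=1,
        )

    def forward(self, images, token_ids):
        img_feat = self.img_proj(self.cnn(images).flatten(1))               # (B, D)
        txt_feat = self.text_encoder(self.tok_embed(token_ids)).mean(dim=1) # (B, D)
        # Cosine similarity between every image and every caption in the batch.
        img_feat = nn.functional.normalize(img_feat, dim=-1)
        txt_feat = nn.functional.normalize(txt_feat, dim=-1)
        return img_feat @ txt_feat.T                                         # (B, B)

if __name__ == "__main__":
    model = TinyVisionLanguageModel()
    images = torch.randn(2, 3, 64, 64)          # two RGB images
    captions = torch.randint(0, 1000, (2, 8))   # two 8-token captions
    print(model(images, captions).shape)        # torch.Size([2, 2])
```

Dual-encoder designs like this underlie many image-text matching models; generative and detection-oriented models add task-specific heads on top of similar encoders.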

Posts

Improving Pseudo Labels for Open-Vocabulary Object Detection

Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further progress from PLs. In this paper, we aim to reduce the noise in PLs and propose a method called online Self-training And a Split-and-fusion head for OVD (SAS-Det). First, the self-training finetunes VLMs to generate high-quality PLs while preventing forgetting of the knowledge learned during pretraining. Second, a split-and-fusion (SAF) head is designed to remove localization noise in PLs, which is usually ignored by existing methods. It also fuses complementary knowledge learned from both precise ground truth and noisy pseudo labels to boost performance. Extensive experiments demonstrate that SAS-Det is both efficient and effective. Our pseudo labeling is three times faster than prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin and achieves 37.4 AP50 and 27.3 APr on novel categories of the COCO and LVIS benchmarks, respectively.
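For context, pseudo labels in this setting are typically produced by scoring class-agnostic region proposals against category-name embeddings from a pretrained VLM and keeping only confident matches. The sketch below illustrates that general idea only; it is not the SAS-Det pipeline, and the function name, threshold, and stand-in tensors are placeholders for whatever region and text embeddings a real VLM would supply.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_labels(region_feats, text_feats, class_names, score_thresh=0.5):
    """Assign an open-vocabulary pseudo label to each region proposal by
    matching its embedding against category-name embeddings from a VLM.

    region_feats: (R, D) embeddings of R region proposals (e.g., VLM image features)
    text_feats:   (C, D) embeddings of C category names (from the VLM text encoder)
    Returns a list of (region_index, class_name, score) for confident matches only.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sims = region_feats @ text_feats.T            # (R, C) cosine similarities
    probs = sims.softmax(dim=-1)                  # per-region distribution over categories
    scores, labels = probs.max(dim=-1)

    pseudo_labels = []
    for i, (score, label) in enumerate(zip(scores.tolist(), labels.tolist())):
        # Keep only confident matches; low-score (likely noisy) regions are discarded.
        if score >= score_thresh:
            pseudo_labels.append((i, class_names[label], score))
    return pseudo_labels

if __name__ == "__main__":
    torch.manual_seed(0)
    class_names = ["cat", "umbrella", "skateboard"]
    region_feats = torch.randn(5, 64)   # stand-in for VLM region embeddings
    text_feats = torch.randn(3, 64)     # stand-in for VLM text embeddings
    print(generate_pseudo_labels(region_feats, text_feats, class_names, score_thresh=0.4))
```

The paper's contribution targets exactly the weaknesses of this naive scheme: the classification noise (via finetuning with online self-training) and the localization noise (via the split-and-fusion head), neither of which simple thresholding addresses.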