Improving Language-Based Object Detection by Explicit Generation of Negative Examples

The recent progress in language-based object detection with an open-vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enable the models to reason over an open-vocabulary and understand object descriptions in free-form text. In this work, we investigate the role of negative captions for training such language-based object detectors. While the fixed label space in standard object detection datasets clearly defines the set of negative classes, the free-form text used for language-based detection makes the space of potential negatives virtually infinite in size. We propose to leverage external knowledge bases and large-language-models to automatically generate contradictions for each caption in the training dataset. Furthermore, we leverage image-generate tools to create corresponding negative images to the contradicting caption. Such automatically generated data constitute hard negative examples for language-based detection and improve the model when trained from. Our experiments demonstrate the benefits of the automatically generated training data on two complex benchmarks.

Scale Up while Scaling Out Microservices in Video Analytics Pipelines

Modern video analytics applications comprise multiple microservices chained together as pipelines and executed on container orchestration platforms like Kubernetes. Kubernetes automatically handles the scaling of these microservices for efficient application execution. There are two popular choices for scaling microservices in Kubernetes i.e. scaling Out using Horizontal Pod Autoscaler (HPA) and scaling Up using Vertical Pod Autoscaler (VPA). Both these have been studied independently, but there isn’t much prior work studying the joint scaling of these two. This paper investigates joint scaling, i.e., scaling up while scaling out (HPA) is in action. In particular, we focus on scaling up CPU resources allocated to the application microservices. We show that allocating fixed resources does not work well for different workloads for video analytics pipelines. We also show that Kubernetes’ VPA in conjunction with HPA does not work well for varying application workloads. As a remedy to this problem, in this paper, we propose DataX AutoScaleUp, which performs efficiently scaling up of CPU resources allocated to microservices in video analytics pipelines while Kubernetes’ HPA is operational. DataX AutoScaleUp uses novel techniques to adjust the allocated computing resources to different microservices in video analytics pipelines to improve overall application performance. Through real-world video analytics applications like Face Recognition and Human Attributes, we show that DataX AutoScaleUp can achieve up to 1.45X improvement in application processing rate when compared to alternative approaches with fixed CPU allocation and dynamic CPU allocation using VPA.

Hierarchical Gaussian Mixture based Task Generative Model for Robust Meta-Learning

Meta-learning enables quick adaptation of machine learning models to new tasks with limited data. While tasks could come from varying distributions in reality, most of the existing meta-learning methods consider both training and testing tasks as from the same uni-component distribution, overlooking two critical needs of a practical solution: (1) the various sources of tasks may compose a multi-component mixture distribution, and (2) novel tasks may come from a distribution that is unseen during meta-training. In this paper, we demonstrate these two challenges can be solved jointly by modeling the density of task instances. We develop a meta training framework underlain by a novel Hierarchical Gaussian Mixture based Task Generative Model (HTGM). HTGM extends the widely used empirical process of sampling tasks to a theoretical model, which learns task embeddings, fits the mixture distribution of tasks, and enables density-based scoring of novel tasks. The framework is agnostic to the encoder and scales well with large backbone networks. The model parameters are learned end-to-end by maximum likelihood estimation via an Expectation-Maximization (EM) algorithm. Extensive experiments on benchmark datasets indicate the effectiveness of our method for both sample classification and novel task detection.

Exploring Question Decomposition for Zero-Shot VQA

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site:

Open-Ended Commonsense Reasoning with Unrestricted Answer Scope

Open-ended Commonsense Reasoning is defined as solving a commonsense question without providing 1) a short list of answer candidates and 2) a pre-defined answer scope. Conventional ways of formulating the commonsense question into a question-answering form or utilizing external knowledge to learn retrieval-based methods are less applicable in the open-ended setting due to an inherent challenge. Without pre-defining an answer scope or a few candidates, open-ended commonsense reasoning entails predicting answers by searching over an extremely large searching space. Moreover, most questions require implicit multi-hop reasoning, which presents even more challenges to our problem. In this work, we leverage pre-trained language models to iteratively retrieve reasoning paths on the external knowledge base, which does not require task-specific supervision. The reasoning paths can help to identify the most precise answer to the commonsense question. We conduct experiments on two commonsense benchmark datasets. Compared to other approaches, our proposed method achieves better performance both quantitatively and qualitatively.

Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering

In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally nontrivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actionsinvolved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alterthe function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.

Weakly-supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping

Weakly-Supervised Concealed Object Segmentation (WSCOS) aims to segment objects well blended with surrounding environments using sparsely-annotated data for model training. It remains a challenging task since (1) it is hard to distinguish concealed objects from the background due to the intrinsic similarity and (2) the sparsely-annotated training data only provide weak supervision for model learning. In this paper, we propose a new WSCOS method to address these two challenges. To tackle the intrinsic similarity challenge, we design a multi-scalefeature grouping module that first groups features at different granularities and then aggregates these grouping results. By grouping similar features together, it encourages segmentation coherence, helping obtain complete segmentation results for both single and multiple-object images. For the weak supervision challenge, we utilize the recently-proposed vision foundation model, “Segment Anything Model (SAM)”, and use the provided sparse annotations as prompts to generate segmentation masks, which are used to train the model. To alleviate the impact oflow-quality segmentation masks, we further propose a series of strategies, including multi-augmentation result ensemble, entropy-based pixel-level weighting, and entropy-based image-level selection. These strategies help provide more reliable supervision to train the segmentation model. We verify the effectiveness of our method on various WSCOS tasks, and experiments demonstrate that our method achieves state-of-the-art performance on these tasks.

DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning

Data augmentation techniques, such as image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter’s built-in assumption that each training image’s contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at

Controllable Safety-Critical Closed-Loop Traffic Simulation via Guided Diffusion

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail traffic scenarios. Traditional methods for generating safety-critical scenarios often fall short in realism and controllability. Furthermore, these techniques generally neglect the dynamics of agent interactions. To mitigate these limitations, we introduce a novel closed-loop simulation framework rooted in guided diffusion models. Our approach yields two distinct advantages: 1) the generation of realistic long-tail scenarios that closely emulate real-world conditions, and 2) enhanced controllability, enabling more comprehensive and interactive evaluations. We achieve this through novel guidance objectives that enhance road progress while lowering collision and off-road rates. We develop a novel approach to simulate safety-critical scenarios through an adversarial term in the denoising process, which allows the adversarial agent to challenge a planner with plausible maneuvers, while all agents in the scene exhibit reactive and realistic behaviors. We validate our framework empirically using the NuScenes dataset, demonstrating improvements in both realism and controllability. These findings affirm that guided diffusion models provide a robust and versatile foundation for safety-critical, interactive traffic simulation, extending their utility across the broader landscape of autonomous driving. For additional resources and demonstrations, visit our project page at

LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance. On the other hand, rule-based planners generalize well, but might fail to handle scenarios that require complex driving maneuvers. To address these limitations, we investigate the possibility of leveraging the common-sense reasoning capabilities of Large Language Models (LLMs) such as GPT4 and Llama2 to generate plans for self-driving vehicles. In particular, we develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner. Guided by commonsense reasoning abilities of LLMs, our approach navigates complex scenarios which existing planners struggle with, produces well-reasoned outputs while also remaining grounded through working alongside the rule-based approach. Through extensive evaluation on the nuPlan benchmark, we achieve state-of-the-art performance, outperforming all existing pure learning- and rule-based methods across most metrics. Our code will be available at