arXiv (pronounced “archive”) is an online platform and preprint repository that serves as a digital archive for scholarly research papers and scientific manuscripts. It allows researchers from various academic disciplines, including physics, mathematics, computer science, biology, and many others, to share their work with the global scientific community before formal peer review and publication in traditional academic journals.


Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. The great promise of LLMs as general task solvers has motivated people to extend their functionality well beyond that of a “chatbot” and to use them as assistants to, or even replacements for, domain experts and tools in specific domains such as healthcare, finance, and education. However, directly applying LLMs to sophisticated problems in specific domains meets many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of domain objectives, and the diversity of constraints (e.g., various social norms, cultural conformity, religious beliefs, and ethical standards in the domain applications). To fill this gap, a rapidly growing body of research and practice on the domain specialization of LLMs has emerged in very recent years, which calls for a comprehensive and systematic review to better summarize and guide this promising area. In this survey paper, we first propose a systematic taxonomy that categorizes LLM domain specialization techniques based on the accessibility to LLMs and summarizes the framework for all the subcategories as well as their relations to and differences from each other. We also present a comprehensive taxonomy of critical application domains that can benefit from specialized LLMs, discussing their practical significance and open challenges. Furthermore, we offer insights into the current research status and future trends in this area.

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Finetuning a large vision-language model (VLM) on a target dataset after large-scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non-natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples, and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at
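
The augmentation step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: images are represented by string ids, and `self_taught_augment` and `toy_teacher` are hypothetical names, with `toy_teacher` standing in for a VLM that has been finetuned to emit question-answer pairs from an image alone.

```python
from typing import Callable, List, Tuple

# One VQA example: (image id, question, answer). Image ids stand in for images.
QATriple = Tuple[str, str, str]


def self_taught_augment(
    labeled: List[QATriple],
    unlabeled_images: List[str],
    teacher: Callable[[str], List[Tuple[str, str]]],
) -> List[QATriple]:
    """Return the labeled set extended with teacher-generated pseudolabels."""
    augmented = list(labeled)  # copy so the original dataset is left untouched
    for image_id in unlabeled_images:
        # The teacher sees only the image and proposes question-answer pairs.
        for question, answer in teacher(image_id):
            augmented.append((image_id, question, answer))
    return augmented


def toy_teacher(image_id: str) -> List[Tuple[str, str]]:
    # Hypothetical placeholder: a real teacher would be a finetuned VLM.
    return [(f"What is shown in {image_id}?", "unknown")]


labeled = [("img_001", "What color is the car?", "red")]
augmented = self_taught_augment(labeled, ["img_002", "img_003"], toy_teacher)
print(len(augmented))  # one labeled example plus one pseudolabel per unlabeled image
```

In the setting the abstract describes, the initial VLM would then be finetuned on `augmented` in place of the original small dataset.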

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

Language-based object detection is a promising direction towards building a natural interface to describe objects in images that goes far beyond plain category names. While recent methods show great progress in that direction, proper evaluation is lacking. With OmniLabel, we propose a novel task definition, dataset, and evaluation metric. The task subsumes standard and open-vocabulary detection as well as referring expressions. With more than 28K unique object descriptions on over 25K images, OmniLabel provides a challenging benchmark with diverse and complex object descriptions in a naturally open-vocabulary setting. Moreover, a key difference from existing benchmarks is that our object descriptions can refer to one, multiple, or even no objects, thus providing negative examples in free-form text. The proposed evaluation handles the large label space and judges performance via a modified average precision metric, which we validate by evaluating strong language-based baselines. OmniLabel indeed provides a challenging test bed for future research on language-based detection. Visit the project website at

Dynamic Prompting: A Unified Framework for Prompt Tuning

It has been demonstrated that prompt tuning is highly effective in efficiently eliciting knowledge from language models (LMs). However, prompt tuning still lags behind fine-tuning, especially when the LMs are small. P-tuning v2 (Liu et al., 2021b) makes it comparable with fine-tuning by adding continuous prompts to every layer of the pre-trained model. However, prepending fixed soft prompts for all instances, regardless of their discrepancy, is questionable. In particular, the inserted prompt position, length, and the representations of prompts for diversified instances across different tasks could all affect prompt tuning performance. To fill this gap, we propose dynamic prompting (DP): the position, length, and prompt representation can all be dynamically optimized with respect to different tasks and instances. We conduct comprehensive experiments on the SuperGLUE benchmark to validate our hypothesis and demonstrate substantial improvements. We also derive a unified framework for supporting our dynamic prompting strategy. In particular, we use a simple learning network and Gumbel-Softmax for learning instance-dependent guidance. Experimental results show that simple instance-level position-aware soft prompts can improve classification accuracy by up to 6 points on average on five datasets, reducing the gap with fine-tuning. Besides, we also demonstrate its usefulness under full-data, few-shot, and multitask regimes. Combining them can further unleash the power of DP, narrowing the gap with fine-tuning.
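
As a rough illustration of how a Gumbel-Softmax sample could drive an instance-dependent prompt position, the sketch below draws a relaxed one-hot vector over candidate split points and places part of the soft prompt before the input and the rest after it. The split-point formulation, the function names `gumbel_softmax` and `insert_prompt`, the temperature, and the logits are all assumptions for illustration; in the abstract's setting the logits would come from the small learning network, which is not modeled here.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(logits: Sequence[float], tau: float, rng: random.Random) -> List[float]:
    """Draw a relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(value - top) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def insert_prompt(prompt: List[str], tokens: List[str], position: int) -> List[str]:
    """Put the first `position` prompt vectors before the input and the rest after."""
    return prompt[:position] + tokens + prompt[position:]


rng = random.Random(0)
prompt = ["p1", "p2", "p3", "p4"]   # stand-ins for soft prompt vectors
tokens = ["x1", "x2"]               # stand-ins for input token embeddings
logits = [0.1, 0.4, 0.2, 0.1, 0.2]  # hypothetical scores for split points 0..4
probs = gumbel_softmax(logits, tau=1.0, rng=rng)
position = probs.index(max(probs))  # hard choice; training would keep the soft sample
mixed = insert_prompt(prompt, tokens, position)
print(position, mixed)
```

A lower temperature `tau` pushes the sample closer to one-hot, while a higher one keeps it smoother, which is the usual trade-off between discreteness and gradient quality.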

Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization

Text summarization has been a crucial problem in natural language processing (NLP) for several decades. It aims to condense lengthy documents into shorter versions while retaining the most critical information. Various methods have been proposed for text summarization, including extractive and abstractive summarization. The emergence of large language models (LLMs) like GPT-3 and ChatGPT has recently created significant interest in using these models for text summarization tasks. Recent studies (Goyal et al., 2022; Zhang et al., 2023) have shown that LLM-generated news summaries are already on par with those written by humans. However, the performance of LLMs in more practical applications like aspect- or query-based summarization is underexplored. To fill this gap, we conducted an evaluation of ChatGPT’s performance on four widely used benchmark datasets, encompassing diverse summaries from Reddit posts, news articles, dialogue meetings, and stories. Our experiments reveal that ChatGPT’s performance is comparable to traditional fine-tuning methods in terms of ROUGE scores. Moreover, we highlight some unique differences between ChatGPT-generated summaries and human references, providing valuable insights into the strengths of ChatGPT for diverse text summarization tasks. Our findings call for new directions in this area, and we plan to conduct further research to systematically examine the characteristics of ChatGPT-generated summaries through extensive human evaluation.
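
For readers unfamiliar with the ROUGE scores mentioned here, the following is a simplified sketch of ROUGE-1, which measures unigram overlap between a candidate summary and a reference. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match those of the official ROUGE toolkit; the function name `rouge_1` and the example sentences are illustrative only.

```python
from collections import Counter
from typing import Dict


def rouge_1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # 5 of 6 unigrams overlap in each direction
```

Because the metric only counts shared words, two summaries with the same meaning but different wording can score poorly, which is one reason the abstract also points to human evaluation.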

RoVaR: Robust Multi-agent Tracking through Dual-layer Diversity in Visual and RF Sensor Fusion

The plethora of sensors in our commodity devices provides a rich substrate for sensor-fused tracking. Yet today’s solutions are unable to deliver robust and highly accurate tracking across multiple agents in practical, everyday environments, a feature central to the future of immersive and collaborative applications. This can be attributed to the limited scope of diversity leveraged by these fusion solutions, preventing them from catering to the multiple dimensions of accuracy, robustness (diverse environmental conditions), and scalability (multiple agents) simultaneously. In this work, we take an important step towards this goal by introducing the notion of dual-layer diversity to the problem of sensor fusion in multi-agent tracking. We demonstrate that the fusion of complementary tracking modalities, passive/relative (e.g., visual odometry) and active/absolute tracking (e.g., infrastructure-assisted RF localization), offers a key first layer of diversity that brings scalability, while the second layer of diversity lies in the methodology of fusion, where we bring together the complementary strengths of algorithmic (for robustness) and data-driven (for accuracy) approaches. RoVaR is an embodiment of such a dual-layer diversity approach that intelligently attends to cross-modal information using algorithmic and data-driven techniques that jointly share the burden of accurately tracking multiple agents in the wild. Extensive evaluations reveal RoVaR’s multi-dimensional benefits in terms of tracking accuracy (median of 15 cm), robustness (in unseen environments), and light weight (runs in real time on mobile platforms such as Jetson Nano/TX2), enabling practical multi-agent immersive applications in everyday environments.

MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation

Test-time adaptation approaches have recently emerged as a practical solution for handling domain shift without access to the source domain data. In this paper, we propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation. We find that directly applying existing methods usually results in performance instability at test time because multi-modal input is not considered jointly. To design a framework that can take full advantage of multi-modality, where each modality provides regularized self-supervisory signals to other modalities, we propose two complementary modules within and across the modalities. First, Intra-modal Pseudo-label Generation (Intra-PG) is introduced to obtain reliable pseudo labels within each modality by aggregating information from two models that are both pre-trained on source data but updated with target data at different paces. Second, Inter-modal Pseudo-label Refinement (Inter-PR) adaptively selects more reliable pseudo labels from different modalities based on a proposed consistency scheme. Experiments demonstrate that our regularized pseudo labels produce stable self-learning signals in numerous multi-modal test-time adaptation scenarios for 3D semantic segmentation. Visit our project website at˜mas/MM TTA
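
One way to picture the two modules for a single point is sketched below. The specific choices are assumptions made for illustration and are not taken from the abstract: the two models in each modality are aggregated by simple averaging, consistency is scored as the negative Kullback-Leibler divergence between them, and the more consistent modality is selected outright. The modality names `image` and `lidar` and all probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def intra_modal_aggregate(fast: Sequence[float], slow: Sequence[float]) -> List[float]:
    """Average the outputs of the fast- and slow-updated models of one modality."""
    return [(f + s) / 2.0 for f, s in zip(fast, slow)]


def inter_modal_select(
    predictions: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> Tuple[int, str]:
    """Return (pseudo label, modality) from the modality whose two models agree most."""
    best_name, best_score, best_probs = "", -math.inf, []
    for name, (fast, slow) in predictions.items():
        consistency = -kl_divergence(fast, slow)  # higher means closer agreement
        if consistency > best_score:
            best_name, best_score = name, consistency
            best_probs = intra_modal_aggregate(fast, slow)
    return best_probs.index(max(best_probs)), best_name


# Class probabilities for one point from each modality's (fast, slow) model pair.
predictions = {
    "image": ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]),
    "lidar": ([0.1, 0.8, 0.1], [0.5, 0.3, 0.2]),
}
label, modality = inter_modal_select(predictions)
print(label, modality)
```

Here the two image models nearly agree while the two lidar models disagree, so the image branch supplies the pseudo label for this point.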

Fast Few-shot Debugging for NLU Test Suites

We study few-shot debugging of transformer-based natural language understanding models, using recently popularized test suites to not just diagnose but correct a problem. Given a few debugging examples of a certain phenomenon, and a held-out test set of the same phenomenon, we aim to maximize accuracy on the phenomenon at a minimal cost of accuracy on the original test set. We examine several methods that are faster than full-epoch retraining. We introduce a new fast method, which samples a few in-danger examples from the original training set. Compared to fast methods using parameter distance constraints or Kullback-Leibler divergence, we achieve superior original accuracy for comparable debugging accuracy.

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks: open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a new state of the art for open-vocabulary object detection. Our code is available at PLM.

Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models available at this http URL.