Safety-critical applications must account for all scenarios, including those that pose high risks yet are rarely observed in everyday operation. Applications like autonomous driving carry a high development cost: they require extensive data collection, data curation, model training, and verification, which are prohibitively expensive and pose barriers to new entrants in the space. Our AI DevOps pipeline builds a high-fidelity digital twin of sensor data that enables self-improvement of deployed models. We leverage our foundation vision-language models to automatically identify issues in the currently deployed AI, pseudo-label or simulate training data, update models with continual learning, and verify the result with LLM-based evaluation over diverse scenarios.
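The sketch below illustrates the shape of such a closed loop (issue discovery, pseudo-labeling, continual update, LLM-based verification). It is a minimal, hypothetical Python example: all function names, data structures, and thresholds are placeholders for illustration, not the project's actual APIs or models.

```python
# Hypothetical sketch of a self-improvement loop for a deployed perception model.
# Every name and heuristic here is a stand-in, not the real pipeline.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    sensor_data: bytes                 # placeholder for camera/LiDAR payload
    vlm_issue_score: float = 0.0       # how likely the deployed model fails here
    pseudo_label: Optional[str] = None


def score_issues_with_vlm(frames: List[Frame]) -> List[Frame]:
    """Hypothetical: a vision-language model flags frames where the deployed
    AI is likely failing (rare or high-risk scenarios)."""
    for f in frames:
        f.vlm_issue_score = 0.9 if b"rare_event" in f.sensor_data else 0.1
    return frames


def pseudo_label(frames: List[Frame]) -> List[Frame]:
    """Hypothetical: auto-label flagged frames (or generate simulated twins)."""
    for f in frames:
        f.pseudo_label = "pedestrian_crossing"   # stand-in for a real label
    return frames


def continual_update(model_state: dict, labeled: List[Frame]) -> dict:
    """Hypothetical: fine-tune the deployed model on new data while
    retaining performance on previously seen scenarios."""
    model_state["num_updates"] += 1
    model_state["train_set_size"] += len(labeled)
    return model_state


def llm_verify(model_state: dict, scenarios: List[str]) -> bool:
    """Hypothetical: LLM-driven checks that the updated model behaves safely
    across a diverse scenario suite before redeployment."""
    return model_state["num_updates"] > 0 and len(scenarios) > 0


def self_improvement_loop(frames: List[Frame], model_state: dict) -> dict:
    flagged = [f for f in score_issues_with_vlm(frames) if f.vlm_issue_score > 0.5]
    labeled = pseudo_label(flagged)
    model_state = continual_update(model_state, labeled)
    if not llm_verify(model_state, ["night_rain", "occluded_pedestrian"]):
        raise RuntimeError("Verification failed; keep the previous model deployed.")
    return model_state


if __name__ == "__main__":
    fleet_frames = [Frame(b"rare_event:jaywalker"), Frame(b"highway_cruise")]
    state = {"num_updates": 0, "train_set_size": 0}
    print(self_improvement_loop(fleet_frames, state))
```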
Team Members: Sparsh Garg, Mingfu Liang (Intern), Samuel Schulter, Jong-Chyi Su, Bingbing Zhuang, Ziyu Jiang