Agentic LLMs for AI Orchestration

We develop an agentic LLM that carries out complex workflows by orchestrating a combination of computer vision, logic, and compute modules. Given a natural language task specification, the LLM generates a plan to accomplish the task with the available tools. The plan takes the form of a synthesized Python program that calls those tools, which can be anything that can be invoked programmatically. The planner can quickly adapt to new tools based on their available documentation and code. Reinforced self-training with weak supervision makes training efficient, and human feedback can optionally be included. On benchmark visual reasoning tasks, our agentic LLM outperforms competing models despite using fewer parameters.
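As a rough illustration of the plan-as-program idea, the sketch below shows how a synthesized Python plan could be run against a registry of tools. Everything in it is an assumption made for illustration: the tool names (`detect_objects`, `count`), the `execute` entry point, the toy image format, and the hard-coded `PLAN` string, which stands in for what an LLM planner would generate. It is not the project's actual implementation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real vision, logic, and compute modules.
def detect_objects(image: dict, label: str) -> List[dict]:
    """Return the objects in `image` whose label equals `label`."""
    return [obj for obj in image["objects"] if obj["label"] == label]

def count(items: list) -> int:
    """Return the number of items."""
    return len(items)

# Registry of tools the planner is allowed to call.
TOOLS: Dict[str, Callable] = {"detect_objects": detect_objects, "count": count}

def tool_documentation(tools: Dict[str, Callable]) -> str:
    """Render tool names and docstrings, e.g. for inclusion in a planner prompt."""
    return "\n".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())

# A program a planner might emit for "How many dogs are in the image?".
# Hard-coded here; in the described system an LLM would synthesize it.
PLAN = '''
def execute(image):
    dogs = detect_objects(image, "dog")
    return count(dogs)
'''

def run_plan(plan_source: str, tools: Dict[str, Callable], image: dict):
    """Define the plan in a namespace that exposes the tools, then call it."""
    namespace = dict(tools)
    exec(plan_source, namespace)  # generated code should be sandboxed in practice
    return namespace["execute"](image)

# Toy "image": a dictionary of pre-labeled objects instead of pixels.
image = {"objects": [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]}
answer = run_plan(PLAN, TOOLS, image)
print(answer)
```

Keeping the tools in a plain dictionary means a new tool can be exposed to the planner by registering a function with a docstring, which loosely mirrors the idea of adapting to new tools from their documentation.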

Team Members: Vijay Kumar BG

Keyword Tag: AIOps

Agentic LLMs for AI Orchestration (AI-Plex)

Featured Publications

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

June 17, 2024/CVPR 2024

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive

Exploring Question Decomposition for Zero-Shot VQA

December 11, 2023/NeurIPS 2023

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

June 18, 2023/CVPR 2023

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose

Single-Stream Multi-level Alignment for Vision-Language Pretraining

October 24, 2022/ECCV 2022

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable