Foundational Vision-LLM for AI Linkage and Orchestration

Publication Date: 7/3/2024

Event: NEC Technical Journal, Special Issue on Revolutionizing Business Practices with Generative AI

Reference: Vol. 17, No. 2, pp. 96-101, 2024

Authors: Zaid Khan, NEC Laboratories America, Inc., Northeastern University; Vijay Kumar B G, NEC Laboratories America, Inc.; Samuel Schulter, NEC Laboratories America, Inc.; Manmohan Chandraker, NEC Laboratories America, Inc., UC San Diego

Abstract: We propose a vision-LLM framework for automating the development and deployment of computer vision solutions for pre-defined or custom-defined tasks. A foundational layer is proposed with a code-LLM AI orchestrator, self-trained with reinforcement learning, that creates Python code based on its understanding of a novel user-defined task together with the APIs, documentation and usage notes of existing task-specific AI models. Zero-shot abilities in specific domains are obtained through foundational vision-language models trained at low compute cost by leveraging existing computer vision models and datasets. An engine layer is proposed which comprises several task-specific vision-language engines that can be used compositionally. An application-specific layer is proposed to improve performance in customer-specific scenarios, using novel LLM-guided data augmentation and question decomposition in addition to standard fine-tuning tools. We demonstrate a range of applications including visual AI assistance, visual conversation, law enforcement, mobility, medical image reasoning and remote sensing.
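
Illustrative sketch (not from the paper): the orchestration idea in the abstract, where a code-LLM reads a task description plus the documentation of existing engines and emits Python that composes them, might look roughly like the following. Every name here (Engine, REGISTRY, build_prompt, fake_code_llm, orchestrate, and the two stub engines) is hypothetical, and the code-LLM is replaced by a function that returns a fixed program so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Engine:
    """A task-specific model exposed to the orchestrator (hypothetical)."""
    name: str
    doc: str                    # usage notes the code-LLM would read
    run: Callable[..., object]  # the callable the generated code invokes


def detect_objects(image):
    # Stub detector: a real engine would run a vision model on `image`.
    return [
        {"label": "car", "box": (10, 20, 110, 90)},
        {"label": "car", "box": (150, 30, 260, 95)},
        {"label": "person", "box": (300, 10, 340, 120)},
    ]


def count_label(detections, label):
    # Count detections whose label matches `label`.
    return sum(1 for d in detections if d["label"] == label)


REGISTRY: Dict[str, Engine] = {
    "detect_objects": Engine(
        "detect_objects", "detect_objects(image) -> list of {label, box}",
        detect_objects),
    "count_label": Engine(
        "count_label", "count_label(detections, label) -> int", count_label),
}


def build_prompt(task: str, registry: Dict[str, Engine]) -> str:
    # Combine the user task with the documentation of each available engine.
    lines = [f"Task: {task}", "Available engines:"]
    lines += [f"- {e.name}: {e.doc}" for e in registry.values()]
    lines.append("Write a Python function solve(image, engines).")
    return "\n".join(lines)


def fake_code_llm(prompt: str) -> str:
    # Assumption: stands in for the code-LLM; always returns the same program.
    return (
        "def solve(image, engines):\n"
        "    dets = engines['detect_objects'].run(image)\n"
        "    return engines['count_label'].run(dets, 'car')\n"
    )


def orchestrate(task, image, registry, code_llm):
    # Ask the (stand-in) code-LLM for a program, load it, and execute it.
    source = code_llm(build_prompt(task, registry))
    namespace = {}
    exec(source, namespace)  # generated code should be sandboxed in practice
    return namespace["solve"](image, registry)


result = orchestrate("How many cars are in the image?", None,
                     REGISTRY, fake_code_llm)
print(result)
```

In this toy setup the generated program chains the detector and the counter. Swapping fake_code_llm for a real model call would leave orchestrate unchanged, though running model-generated code with exec would then call for sandboxing.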

Publication Link: https://www.nec.com/en/global/techrep/journal/g23/n02/g2302pa.html#anc-anchor-01