A foundational model is a large, pre-trained vision-language model that serves as the core of the proposed vision-LLM framework. Such models are trained on diverse datasets to provide general capabilities for understanding and generating visual and textual information, enabling zero-shot performance: they can tackle new tasks without additional training. The foundational model underpins the framework’s ability to automate the development and deployment of computer vision solutions, supporting the generation of Python code and integration with existing task-specific AI models.
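As a loose illustration of the zero-shot call pattern described above, the sketch below wraps a generic pre-trained vision-language model behind a minimal interface. The `FoundationVLM` class, the model name, and the method signature are hypothetical stand-ins for illustration, not part of the proposed framework.

```python
# Minimal sketch (assumptions): a thin wrapper around a pre-trained
# vision-language model, called zero-shot with no task-specific training.
from dataclasses import dataclass


@dataclass
class FoundationVLM:
    """Hypothetical wrapper around a pre-trained vision-language model."""
    model_name: str

    def answer(self, image_path: str, question: str) -> str:
        # A real system would run the underlying model here; this stub only
        # illustrates the zero-shot call pattern (task specified in the prompt).
        return f"[{self.model_name}] answer to '{question}' for {image_path}"


if __name__ == "__main__":
    vlm = FoundationVLM(model_name="generic-vlm")
    # Zero-shot: the task is given entirely in the prompt, no fine-tuning.
    print(vlm.answer("street_scene.jpg", "How many pedestrians are crossing?"))
```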


Foundational Vision-LLM for AI Linkage and Orchestration

We propose a vision-LLM framework for automating the development and deployment of computer vision solutions for pre-defined or custom-defined tasks. A foundational layer is proposed with a code-LLM AI orchestrator, self-trained with reinforcement learning, that creates Python code based on its understanding of a novel user-defined task together with the APIs, documentation, and usage notes of existing task-specific AI models. Zero-shot abilities in specific domains are obtained through foundational vision-language models trained at low compute expense by leveraging existing computer vision models and datasets. An engine layer is proposed which comprises several task-specific vision-language engines that can be compositionally utilized. An application-specific layer is proposed to improve performance in customer-specific scenarios, using novel LLM-guided data augmentation and question decomposition, in addition to standard fine-tuning tools. We demonstrate a range of applications including visual AI assistance, visual conversation, law enforcement, mobility, medical image reasoning and remote sensing.
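To make the orchestration idea concrete, here is a minimal, hypothetical sketch of the kind of Python glue code a code-LLM orchestrator might emit: it decomposes a user question and composes two assumed task-specific engines (a detector and a VQA engine). The engine names, their signatures, and the `decompose` helper are illustrative assumptions, not the framework's actual APIs.

```python
# Sketch (assumptions): orchestrator-style glue code that decomposes a
# question and chains two task-specific vision-language engines.
from typing import Callable, Dict, List


def decompose(question: str) -> List[str]:
    """Stand-in for LLM-guided question decomposition into simpler sub-queries."""
    return [
        f"What objects relevant to '{question}' are visible?",
        f"Given those objects, {question}",
    ]


def run_pipeline(image: str,
                 question: str,
                 engines: Dict[str, Callable[..., str]]) -> str:
    """Compose a detection engine and a VQA engine to answer a custom question."""
    sub_questions = decompose(question)
    detections = engines["detector"](image, sub_questions[0])
    return engines["vqa"](image, sub_questions[1], context=detections)


if __name__ == "__main__":
    # Dummy engines standing in for real task-specific vision-language models.
    engines = {
        "detector": lambda img, q: f"objects found in {img}",
        "vqa": lambda img, q, context="": f"answer to '{q}' using {context}",
    }
    print(run_pipeline("traffic_cam.jpg", "is the intersection congested?", engines))
```

The point of the sketch is the composition pattern: each engine exposes a documented callable interface, so orchestrator-generated code only needs to select engines and wire their inputs and outputs.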