Md Yusuf Sarwar Uddin works at the University of Missouri-Kansas City.

Posts

ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System

Retrieval-augmented generation (RAG) is used in natural language processing (NLP) to provide query-relevant information from enterprise documents to large language models (LLMs). Such enterprise context enables the LLMs to generate more informed and accurate responses. When enterprise data consists primarily of videos, AI models such as vision language models (VLMs) are needed to convert the information in those videos into text. While essential, this conversion is a bottleneck, especially for a large corpus of videos, and it delays the timely use of enterprise videos to generate useful responses. We propose ViTA, a novel method that leverages two unique characteristics of VLMs to expedite the conversion process. First, as VLMs output more text tokens, they incur higher latency. Second, large (heavyweight) VLMs can extract intricate details from images and videos, but they incur much higher latency per output token than smaller (lightweight) VLMs, which may miss details. To expedite conversion, ViTA first employs a lightweight VLM to quickly capture the gist or overview of an image or a video clip, and then directs a heavyweight VLM (through prompt engineering) to extract additional details using only a few (a preset number of) output tokens. Our experimental results show that ViTA expedites the conversion by as much as 43% without compromising the accuracy of responses, compared to a baseline system that uses only a heavyweight VLM.
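A minimal sketch of the two-stage conversion idea described above, assuming hypothetical lightweight_vlm and heavyweight_vlm wrappers around VLM inference; the prompt wording and the DETAIL_TOKEN_BUDGET value are illustrative assumptions, not the paper's actual settings.

```python
# Sketch of a ViTA-style two-stage conversion (illustrative, not the paper's code).

DETAIL_TOKEN_BUDGET = 64  # assumed preset cap on heavyweight output tokens

def convert_clip_to_text(frames, lightweight_vlm, heavyweight_vlm):
    # Stage 1: fast, cheap gist from the lightweight VLM.
    gist = lightweight_vlm.generate(
        frames,
        prompt="Briefly describe what is happening in this clip.",
    )

    # Stage 2: the heavyweight VLM is steered by the gist and asked only for
    # additional details, with a small output-token budget, so its higher
    # per-token latency is paid for fewer tokens.
    details = heavyweight_vlm.generate(
        frames,
        prompt=(
            f"A quick summary of this clip is: '{gist}'. "
            "List only additional fine-grained details not covered above."
        ),
        max_new_tokens=DETAIL_TOKEN_BUDGET,
    )

    # The combined text is what would be indexed for RAG retrieval.
    return f"{gist}\n{details}"
```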

FactionFormer: Context-Driven Collaborative Vision Transformer Models for Edge Intelligence

Edge Intelligence has received attention in recent times for its potential to improve responsiveness, reduce the cost of data transmission, enhance security and privacy, and enable autonomous decisions by edge devices. However, edge devices lack the power and compute resources necessary to execute most AI models. In this paper, we present FactionFormer, a novel method to deploy resource-intensive deep learning models, such as vision transformers (ViTs), on resource-constrained edge devices. Our method is based on a key observation: edge devices are often deployed in settings where they encounter only a subset of the classes that the resource-intensive AI model is trained to classify, and this subset changes across deployments. Therefore, we automatically identify this subset as a faction, devise on-the-fly a bespoke resource-efficient ViT called a modelette for the faction, and set up an efficient processing pipeline consisting of the modelette on the device, a wireless network such as 5G, and the resource-intensive ViT model on an edge server, all of which work collaboratively to perform inference. For several ViT models pre-trained on benchmark datasets, FactionFormer’s modelettes are up to 4× smaller than the corresponding baseline models in terms of the number of parameters, and they can infer up to 2.5× faster than a baseline setup in which every input is processed by the resource-intensive ViT on the edge server. Our work is the first of its kind to propose a device-edge collaborative inference framework in which bespoke deep learning models for the device are automatically devised on-the-fly for the most frequently encountered subset of classes.
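A minimal sketch of the device-edge collaborative inference loop, assuming a hypothetical modelette (a small ViT over the faction classes) and an offload_to_edge callable (an RPC to the full ViT on the edge server); the confidence-threshold fallback rule is an illustrative assumption, as the abstract does not specify the exact collaboration policy.

```python
import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.8  # assumed cutoff for trusting the on-device prediction

@torch.no_grad()
def classify(image, modelette, faction_classes, offload_to_edge):
    # Try the bespoke on-device model (the modelette) first.
    probs = F.softmax(modelette(image.unsqueeze(0)), dim=-1).squeeze(0)
    conf, local_idx = probs.max(dim=-1)

    if conf.item() >= CONF_THRESHOLD:
        # Confident prediction within the faction: no network round trip.
        return faction_classes[local_idx.item()]

    # Otherwise fall back to the resource-intensive ViT on the edge server
    # (e.g., over a 5G link), which covers the full label space.
    return offload_to_edge(image)
```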

Chimera: Context-Aware Splittable Deep Multitasking Models for Edge Intelligence

The design of multitasking deep learning models has mostly focused on improving the accuracy of the constituent tasks, but the challenge of efficiently deploying such models in a device-edge collaborative setup (common in 5G deployments) has not been investigated. Toward this end, in this paper we propose an approach called Chimera for the training (done offline) and deployment (done online) of multitasking deep learning models that are splittable across the device and the edge. In the offline phase, we train our multitasking setup such that features from a pre-trained model for one of the tasks (the Primary task) are extracted, and task-specific sub-models are trained to generate the other (Secondary) tasks’ outputs through a knowledge-distillation-like training strategy that mimics the outputs of pre-trained models for those tasks. The task-specific sub-models are designed to be significantly more lightweight than the original pre-trained models for the Secondary tasks. Once the sub-models are trained, during deployment and for a given deployment context, characterized by its configurations, we search for the deployment strategy that is optimal in terms of both model performance and cost: we find one or more suitable layers at which to split the generated multitasking model, so that the inference workload is distributed between the device and the edge server and inference is performed collaboratively. Extensive experiments on benchmark computer vision tasks demonstrate that Chimera generates splittable multitasking models that are at least ~3× more parameter-efficient than existing such models, and end-to-end device-edge collaborative inference becomes ~1.35× faster with our context-aware splitting decisions.
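A minimal sketch of a context-aware split-point search of the kind described above, assuming hypothetical per-layer latency and feature-size profiles for a given deployment context; the cost model, field names, and units are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    device_ms: float      # time to run this layer on the device
    edge_ms: float        # time to run this layer on the edge server
    feat_kb: float        # size of this layer's output features (KB)

def pick_split(profiles, uplink_kb_per_s, input_kb):
    """Return the layer index after which to hand off to the edge server."""
    best_split, best_latency = 0, float("inf")
    for split in range(len(profiles) + 1):
        device_time = sum(p.device_ms for p in profiles[:split])
        edge_time = sum(p.edge_ms for p in profiles[split:])

        # Whatever crosses the split (raw input or intermediate features)
        # must traverse the wireless link to the edge server.
        if split == len(profiles):
            payload_kb = 0.0  # fully on-device: nothing to transmit
        elif split == 0:
            payload_kb = input_kb  # fully on-edge: send the raw input
        else:
            payload_kb = profiles[split - 1].feat_kb
        transfer_time = payload_kb / uplink_kb_per_s * 1000.0  # in ms

        total = device_time + transfer_time + edge_time
        if total < best_latency:
            best_split, best_latency = split, total
    return best_split, best_latency
```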