Integrated Systems

Read publications from the world-class researchers in our Integrated Systems department, which innovates, designs, and prototypes high-performance intelligent distributed systems, applications, and services on complex, large-scale communication networks such as 5G and beyond. We develop next-generation wireless technologies for sensing the world, localizing critical assets, and improving the capacity, coverage, and scalability of these networks.

Posts

Optimal Single-User Interactive Beam Alignment with Feedback Delay

Communication in the millimeter wave (mmWave) band relies on narrow beams due to directionality, high path loss, and shadowing. One can use beam alignment (BA) techniques to find and adjust the direction of these narrow beams. In this paper, BA at the base station (BS) is considered, where the BS sends a set of BA packets to scan different angular regions while the user listens to the channel and sends feedback to the BS for each received packet. It is assumed that the packets and feedback received at the user and BS, respectively, can be correctly decoded. Motivated by practical constraints such as propagation delay, a feedback delay for each BA packet is considered. At the end of the BA, the BS allocates a narrow beam to the user, including its angle of departure, for data transmission, and the objective is to maximize the resulting expected beamforming gain. A general framework for studying this problem is proposed, based on which a lower bound on the optimal performance as well as an optimality-achieving scheme are obtained. Simulation results reveal significant performance improvements over state-of-the-art BA methods in the presence of feedback delay.
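As a rough illustration of interactive BA, the sketch below halves the candidate angular region with each packet, using the user's ACK/NACK feedback. It is not the scheme from the paper: it assumes a noiseless channel, an ideal sectored beam model, and zero feedback delay, whereas the paper addresses the case where feedback arrives late. The function and variable names are hypothetical.

```python
import math

def bisection_beam_alignment(user_angle, num_packets):
    """Narrow an angular interval around the user, one ACK/NACK per packet.

    Assumption: feedback for each packet arrives before the next packet
    is sent (zero feedback delay), and decoding is error-free.
    """
    low, high = 0.0, 2 * math.pi
    for _ in range(num_packets):
        mid = (low + high) / 2
        # The BS scans [low, mid); the user ACKs if it hears the packet.
        ack = low <= user_angle < mid
        if ack:
            high = mid
        else:
            low = mid
    return low, high

low, high = bisection_beam_alignment(user_angle=1.0, num_packets=4)
beam_width = high - low
# Gain relative to an omnidirectional beam under the ideal sectored model.
relative_gain = 2 * math.pi / beam_width
```

With a feedback delay, the BS would have to commit to several scanning regions before learning the outcome of earlier ones, which this sketch does not model.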

RAG-check: Evaluating Multimodal Retrieval Augmented Generation Performance

Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, reducing hallucinations. However, RAG, particularly multi-modal RAG, can introduce new hallucination sources: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) as raw context from the database, and (ii) retrieved images are processed into text-based context via vision-language models (VLMs) or directly used by multi-modal language models (MLLMs) like GPT-4o, which may hallucinate. To address this, we propose a novel framework to evaluate the reliability of multi-modal RAG using two performance measures: (i) the relevancy score (RS), assessing the relevance of retrieved entries to the query, and (ii) the correctness score (CS), evaluating the accuracy of the generated response. We train RS and CS models using a ChatGPT-derived database and human evaluator samples. Results show that both models achieve ~88% accuracy on test data. Additionally, we construct a 5000-sample human-annotated database evaluating the relevancy of retrieved pieces and the correctness of response statements. Our RS model aligns with human preferences 20% more often than CLIP in retrieval, and our CS model matches human preferences ~91% of the time. Finally, we assess various RAG systems’ selection and generation performances using RS and CS.

Re-ranking the Context for Multimodal Retrieval Augmented Generation

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate a response within a context with improved accuracy and reduced hallucinations. However, multi-modal RAG systems face unique challenges: (i) the retrieval process may select entries (e.g., images, documents) that are irrelevant to the user query, and (ii) vision-language models or multi-modal language models like GPT-4o may hallucinate when processing these entries to generate RAG output. In this paper, we aim to address the first challenge, i.e., improving the selection of relevant context from the knowledge base in the retrieval phase of multi-modal RAG. Specifically, we leverage the relevancy score (RS) measure, designed in our previous work for evaluating RAG performance, to select more relevant entries in the retrieval process. Retrieval based on embeddings (say, CLIP-based embeddings) and cosine similarity usually performs poorly, particularly for multi-modal data. We show that by using a more advanced relevancy measure, one can enhance the retrieval process by selecting more relevant pieces from the knowledge base and eliminating the irrelevant pieces from the context by adaptively selecting up to k entries instead of a fixed number of entries. Our evaluation using the COCO dataset demonstrates significant enhancement in selecting relevant context and in the accuracy of the generated response.
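The idea of selecting up to k entries rather than a fixed number can be sketched as follows. The abstract does not say how the adaptive cut-off is chosen, so the score threshold here is an assumption, as are the function name, the entry identifiers, and the scores.

```python
def select_up_to_k(scored_entries, k, threshold):
    """Return at most k entry ids whose relevancy score is >= threshold.

    `scored_entries` is a list of (entry_id, score) pairs; a higher score
    means the entry is judged more relevant to the query.
    """
    relevant = [(entry, score) for entry, score in scored_entries if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [entry for entry, _ in relevant[:k]]

# Hypothetical relevancy scores for four retrieved candidates.
candidates = [("img_a", 0.91), ("img_b", 0.15), ("doc_c", 0.78), ("img_d", 0.40)]
context = select_up_to_k(candidates, k=3, threshold=0.5)
```

Here only two candidates clear the threshold, so the context holds two entries instead of being padded to three with a low-scoring one.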

DiCE-M: Distributed Code Generation and Execution for Marine Applications – An Edge-Cloud Approach

Edge computing has emerged as a transformative technology that reduces application latency, improves cost efficiency, enhances security, and enables large-scale deployment of applications across various domains. In environmental monitoring, systems such as MegaSense [49] use low-cost sensors to gather and process real-time air quality data through edge-cloud collaboration, highlighting the critical role of edge computing in enabling scalable, efficient solutions. Similarly, marine science increasingly requires real-time processing and analysis of marine data from remote, resource-constrained environments. In this paper, we extend the power of edge computing by integrating it with Generative Artificial Intelligence (GenAI), specifically large language models (LLMs), to address challenges in marine science applications. We propose DiCE-M (Distributed Code generation and Execution for Marine applications), a robust system that uses an LLM to generate distributed code for marine applications and then utilizes a runtime to efficiently execute it on an edge+cloud computing infrastructure. Specifically, DiCE-M leverages edge computing to execute lightweight AI models locally on unmanned surface vehicles (USVs) while offloading complex tasks to the cloud, thus balancing computational load and enabling real-time monitoring in marine environments. We use marine litter identification as an example application to demonstrate the utility of DiCE-M. Our results show that DiCE-M reduces latency by more than 2X when marine litter is not detected and cuts cloud computing costs by more than half compared to traditional cloud-based approaches. By selectively cropping and transmitting relevant image portions, DiCE-M further improves bandwidth efficiency, making it a reliable and cost-effective solution for deploying AI-driven applications on resource-constrained USVs in dynamic marine environments.
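The edge-first pattern described above (detect locally, offload only cropped regions when something is found) might look roughly like the sketch below. It is an illustration only, not DiCE-M's generated code or runtime; `process_frame`, the toy detector, and the toy cloud analyzer are hypothetical stand-ins.

```python
def process_frame(frame, edge_detector, cloud_analyzer):
    """Run a lightweight detector on the edge; offload crops only on detection.

    `edge_detector(frame)` returns a list of (top, left, bottom, right) boxes.
    `cloud_analyzer(crop)` stands in for a heavier model behind a network call.
    """
    boxes = edge_detector(frame)
    if not boxes:
        # Nothing detected: finish locally, with no upload and no cloud cost.
        return []
    results = []
    for top, left, bottom, right in boxes:
        # Send only the relevant region instead of the full frame.
        crop = [row[left:right] for row in frame[top:bottom]]
        results.append(cloud_analyzer(crop))
    return results

# Toy stand-ins: a frame is a 2D list; nonzero pixels count as "litter".
def toy_edge_detector(frame):
    hits = [(r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v]
    if not hits:
        return []
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return [(min(rows), min(cols), max(rows) + 1, max(cols) + 1)]

def toy_cloud_analyzer(crop):
    return {"pixels_received": sum(len(row) for row in crop)}

empty = [[0] * 4 for _ in range(4)]
littered = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
no_litter_result = process_frame(empty, toy_edge_detector, toy_cloud_analyzer)
litter_result = process_frame(littered, toy_edge_detector, toy_cloud_analyzer)
```

In the toy run, the empty frame never reaches the cloud, and the littered frame sends a 2x2 crop rather than the full 4x4 frame.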

DiCE: Distributed Code generation and Execution

Generative artificial intelligence (GenAI), specifically Large Language Models (LLMs), has shown tremendous potential in automating several tasks and improving human productivity. Recent works have shown LLMs to be quite useful in writing and summarizing text (articles, blogs, poems, stories, songs, etc.), answering questions, brainstorming ideas, and even writing code. Several LLMs have emerged specifically targeting code generation. Given a prompt, these LLMs can generate code in any desired programming language. Many tools like ChatGPT, CoPilot, CodeWhisperer, Cody, DeepSeek Coder, StarCoder, etc. are now routinely being used by software developers. However, most of the prior work in automatic code generation using LLMs is focused on obtaining “correct” and working code that mainly runs on a single computer (serial code). In this paper, we take this to the next level, where LLMs are leveraged to generate code for execution on a distributed infrastructure. We propose a novel system called DiCE, which takes serial code as input, automatically generates a distributed version of the code, and efficiently executes it on a distributed setup. DiCE consists of two main components: (a) an LLM-based tool (Synthia) to understand dependencies in serial code and automatically generate a distributed version of the code using a specialized programming model and semantics, and (b) a runtime (Hermod) to understand the semantics in the distributed code and realize efficient execution on a cluster of machines (distributed infrastructure). DiCE currently focuses on visual programs synthesized by tools like ViperGPT [1] and VisReP [2] (serial code), automatically identifies higher-level task parallelism opportunities (e.g., parallel object detection), transforms the code to exploit the parallelism, and finally efficiently executes it on a cluster of machines.
Through our experiments using 100 examples from the GQA dataset [3], we show that the serial codes generated by ViperGPT are successfully transformed into distributed codes which are then efficiently executed on a cluster of machines by DiCE. We note that DiCE correctly identifies opportunities for parallelism and distributes tasks on separate GPUs within the cluster. We observe an average speed-up of 2X, 2.95X, and 3.7X, and an average efficiency of 1, 0.74 and 0.48 for a cluster of 2 nodes, 4 nodes, and 8 nodes, respectively.
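For readers unfamiliar with the metric, parallel efficiency is conventionally the speed-up divided by the number of nodes. The short sketch below applies that standard definition to the average speed-ups quoted above; the function name is hypothetical, and the text does not state exactly how the reported efficiencies were averaged.

```python
def parallel_efficiency(speedup, num_nodes):
    """Standard definition: speed-up divided by the number of nodes."""
    return speedup / num_nodes

# nodes -> average speed-up, as quoted in the text above
reported_speedups = {2: 2.0, 4: 2.95, 8: 3.7}
for nodes, speedup in reported_speedups.items():
    print(f"{nodes} nodes: efficiency {parallel_efficiency(speedup, nodes):.2f}")
```

Computed from the rounded average speed-ups, the 8-node value comes out near 0.46 rather than the reported 0.48. The difference may stem from rounding or from averaging per-example efficiencies, which the text does not specify.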

Transformer-Aided Semantic Communications

The transformer structure employed in large language models (LLMs), a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for its ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication, where proper encoding of the relevant data is critical, especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.
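One simple way to turn per-patch attention scores into a transmission mask is to keep the top-scoring fraction of patches. The sketch below shows that idea only; the paper's actual masking and encoding are not described in the abstract, and the function name, scores, and keep fraction are assumptions.

```python
def attention_mask(patch_scores, keep_fraction):
    """Return a 0/1 mask that keeps the highest-scoring patches.

    `patch_scores` holds one attention score per image patch;
    `keep_fraction` is the share of patches to transmit (0 < f <= 1).
    """
    num_keep = max(1, round(len(patch_scores) * keep_fraction))
    ranked = sorted(range(len(patch_scores)),
                    key=lambda i: patch_scores[i], reverse=True)
    keep = set(ranked[:num_keep])
    return [1 if i in keep else 0 for i in range(len(patch_scores))]

# Hypothetical attention scores for an image split into 8 patches.
scores = [0.02, 0.30, 0.05, 0.25, 0.01, 0.20, 0.10, 0.07]
mask = attention_mask(scores, keep_fraction=0.5)
transmitted = [i for i, m in enumerate(mask) if m]
```

With a keep fraction of 0.5, only the four patches with the largest scores are marked for transmission, which mirrors sending a fraction of the encoded data at a chosen compression rate.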

iRAG: Advancing RAG for Videos with an Incremental Approach

Retrieval-augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Using RAG to understand videos is appealing, but there are two critical limitations. First, one-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times. Second, not all information in the rich video data is typically captured in the text descriptions. Since user queries are not known a priori, developing a system for video-to-text conversion and interactive querying of video data is challenging. To address these limitations, we propose an incremental RAG system called iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of videos. Unlike traditional RAG, iRAG quickly indexes large repositories of videos, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the videos to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long video-to-text conversion times and overcomes information loss due to conversion of video to text by doing on-demand, query-specific extraction of details in video data. This ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of a large corpus of videos. Experimental results on real-world datasets demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that the latency and quality of responses to interactive user queries are comparable to those of a traditional RAG system where all video data is converted to text upfront before any user querying.

TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to 4× while maintaining information accuracy.
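The sequential flow described above (feed the previous description into the next camera's prompt, and skip the VLM when a view adds little) can be sketched as follows. The abstract does not specify the similarity detector, so Jaccard similarity over detected object labels is an assumption here, as are the threshold, the function names, and the toy VLM; varying token limits are not modeled.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of object labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def describe_cameras(camera_objects, vlm, similarity_threshold=0.8):
    """Describe each camera view in turn, skipping near-duplicate views.

    `camera_objects` is a list of object-label lists, one per camera.
    `vlm(objects, prior)` stands in for a VLM call that receives the
    previous description as part of its prompt.
    """
    descriptions = []
    last_objects, prior = None, ""
    for objects in camera_objects:
        if (last_objects is not None
                and jaccard(objects, last_objects) >= similarity_threshold):
            descriptions.append(prior)  # reuse the description; no VLM call
            continue
        prior = vlm(objects, prior)
        descriptions.append(prior)
        last_objects = objects
    return descriptions

calls = []  # records each (simulated) VLM invocation

def toy_vlm(objects, prior):
    calls.append(objects)
    return "sees " + ", ".join(sorted(set(objects)))

views = [["car", "bus", "person"], ["person", "car", "bus"], ["car", "bicycle"]]
texts = describe_cameras(views, toy_vlm)
```

In this toy run the second camera sees the same objects as the first, so its VLM call is skipped and only two invocations are made for three views.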

Knowledge-enhanced Prompt Learning for Open-domain Commonsense Reasoning

Neural language models for commonsense reasoning often formulate the problem as a QA task and make predictions based on learned representations of language after fine-tuning. However, without any fine-tuning data or pre-defined answer candidates, can neural language models still answer commonsense reasoning questions relying only on external knowledge? In this work, we investigate a unique yet challenging problem, open-domain commonsense reasoning, which aims to answer questions without any answer candidates or fine-tuning examples. The method, proposed by a team comprising NECLA (NEC Laboratories America) and NEC Digital Business Platform Unit, leverages neural language models to iteratively retrieve reasoning chains on the external knowledge base and does not require task-specific supervision. The reasoning chains help to identify the most precise answer to the commonsense question and its corresponding knowledge statements to justify the answer choice. This technology has proven its effectiveness in a diverse array of business domains.

A Perspective on Deep Vision Performance with Standard Image and Video Codecs

Resource-constrained hardware such as edge devices or cell phones often relies on cloud servers to provide the computational resources required for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs such as JPEG or H.264 is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.
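For context, mean intersection-over-union (mIoU) averages the per-class overlap between predicted and ground-truth labels. The sketch below computes it on toy, flattened label lists and reports a relative drop between a "clean" and a "coded" prediction; the data is invented for illustration, and the abstract does not say whether its 80% figure is a relative or an absolute reduction.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in `pred` or `target`."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    if not ious:
        return 0.0
    return sum(ious) / len(ious)

# Toy per-pixel class labels (hypothetical, not from the paper).
target = [0, 0, 1, 1, 2, 2]
clean_pred = [0, 0, 1, 1, 2, 2]   # prediction on the uncompressed input
coded_pred = [0, 1, 1, 1, 2, 0]   # prediction after heavy compression
clean = mean_iou(clean_pred, target, 3)
coded = mean_iou(coded_pred, target, 3)
relative_drop = (clean - coded) / clean
```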