Bifröst: Peer-to-peer Load-balancing for Function Execution in Agentic AI Systems
Publication Date: August 25, 2025
Event: 31st International European Conference on Parallel and Distributed Computing (Euro-Par 2025), Dresden, Germany
Reference: LNCS 15900, pp. 279–291, 2025
Authors: Giuseppe Coviello, NEC Laboratories America, Inc.; Kunal Rao, NEC Laboratories America, Inc.; Mohammad A. Khojastepour, NEC Laboratories America, Inc.; Srimat T. Chakradhar, NEC Laboratories America, Inc.
Abstract: Agentic AI systems rely on Large Language Models (LLMs) to execute complex tasks by invoking external functions. The efficiency of these systems depends on how well function execution is managed, especially under heterogeneous and high-variance workloads, where function execution times can range from milliseconds to several seconds. Traditional load-balancing techniques, such as round-robin, least-loaded, and Peak-EWMA (used in Linkerd), struggle in such settings: round-robin ignores load imbalance, least-loaded reacts slowly to rapid workload shifts, and Peak-EWMA relies on latency tracking, which is ineffective for workloads with high execution time variability. In this paper, we introduce Bifröst, a peer-to-peer load-balancing mechanism that distributes function requests based on real-time active request count rather than latency estimates. Instead of relying on centralized load-balancers or client-side decisions, Bifröst enables function-serving pods to dynamically distribute load by comparing queue lengths and offloading requests accordingly. This avoids unnecessary overhead while ensuring better responsiveness under high-variance workloads. Our evaluation on open-vocabulary object detection, multi-modal understanding, and code generation workloads shows that Bifröst improves function completion time by up to 20% when processing 13,700 requests from 137 AI agents on a 32-node Kubernetes cluster, outperforming both OpenFaaS and OpenFaaS with Linkerd. In an AI-driven insurance claims processing workflow, Bifröst achieves up to 25% faster execution.
Publication Link: https://link.springer.com/chapter/10.1007/978-3-031-99854-6_19
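The core idea in the abstract — function-serving pods comparing real-time active request counts and offloading work to a lighter peer, rather than routing on latency estimates — can be sketched as follows. This is a minimal illustration only: the `Pod` class, the `offload_threshold` policy, and the peer wiring are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of Bifröst-style peer-to-peer offloading: each pod
# tracks its active request count and forwards an incoming request to a
# less-loaded peer when the load gap exceeds a threshold. The threshold
# policy and class names below are illustrative assumptions.

class Pod:
    def __init__(self, name, peers=None):
        self.name = name
        self.active = 0          # real-time active request count
        self.peers = peers or []

    def handle(self, request_id, offload_threshold=1):
        # Compare own load against the least-loaded peer.
        least = min(self.peers, key=lambda p: p.active, default=None)
        if least is not None and self.active - least.active > offload_threshold:
            return least.execute(request_id)   # offload to the lighter peer
        return self.execute(request_id)

    def execute(self, request_id):
        self.active += 1
        # ... function body would run here; completion decrements self.active ...
        return self.name

# Usage: three pods, one already busy; a new request drains to a lighter peer.
a, b, c = Pod("a"), Pod("b"), Pod("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.active = 5                       # simulate a hot pod
target = a.handle("req-1")
print(target)                      # request lands on a lighter peer, not "a"
```

Because the decision uses only instantaneous queue lengths, it avoids the stale-latency problem the abstract attributes to Peak-EWMA under high-variance workloads.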