SimCache: Similarity Caching for Efficient VLM-based Scene Understanding

Publication Date: 6/11/2025

Event: ELVM Efficient Large Vision Models CVPR Workshop (2nd Edition)

Reference: pp. 1-10, 2025

Authors: Surya Selvam, NEC Laboratories America, Inc., Purdue University; Ravi K. Rajendran, NEC Laboratories America, Inc.; Murugan Sankaradas, NEC Laboratories America, Inc.; Anand Raghunathan, Purdue University; Srimat T. Chakradhar, NEC Laboratories America, Inc.

Abstract: Scene understanding systems analyze visual contexts by detecting objects, their attributes, and the interactions among them to provide a holistic interpretation. Understanding a scene requires analyzing multiple salient regions within a single video frame. Recently, Vision-Language Models (VLMs) have emerged as powerful tools for scene understanding, leveraging learned world knowledge to enable deployment without specialized training or fine-tuning. However, deploying VLMs in real-time applications is challenging due to their high computational and memory requirements, which limit processing throughput. We propose SimCache, a novel software-based caching mechanism that optimizes VLM-based scene understanding systems by reducing redundant computations. SimCache stores the embedding representation of a salient region and its detected activity, enabling the reuse of VLM computations for similar regions in future frames. Specifically, SimCache exploits two types of redundancy: (1) temporal locality, reusing computations for similar regions across adjacent frames, and (2) semantic locality, reusing computations for visually distinct regions that represent the same activity at different times. SimCache includes a multi-tier cache architecture with specialized cache search and refinement policies to exploit redundancy efficiently and accurately. Experiments on action recognition datasets demonstrate that SimCache improves system throughput by up to 9.4× and reduces VLM computations by up to 24.4× with minimal accuracy loss.
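Illustrative sketch: the core reuse idea in the abstract (store a region's embedding with its detected activity, then reuse that result for sufficiently similar regions) might look roughly like the single-tier cache below. This is not the authors' implementation. The cosine-similarity metric, the `threshold` and `capacity` values, the FIFO eviction rule, and the names `SimilarityCache` and `run_vlm` are all assumptions made for illustration, and the multi-tier architecture and the search and refinement policies mentioned in the abstract are not modeled.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SimilarityCache:
    """Hypothetical single-tier cache mapping region embeddings to activity labels."""

    threshold: float = 0.9  # assumed similarity cutoff, not a value from the paper
    capacity: int = 128     # assumed maximum number of cached entries
    entries: List[Tuple[List[float], str]] = field(default_factory=list)
    hits: int = 0
    misses: int = 0

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the label of the most similar cached entry at or above the threshold."""
        best_label, best_score = None, self.threshold
        for cached, label in self.entries:
            score = cosine_similarity(embedding, cached)
            if score >= best_score:
                best_label, best_score = label, score
        return best_label

    def insert(self, embedding: Sequence[float], label: str) -> None:
        """Store a new entry, evicting the oldest one when full (FIFO, an assumption)."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)
        self.entries.append((list(embedding), label))

    def classify(self, embedding: Sequence[float],
                 run_vlm: Callable[[Sequence[float]], str]) -> str:
        """Reuse a cached label on a hit; otherwise call the VLM and cache its result."""
        label = self.lookup(embedding)
        if label is not None:
            self.hits += 1
            return label
        self.misses += 1
        label = run_vlm(embedding)  # the expensive call the cache tries to avoid
        self.insert(embedding, label)
        return label
```

In this sketch, calling `classify` on embeddings of salient regions from successive frames invokes `run_vlm` only on misses, so the ratio of `misses` to total lookups gives a rough sense of how much VLM computation is avoided.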
