SimCache: Similarity Caching for Efficient VLM-based Scene Understanding
Scene understanding systems analyze visual contexts by detecting objects, their attributes, and the interactions among them to provide a holistic interpretation. Understanding a scene requires analyzing multiple salient regions within a single video frame. Recently, Vision-Language Models (VLMs) have