Open-SAT: How We Taught AI to Search Satellite Images Like a Search Engine
When a disaster strikes, every minute matters. Emergency responders need to know which areas have been flooded, where roads are blocked, and which neighborhoods still have intact structures. Historically, making sense of satellite imagery under those conditions has required trained analysts, specialized software, and hours of painstaking review. What if you could type “Find flooded residential areas” and let the system do the rest?
Introduction
That is the ambition behind Open-SAT, a new open-vocabulary satellite image retrieval system developed by our Integrated Systems researchers in collaboration with North South University, a private research university in Dhaka, Bangladesh. The paper, “Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery,” introduces a training-free approach that allows users to query satellite imagery in plain English and receive highly accurate results without retraining models or predefining object categories. The authors of the paper are Md Adnan Arefeen (2023 and 2024 NEC Labs America Intern, North South University) and Biplob Debnath, Ravi K. Rajendran, Murugan Sankaradas, and Srimat T. Chakradhar (all of NEC Laboratories America, Inc.).
The Problem
Modern satellites produce images of staggering resolution. A single image of the Princeton, New Jersey area used in the research spans more than 16,000 by 9,600 pixels and covers roughly 70 square kilometers. Within that frame, objects like solar panels, construction sites, or swimming pools appear as clusters of just a few pixels. Finding a specific feature in an image of that size based on a natural-language query is not a trivial problem.
Existing vision-language models like CLIP, which match text queries to images by comparing their numeric representations (called embeddings), were not designed with satellite imagery in mind. They perform well when an image contains one dominant object, but struggle when a tile is dense with overlapping features. A tile containing a river might also show bridges, forests, and mountain ridges. A tile with a swimming pool might sit amid roads, parking lots, and rooftops. When CLIP tries to retrieve all “river” tiles, its similarity scores for river tiles and non-river tiles overlap significantly, making clean retrieval unreliable.
Compounding this is the threshold problem. Standard retrieval systems filter results by requiring a minimum similarity score between a query and an image. But no single threshold works well across different datasets, object types, or image conditions. Set it too high, and you miss relevant tiles. Set it too low, and you flood the system with noise.
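To make the threshold problem concrete, here is a minimal sketch of threshold-based retrieval over cosine similarities. The scores and tile labels are invented for illustration and are not values from the paper; the function names are our own.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, tiles: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `tiles`."""
    q = query / np.linalg.norm(query)
    t = tiles / np.linalg.norm(tiles, axis=1, keepdims=True)
    return t @ q

def retrieve_by_threshold(scores: np.ndarray, threshold: float) -> list[int]:
    """Return indices of tiles whose similarity meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical similarity scores for the query "river".
# Tiles 0-2 contain a river; tiles 3-5 do not. The two groups overlap.
scores = np.array([0.31, 0.27, 0.22, 0.28, 0.24, 0.15])
is_river = [True, True, True, False, False, False]

for threshold in (0.30, 0.25, 0.20):
    hits = retrieve_by_threshold(scores, threshold)
    missed = [i for i in range(len(scores)) if is_river[i] and i not in hits]
    noise = [i for i in hits if not is_river[i]]
    print(f"threshold={threshold:.2f} hits={hits} missed={missed} noise={noise}")
```

With these made-up scores, no cutoff returns all three river tiles without also returning a non-river tile, which is the overlap problem described above.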
The Solution
Open-SAT addresses both problems with a two-stage architecture. In the ingestion phase, a satellite image is divided into small tiles (224 by 224 pixels each), encoded into embeddings using Remote-CLIP, a satellite-optimized variant of the CLIP model, and stored in a vector database. This work is done once, up front, with no knowledge of which queries will come later.
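The ingestion phase can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the zero-padding of edge tiles is our choice, `toy_encoder` is a placeholder for Remote-CLIP, and a plain in-memory array stands in for the vector database.

```python
import numpy as np

TILE = 224  # tile side in pixels, as stated in the article

def tile_image(image: np.ndarray, tile: int = TILE):
    """Yield (row, col, patch) per tile, zero-padding the right and bottom edges."""
    h, w = image.shape[:2]
    pad_h = (-h) % tile  # pixels needed to reach the next multiple of `tile`
    pad_w = (-w) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    for r in range(0, padded.shape[0], tile):
        for c in range(0, padded.shape[1], tile):
            yield r // tile, c // tile, padded[r:r + tile, c:c + tile]

def ingest(image: np.ndarray, encode_tile):
    """Encode every tile once; return grid positions and an embedding matrix."""
    positions, vectors = [], []
    for row, col, patch in tile_image(image):
        positions.append((row, col))
        vectors.append(encode_tile(patch))
    return positions, np.stack(vectors)

def toy_encoder(patch: np.ndarray) -> np.ndarray:
    """Placeholder encoder (an assumption): mean value per color channel."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)

# A small blank RGB image (height 500, width 700) keeps the example fast.
image = np.zeros((500, 700, 3), dtype=np.uint8)
positions, index = ingest(image, toy_encoder)
print(len(positions), index.shape)
```

Because nothing in this step depends on the query, the resulting index can be reused for any number of later searches.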
At query time, Open-SAT does something genuinely novel. Rather than adjusting a similarity threshold or fine-tuning the image encoder, it refines the text embedding itself using a large language model (LLM). The system first prompts the LLM to extract the object of interest from the user’s natural language query, then prompts it again to generate a list of objects that typically appear alongside that object in satellite imagery. A query for “river,” for example, might yield surrounding objects such as bridges, forests, wetlands, and roads.
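The two prompting steps might look something like the sketch below. The prompt wording, the function names, and the `fake_llm` stand-in are all hypothetical; the paper's actual prompts and LLM are not reproduced here.

```python
from typing import Callable

def extract_object(query: str, llm: Callable[[str], str]) -> str:
    """Step 1: ask the LLM for the object of interest in the user's query."""
    prompt = (
        "Extract the single object of interest from this satellite image "
        f"search query. Reply with the object name only.\nQuery: {query}"
    )
    return llm(prompt).strip().lower()

def surrounding_objects(target: str, llm: Callable[[str], str], k: int = 4) -> list[str]:
    """Step 2: ask the LLM for objects that commonly appear near the target."""
    prompt = (
        f"List {k} objects that typically appear near a {target} in satellite "
        "imagery. Reply with a comma-separated list only."
    )
    items = [s.strip().lower() for s in llm(prompt).split(",")]
    # Drop empty entries and the target itself, then keep at most k items.
    return [s for s in items if s and s != target][:k]

def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real LLM call, so the sketch runs offline."""
    if prompt.startswith("Extract"):
        return "River"
    return "bridge, forest, wetland, road"

target = extract_object("Find all rivers in this region", fake_llm)
neighbors = surrounding_objects(target, fake_llm)
print(target, neighbors)
```

In practice `fake_llm` would be replaced by a call to whichever LLM is available; the parsing step matters because model replies are free text.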
The surrounding objects serve as the basis for a classification-style retrieval mechanism. Rather than asking “Is this tile similar enough to ‘river’?” the system asks, “Is this tile more like ‘river’ than ‘bridge,’ ‘forest,’ or ‘road’?” Tiles where the object of interest wins that comparison are selected; all others are discarded. No threshold is required.
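A minimal sketch of that classification-style selection, using toy 3-dimensional vectors in place of real Remote-CLIP embeddings (the numbers and the function name are made up for illustration):

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify_retrieve(tile_embs: np.ndarray, label_embs: np.ndarray,
                      target_index: int = 0) -> list[int]:
    """Return indices of tiles whose most similar label is the target label."""
    sims = normalize(tile_embs) @ normalize(label_embs).T  # shape: (tiles, labels)
    winners = sims.argmax(axis=1)
    return np.flatnonzero(winners == target_index).tolist()

# Toy label embeddings, in order: river (target), bridge, forest.
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tiles = np.array([
    [0.9, 0.3, 0.1],    # mostly river
    [0.4, 0.8, 0.2],    # mostly bridge
    [0.5, 0.4, 0.45],   # mixed, river wins narrowly
    [0.1, 0.2, 0.9],    # mostly forest
])
print(classify_retrieve(tiles, labels))  # [0, 2]
```

Note that the third tile is kept even though its river score is modest in absolute terms: what matters is that "river" beats the alternatives, not that it clears a fixed cutoff.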
Open-SAT goes one step further with a technique the researchers call text embedding modification. Inspired by the classic word-vector arithmetic that yields results like Queen ≈ King − Man + Woman, the system adjusts the query embedding to better reflect how the target object appears in context. It computes embeddings for phrases like “a satellite photo of a river with a bridge” and subtracts the influence of “a satellite photo of a bridge,” nudging the final embedding to more precisely represent the river itself rather than everything around it. The adjusted embeddings for each surrounding object are averaged together to produce a refined query vector that the system uses for the final similarity search.
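One plausible reading of that arithmetic is sketched below. The exact formula in the paper (any weighting or normalization) may differ; here we simply subtract and average as described above. `toy_embed` is a hash-based placeholder we invented so the example runs without a model, and it carries no semantic meaning.

```python
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Deterministic pseudo-embedding from a hash; NOT a real text encoder."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def refine_query(target: str, neighbors: list[str], embed=toy_embed) -> np.ndarray:
    """Average of (target-in-context minus context-only) over all neighbors."""
    template = "a satellite photo of a {}"
    adjusted = []
    for n in neighbors:
        in_context = embed(template.format(f"{target} with a {n}"))
        context_only = embed(template.format(n))
        adjusted.append(in_context - context_only)
    refined = np.mean(adjusted, axis=0)
    # Re-normalize (our assumption) so the result works with cosine similarity.
    return refined / np.linalg.norm(refined)

q = refine_query("river", ["bridge", "forest", "road"])
print(q.shape, round(float(np.linalg.norm(q)), 6))
```

With a real text encoder passed in as `embed`, the returned vector would replace the plain "river" embedding in the similarity search.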
“Open-vocabulary retrieval in satellite imagery is especially challenging because a single tile can span hundreds of meters and contain dozens of visually distinct objects. By using LLMs to reason about what surrounds an object of interest, Open-SAT shifts retrieval from a simple similarity comparison to a context-aware classification, and that makes a significant difference in precision and recall,” said Biplob Debnath, Senior Researcher.
The Results
The results bear that out. Tested on three publicly available satellite imagery benchmarks, Open-SAT improved F1 scores by up to 16 percentage points over the threshold-based Remote-CLIP baseline. On the UCM dataset, which contains 21 fine-grained land-use categories and considerable visual overlap between classes, Open-SAT achieved a recall of 83.57% compared to 50.05% for the baseline, an improvement of more than 33 percentage points. Critically, these gains came without additional training or dataset-specific tuning, while retrieving a comparable number of tiles overall.
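For readers less familiar with the metrics, the short sketch below shows how precision, recall, and F1 are computed for a retrieval result. The tile IDs are hypothetical; only the two recall figures in the last line come from the article.

```python
def precision_recall_f1(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Standard retrieval metrics from sets of retrieved and relevant tile IDs."""
    tp = len(retrieved & relevant)  # true positives: relevant tiles we returned
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical tile IDs: 6 relevant tiles, 7 retrieved, 5 in common.
relevant = {1, 2, 3, 4, 5, 6}
retrieved = {1, 2, 3, 4, 5, 9, 10}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Gap between the two UCM recall figures quoted above, in percentage points.
print(round(83.57 - 50.05, 2))  # 33.52
```

F1 is the harmonic mean of precision and recall, so it only rises when a system finds more relevant tiles without also returning much more noise.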
Per-class analysis tells an equally compelling story. On the EuroSAT dataset, Open-SAT improved recall in 8 of 10 scene categories, with the largest gains in structurally complex categories like residential and industrial zones, where recall increased by more than 8 percentage points. On UCM, 16 of 21 categories saw improvement, with several urban land-use categories gaining more than 15 percentage points.
The system is also practical to deploy. A demonstration described in the paper shows a user uploading a high-resolution Princeton-area satellite image, clicking a button to index its 3,225 tiles in about 35 seconds, then submitting the query “Solar panel.” The system returns 932 matching tile instances in 3 seconds, each one a verifiable piece of evidence extracted from the original image.
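As a quick sanity check on those demo numbers, the arithmetic below relates tile count, tile size, and indexing speed. The image dimensions used here are hypothetical: they are simply one size consistent with “more than 16,000 by 9,600 pixels” that yields exactly 3,225 tiles of 224 pixels, assuming partial edge tiles are counted as full tiles.

```python
import math

def tile_grid(width: int, height: int, tile: int = 224) -> tuple[int, int]:
    """Columns and rows of tiles needed to cover an image, rounding up at the edges."""
    return math.ceil(width / tile), math.ceil(height / tile)

# Hypothetical dimensions chosen to match the reported tile count.
cols, rows = tile_grid(16_800, 9_632)
print(cols, rows, cols * rows)  # 75 43 3225

# Indexing throughput implied by "3,225 tiles in about 35 seconds".
print(round(3225 / 35))  # 92 tiles per second
```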
Real-World Applications
What makes Open-SAT particularly valuable for real-world applications is its zero-shot design. Users do not need to define categories in advance or label training data for new object types. If a query involves an object the system has never encountered before, the LLM can still reason about its surroundings and refine the embedding accordingly. That flexibility is critical for the kinds of open-ended, exploratory queries that analysts ask in fields like environmental monitoring, urban planning, insurance assessment, and disaster response.
Future directions outlined in the paper include further refining retrieval accuracy, expanding to broader datasets, and extending Open-SAT for real-time monitoring applications. As satellite imagery becomes more abundant and more accessible, tools that let non-specialists ask plain-language questions of that data will only become more important. Open-SAT is a meaningful step in that direction.
About The Authors
Biplob Debnath is a Senior Researcher in the Integrated Systems Department at NEC Laboratories America, where he leads global initiatives in generative AI, large language models, and multimodal analytics. He holds a PhD in Electrical and Computer Engineering from the University of Minnesota, an Executive MBA from the Quantic School of Business and Technology, and a Bachelor of Science in Computer Science & Engineering from the Bangladesh University of Engineering and Technology. At NEC, his research spans multimodal AI, remote sensing, video analytics, log analytics, and data deduplication. He has been instrumental in developing NEC’s AI infrastructure stack, driving solutions across industries including telecommunications, finance, transportation, and smart cities.
Ravi K. Rajendran is a Senior Associate Researcher in the Integrated Systems Department at NEC Laboratories America. He received his MS in Computer Science from Boston University and his BE in Electronics and Communication Engineering from Anna University. His research focuses on real-time embedded systems, sensor networks, and AI acceleration. At NEC, he contributes to the development of integrated platforms for smart infrastructure and automation, as well as to projects on distributed computing, AI infrastructure, and system integration. Ravi supports research efforts in developing scalable computing environments and data pipelines that underpin NEC’s enterprise-grade AI and analytics solutions. His work spans low-level systems programming, algorithm optimization, and deployment testing.
Publication-to-Blog Post Series
Our Publication-to-Blog Post Series highlights the real-world impact of our latest research, translating complex innovations into practical applications. From AI and machine learning to optical networking and intelligent systems, we showcase how our work goes beyond theory to address real-world challenges. Explore how cutting-edge research at NEC Laboratories America is driving measurable outcomes across industries.








