Open SAT: How We Taught AI to Search Satellite Images Like a Search Engine

When a disaster strikes, every minute matters. Emergency responders need to know which areas have been flooded, where roads are blocked, and which neighborhoods still have intact structures. Historically, making sense of satellite imagery under those conditions has required trained analysts, specialized software, and hours of painstaking review. What if you could type “Find flooded residential areas” and let the system do the rest?

Introduction

That is the ambition behind Open-SAT, a new open-vocabulary satellite image retrieval system developed by our Integrated System researchers in collaboration with North South University, a private research university in Dhaka, Bangladesh. The paper, “Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery,” introduces a training-free approach that lets users query satellite imagery in plain English and receive highly accurate results without retraining models or predefining object categories. The authors are Md Adnan Arefeen (North South University; NEC Labs America intern in 2023 and 2024) and Biplob Debnath, Ravi K. Rajendran, Murugan Sankaradas, and Srimat T. Chakradhar (all of NEC Laboratories America, Inc.).

The Problem

Modern satellites produce images of staggering resolution. A single image of the Princeton, New Jersey area used in the research spans more than 16,000 by 9,600 pixels and covers roughly 70 square kilometers. Within that frame, objects like solar panels, construction sites, or swimming pools appear as clusters of just a few pixels. Finding a specific feature in an image of that size based on a natural-language query is not a trivial problem.

Existing vision-language models like CLIP, which match text queries to images by comparing their numeric representations (called embeddings), were not designed with satellite imagery in mind. They perform well when an image contains one dominant object, but struggle when a tile is dense with overlapping features. A tile containing a river might also show bridges, forests, and mountain ridges. A tile with a swimming pool might sit amid roads, parking lots, and rooftops. When CLIP tries to retrieve all “river” tiles, its similarity scores for river tiles and non-river tiles overlap significantly, making clean retrieval unreliable.

Compounding this is the threshold problem. Standard retrieval systems filter results by requiring a minimum similarity score between a query and an image. But no single threshold works well across different datasets, object types, or image conditions. Set it too high, and you miss relevant tiles. Set it too low, and you flood the system with noise.
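To make the threshold trade-off concrete, here is a minimal sketch of threshold-based filtering. The similarity scores and tile names are invented for illustration and are not taken from the paper; `retrieve_by_threshold` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List


def retrieve_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids of tiles whose similarity to the query meets the threshold."""
    return sorted(tile_id for tile_id, s in scores.items() if s >= threshold)


# Made-up similarity scores for the query "river". The overlap between
# river and non-river tiles is deliberate: it is the failure mode at issue.
scores = {
    "river_1": 0.31,
    "river_2": 0.24,   # a river tile crowded with bridges and forest
    "forest_1": 0.26,  # no river, yet it scores higher than river_2
    "parking_1": 0.12,
}

print(retrieve_by_threshold(scores, 0.30))  # strict cut-off
print(retrieve_by_threshold(scores, 0.20))  # loose cut-off
```

With these toy numbers, the strict cut-off returns only `river_1` and misses `river_2`, while the loose cut-off recovers `river_2` but also lets `forest_1` through. No single value separates the two groups.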

The Solution

Open-SAT addresses both problems with a two-stage architecture. In the ingestion phase, a satellite image is divided into small tiles (224 by 224 pixels each), encoded into embeddings using Remote-CLIP, a satellite-optimized variant of the CLIP model, and stored in a vector database. This work is done once, up front, with no knowledge of which queries will come later.
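The tiling step of ingestion can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the post does not say how partial tiles at the image edges are handled, so this sketch simply skips them, and `tile_boxes` is a hypothetical helper. Encoding with Remote-CLIP and writing to a vector database are left out because the post does not name the libraries involved.

```python
from typing import Iterator, Tuple

TILE = 224  # tile edge length in pixels, as stated in the post

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(width: int, height: int, tile: int = TILE) -> Iterator[Box]:
    """Yield pixel boxes for every full tile, row by row.

    Assumption: partial tiles at the right and bottom edges are skipped.
    """
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            yield (left, top, left + tile, top + tile)


# Hypothetical image size, chosen only to be consistent with
# "more than 16,000 by 9,600 pixels"; the exact size is not given.
boxes = list(tile_boxes(16800, 9632))
print(len(boxes), boxes[0], boxes[-1])
```

For that hypothetical 16,800 by 9,632 pixel input, the grid is 75 tiles wide and 43 tiles high, which gives 3,225 tiles. That happens to match the tile count reported for the Princeton demonstration, though the real image dimensions may differ.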

At query time, Open-SAT does something genuinely novel. Rather than adjusting a similarity threshold or fine-tuning the image encoder, it refines the text embedding itself using a large language model (LLM). The system first prompts the LLM to extract the object of interest from the user’s natural language query, then prompts it again to generate a list of objects that typically appear alongside that object in satellite imagery. A query for “river,” for example, might yield surrounding objects such as bridges, forests, wetlands, and roads.
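The two LLM calls described above might look roughly like the sketch below. The prompt wording, the function names, and the `fake_llm` stand-in are all assumptions made for illustration; the post does not give the actual prompts or say which LLM is used.

```python
from typing import Callable, List, Tuple


def build_extraction_prompt(query: str) -> str:
    """Hypothetical prompt for step 1: pull out the object of interest."""
    return ("Extract the single object of interest from this satellite image "
            f"search query. Reply with the object name only.\nQuery: {query}")


def build_context_prompt(target: str, k: int = 4) -> str:
    """Hypothetical prompt for step 2: ask for typical surrounding objects."""
    return (f"List {k} objects that typically appear near a {target} in "
            "satellite imagery. Reply with a comma-separated list only.")


def parse_object_list(reply: str) -> List[str]:
    """Split a comma-separated reply into clean, lower-cased, unique names."""
    seen, names = set(), []
    for part in reply.split(","):
        name = part.strip().strip(".").lower()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def plan_query(query: str, llm: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Run both prompts and return (target object, surrounding objects)."""
    target = llm(build_extraction_prompt(query)).strip().lower()
    reply = llm(build_context_prompt(target))
    surrounding = [n for n in parse_object_list(reply) if n != target]
    return target, surrounding


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if prompt.startswith("Extract"):
        return "River"
    return "Bridge, forest, wetland, road."


print(plan_query("Find all rivers", fake_llm))
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular provider; in practice `fake_llm` would be replaced by a real client call.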

The surrounding objects serve as the basis for a classification-style retrieval mechanism. Rather than asking “Is this tile similar enough to ‘river’?” the system asks, “Is this tile more like ‘river’ than like ‘bridge,’ ‘forest,’ or ‘road’?” Tiles where the object of interest wins that comparison are selected; all others are discarded, with no threshold required.
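A minimal sketch of that comparison, assuming cosine similarity and a brute-force loop over tiles. The three-dimensional vectors are toy values, and `select_tiles` is a hypothetical helper; a real system would use Remote-CLIP embeddings and a vector database rather than a Python dictionary.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tiles(tile_embs: Dict[str, Sequence[float]],
                 label_embs: Dict[str, Sequence[float]],
                 target: str) -> List[str]:
    """Keep tiles whose most similar label is the target object."""
    if target not in label_embs:
        raise KeyError(f"no embedding for target {target!r}")
    selected = []
    for tile_id, emb in tile_embs.items():
        best = max(label_embs, key=lambda name: cosine(emb, label_embs[name]))
        if best == target:
            selected.append(tile_id)
    return selected


# Toy embeddings: one axis per concept, purely for illustration.
label_embs = {"river": [1, 0, 0], "bridge": [0, 1, 0], "forest": [0, 0, 1]}
tile_embs = {
    "t1": [0.9, 0.3, 0.1],   # mostly river
    "t2": [0.4, 0.8, 0.1],   # mostly bridge
    "t3": [0.5, 0.2, 0.45],  # weak signal, but river still wins
}

print(select_tiles(tile_embs, label_embs, "river"))
```

Note that `t3` is kept even though its absolute similarity to “river” is modest. The decision depends only on which label ranks first, which is why no cut-off value has to be chosen.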

Open-SAT goes one step further with a technique the researchers call text embedding modification. Inspired by the classic word-vector arithmetic that yields results like Queen ≈ King − Man + Woman, the system adjusts the query embedding to better reflect how the target object appears in context. It computes embeddings for phrases like “a satellite photo of a river with a bridge” and subtracts the influence of “a satellite photo of a bridge,” nudging the final embedding to more precisely represent the river itself rather than everything around it. The adjusted embeddings for each surrounding object are averaged together to produce a refined query vector that the system uses for the final similarity search.
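The arithmetic can be sketched as follows. The post describes the idea but not the exact formula, so the plain subtraction, the unweighted mean, and the final normalization are assumptions, as are the names `refine_query` and `toy_encode`. The toy encoder just counts vocabulary words so that the example runs without a real text encoder.

```python
import math
from typing import Callable, List, Sequence

Encoder = Callable[[str], Sequence[float]]


def refine_query(target: str, surrounding: List[str], encode: Encoder) -> List[float]:
    """Average (target-with-context minus context-only) over surrounding objects.

    Assumptions: plain subtraction, unweighted mean, unit-length output.
    """
    if not surrounding:
        raise ValueError("need at least one surrounding object")
    diffs = []
    for obj in surrounding:
        with_ctx = encode(f"a satellite photo of a {target} with a {obj}")
        ctx_only = encode(f"a satellite photo of a {obj}")
        diffs.append([w - c for w, c in zip(with_ctx, ctx_only)])
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


VOCAB = ["river", "bridge", "forest"]


def toy_encode(text: str) -> List[float]:
    """Toy stand-in for a text encoder: counts of each vocabulary word."""
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]


print(refine_query("river", ["bridge", "forest"], toy_encode))
```

With the toy encoder, the context words cancel exactly and the refined vector points purely along the “river” axis. Real embeddings will not cancel so cleanly, but the intent is the same: reduce the contribution of the surroundings and keep the target.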

“Open-vocabulary retrieval in satellite imagery is especially challenging because a single tile can span hundreds of meters and contain dozens of visually distinct objects. By using LLMs to reason about what surrounds an object of interest, Open-SAT shifts retrieval from a simple similarity comparison to a context-aware classification, and that makes a significant difference in precision and recall,” said Biplob Debnath, Senior Researcher.

The Results

The results bear that out. Tested on three publicly available satellite imagery benchmarks, Open-SAT improved F1 scores by up to 16 percentage points over the threshold-based Remote-CLIP baseline. On the UCM dataset, which contains 21 fine-grained land-use categories and considerable visual overlap between classes, Open-SAT achieved a recall of 83.57% compared to 50.05% for the baseline, an improvement of more than 33 percentage points. Critically, these gains came without additional training or dataset-specific tuning, while retrieving a comparable number of tiles overall.
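For readers less familiar with the metrics, the sketch below shows how precision, recall, and F1 are computed for a single query. These are the standard definitions; the tile ids are invented, and the post does not describe the paper's evaluation code. The last lines simply check the recall gap quoted above.

```python
from typing import Set, Tuple


def precision_recall_f1(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Standard set-based precision, recall, and F1 for one query."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: four tiles returned, three tiles truly relevant.
retrieved = {"t1", "t2", "t3", "t4"}
relevant = {"t1", "t2", "t5"}
p, r, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Recall figures quoted in the post for the UCM dataset.
recall_gain = 83.57 - 50.05
print(f"recall gain: {recall_gain:.2f} percentage points")
```

F1 is the harmonic mean of precision and recall, so it only rises when a system retrieves more of the relevant tiles without also pulling in a lot of irrelevant ones.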

Per-class analysis tells an equally compelling story. On the EuroSAT dataset, Open-SAT improved recall in 8 of 10 scene categories, with the largest gains in structurally complex categories like residential and industrial zones, where recall increased by more than 8 percentage points. On UCM, 16 of 21 categories saw improvement, with several urban land-use categories gaining more than 15 percentage points.

The system is also practical to deploy. A demonstration described in the paper shows a user uploading a high-resolution Princeton-area satellite image, clicking a button to index its 3,225 tiles in about 35 seconds, then submitting the query “Solar panel.” The system returns 932 matching tile instances in 3 seconds, each one a verifiable piece of evidence extracted from the original image.

Real-World Applications

What makes Open-SAT particularly valuable for real-world applications is its zero-shot design. Users do not need to define categories in advance or label training data for new object types. If a query involves an object the system has never encountered before, the LLM can still reason about its surroundings and refine the embedding accordingly. That flexibility is critical for the kinds of open-ended, exploratory queries that analysts ask in fields like environmental monitoring, urban planning, insurance assessment, and disaster response.

Future directions outlined in the paper include further refining retrieval accuracy, expanding to broader datasets, and extending Open-SAT for real-time monitoring applications. As satellite imagery becomes more abundant and more accessible, tools that let non-specialists ask plain-language questions of that data will only become more important. Open-SAT is a meaningful step in that direction.

