Structured Representations for Fine-Grained Text-to-Image Retrieval in Remote Sensing
Problem Statement & Motivation
Traditional CLIP-based models represent both text and images as single vectors within a shared embedding space. While effective for generic retrieval, this approach fails to capture the fine-grained spatial and semantic relationships that characterize remote sensing (RS) imagery, such as object counts, arrangements, and contextual interactions.
Remote sensing datasets are often visually repetitive and textually generic, limiting retrieval quality and interpretability. For industries such as insurance and environmental monitoring, accurate retrieval requires a deeper, structured understanding of scenes.
This work addresses these challenges by introducing structured, multimodal representations—spanning vector, graph, and scene levels—to enhance cross-modal alignment and reasoning.
Our Approach
The proposed framework integrates graph-based and scene-based representations on top of traditional vector embeddings, yielding a unified retrieval system capable of reasoning over object relations, attributes, and global semantics.
1. Fine-Grained Dataset Creation
- Generated fine-grained captions using a large vision–language model (Gemini-2.5-flash); a minimal calling sketch follows this list.
- Conducted human evaluations showing improved semantic richness and diversity over existing datasets (e.g., RSICD, NWPU, RSITMD).
- Released new benchmark-ready datasets for fine-grained RS retrieval tasks.
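The exact prompting pipeline is not reproduced here, but the captioning step can be sketched as follows, assuming the google-generativeai Python SDK and an illustrative prompt. The API key placeholder, file name, and prompt wording are hypothetical; only the model name comes from the item above.

```python
# Minimal captioning sketch (assumes the google-generativeai SDK; prompt is illustrative).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.5-flash")

PROMPT = (
    "Describe this aerial image in one detailed paragraph. "
    "Mention object categories, approximate counts, spatial arrangement, "
    "and interactions between objects (e.g., 'three trucks parked beside a warehouse')."
)

def caption_image(path: str) -> str:
    """Return a fine-grained caption for a single remote sensing image."""
    image = Image.open(path)
    response = model.generate_content([PROMPT, image])
    return response.text.strip()

if __name__ == "__main__":
    print(caption_image("example_tile.png"))  # hypothetical file name
```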
2. Structured Multimodal Representations
- Image-to-Graph: scene graph generation from RS images via VLM-based object, attribute, and relation extraction (see the graph-construction sketch after this list).
- Text-to-Graph: structured graph extraction from captions for relational grounding.
- Scene Decomposition: lightweight alternative to graphs—segmenting images and captions into semantically coherent regions.
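A minimal sketch of the graph construction, assuming the VLM is asked to return objects, attributes, and relation triples in a JSON-like schema; the field names and the example scene are hypothetical, and networkx is used only for illustration. The same builder serves Text-to-Graph, since captions can be parsed into the same schema.

```python
# Building a scene graph from structured VLM output (hypothetical schema).
import networkx as nx

# Example of the structure one might request from the VLM for an airport scene.
vlm_output = {
    "objects": [
        {"id": "plane_1", "label": "airplane", "attributes": ["white", "parked"]},
        {"id": "runway_1", "label": "runway", "attributes": ["long"]},
        {"id": "terminal_1", "label": "terminal", "attributes": ["large"]},
    ],
    "relations": [
        ("plane_1", "adjacent_to", "terminal_1"),
        ("plane_1", "on", "runway_1"),
    ],
}

def build_scene_graph(data: dict) -> nx.DiGraph:
    """Nodes carry object labels/attributes; directed edges carry relation names."""
    g = nx.DiGraph()
    for obj in data["objects"]:
        g.add_node(obj["id"], label=obj["label"], attributes=obj["attributes"])
    for subj, rel, obj in data["relations"]:
        g.add_edge(subj, obj, relation=rel)
    return g

graph = build_scene_graph(vlm_output)
print(graph.nodes(data=True))
print(graph.edges(data=True))
```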
3. Retrieval Framework
Implemented multiple cross-modal retrieval modes:
- V2V – CLIP-style global embedding matching (a minimal implementation sketch follows this list).
- G2G / VG2VG – graph-level and hybrid vector–graph matching using Graph Matching Networks (GMN).
- S2S / V2S / S2V – scene-level retrieval as a cost-effective alternative requiring no graph construction.
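As a reference point, V2V matching is plain CLIP-style similarity; the sketch below implements it with a Hugging Face transformers CLIP checkpoint and adds a placeholder weighted fusion to suggest how a vector–graph hybrid score could be formed. The checkpoint choice, the fusion rule, and alpha are assumptions, not the thesis's VG2VG design.

```python
# V2V retrieval with CLIP, plus a sketched vector-graph fusion (hypothetical weighting).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def v2v_scores(query: str, image_paths: list[str]) -> torch.Tensor:
    """Cosine similarity between one caption and a set of images (global embeddings)."""
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).squeeze(-1)  # one score per candidate image

def hybrid_scores(vector_scores: torch.Tensor, graph_scores: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Placeholder fusion: weighted sum of global and graph-level similarities."""
    return alpha * vector_scores + (1 - alpha) * graph_scores
```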
Key Results
- VG2VG (Hybrid Vector–Graph) achieves a +8.30% mean accuracy gain over baselines.
- Scene-based matching provides a +3.55% gain with minimal computation, making it well suited to scalable deployment (a training-free sketch follows this list).
- Human evaluations confirm stronger grounding and interpretability for fine-grained captions.
- Demonstrated applicability to real-world insurance and risk assessment scenarios.
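To illustrate why scene-based matching is cheap, here is one plausible training-free variant: split the caption into sentences, cut the image into tiles, embed both with an off-the-shelf CLIP model, and aggregate the pairwise similarities. The grid tiling, sentence splitting, and max-then-mean aggregation are assumptions rather than the thesis procedure.

```python
# Training-free scene-level matching sketch (tiling and aggregation are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_tiles(img: Image.Image, grid: int = 2) -> list[Image.Image]:
    """Cut the image into a grid x grid set of non-overlapping tiles."""
    w, h = img.size
    tw, th = w // grid, h // grid
    return [img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
            for r in range(grid) for c in range(grid)]

def scene_score(caption: str, image_path: str) -> float:
    """Average, over caption sentences, of the similarity to the best-matching tile."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    tiles = image_tiles(Image.open(image_path))
    inputs = processor(text=sentences, images=tiles, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    sim = v @ t.T                      # tiles x sentences
    return sim.max(dim=0).values.mean().item()
```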
Contributions
- Fine-Grained Caption Generation and Validation for remote sensing datasets.
- Unified multimodal retrieval framework combining global, graph, and scene-based representations.
- Novel Graph Matching Network (GMN) for dynamic graph similarity scoring (a schematic cross-graph attention sketch follows this list).
- Scene-based retrieval as a lightweight, training-free alternative.
- Extensive benchmarking across RSICD, NWPU, RSITMD, and FIT-RS datasets.
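For intuition on graph similarity scoring, the snippet below gives a schematic of cross-graph attention in the spirit of graph matching networks; the dimensions, pooling, and scoring head are invented for illustration and do not reproduce the thesis's GMN.

```python
# Schematic cross-graph attention similarity in the spirit of graph matching networks.
# Dimensions, pooling, and the scoring head are assumptions, not the thesis design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphMatcher(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.node_proj = nn.Linear(dim, dim)

    def cross_attention(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        """For each node in graph A, an attention-weighted summary of graph B's nodes."""
        attn = torch.softmax(h_a @ h_b.T, dim=-1)      # |A| x |B|
        return attn @ h_b                               # |A| x dim

    def forward(self, nodes_a: torch.Tensor, nodes_b: torch.Tensor) -> torch.Tensor:
        """nodes_*: node feature matrices (|V| x dim) for the two graphs."""
        h_a = torch.relu(self.node_proj(nodes_a))
        h_b = torch.relu(self.node_proj(nodes_b))
        # Augment each node with what it "sees" in the other graph, then pool.
        g_a = (h_a - self.cross_attention(h_a, h_b)).mean(dim=0)
        g_b = (h_b - self.cross_attention(h_b, h_a)).mean(dim=0)
        return F.cosine_similarity(g_a, g_b, dim=0)    # graph-level similarity score

# Usage with random node features for two small graphs (illustrative only).
matcher = CrossGraphMatcher(dim=128)
score = matcher(torch.randn(5, 128), torch.randn(7, 128))
print(float(score))
```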
Defense & Supervision
- Institution: EPFL – School of Computer and Communication Sciences
- Supervisors: Prof. Devis Tuia (EPFL ENAC) & Ciprian Tomoiaga (AXA Group Operations)
- Defense Date: August 26, 2025
- Author: Oussama Gabouj
Resources
- Date: August 28, 2025
- PDF: Full Thesis PDF
- Slides: Presentation Slides
This work bridges vision–language modeling and remote sensing, paving the way for structured, interpretable, and fine-grained retrieval systems in complex geospatial domains.
