Structured Representations for Fine-Grained Text-to-Image Retrieval in Remote Sensing
This thesis introduces a multimodal framework for fine-grained text-to-image retrieval in remote sensing that combines global and structured representations through scene graphs and semantic region embeddings. It addresses a key limitation of CLIP-style global embeddings, which compress an entire scene into a single vector, by explicitly capturing the spatial, semantic, and relational details critical for fine-grained reasoning.
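
As a minimal sketch of how combining global and region-level representations might score a text-image pair at retrieval time, the snippet below fuses a global cosine similarity with the best-matching semantic-region similarity. The fusion weight `alpha`, the function names, and the use of precomputed embeddings are illustrative assumptions, not the thesis's actual method; a scene-graph relational term is omitted here for brevity.

```python
import numpy as np

def cosine_sim(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

def retrieval_score(text_emb: np.ndarray,
                    global_img_emb: np.ndarray,
                    region_embs: np.ndarray,
                    alpha: float = 0.5) -> float:
    """Late-fusion score: weighted sum of the global image-text similarity
    and the best-matching region similarity (alpha is a hypothetical weight)."""
    global_term = float(cosine_sim(text_emb, global_img_emb[None, :])[0])
    region_term = float(cosine_sim(text_emb, region_embs).max())
    return alpha * global_term + (1 - alpha) * region_term

# Toy usage: rank two images for one text query with random embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=512)
images = [
    (rng.normal(size=512), rng.normal(size=(5, 512))),  # (global, regions)
    (rng.normal(size=512), rng.normal(size=(8, 512))),
]
scores = [retrieval_score(text, g, r) for g, r in images]
print(np.argsort(scores)[::-1])  # image indices, best match first
```

The region `max` term lets a query such as "two small boats near a pier" match an image whose global embedding is dominated by water, which is one way structured representations can recover fine-grained detail that a single global vector averages away.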
