Prompting Beyond Retrieval with GRAD: A Generative Retrieval-Aligned Demonstrator for Robust Few-Shot Reasoning
Abstract
Large Language Models (LLMs) excel at reasoning but often struggle with multi-step tasks due to limited contextual understanding. While Retrieval-Augmented Generation (RAG) provides external context, it relies on static datasets and struggles to generalize across diverse queries. In this work, we propose the Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach in which an LLM is trained to generate concise, input-specific demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. GRAD is grounded in a generative strategy that allows adaptive prompting without external retrieval. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens per demonstration and the output response budget. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on tasks ranging from math reasoning to advanced STEM questions with Qwen2.5-14B, showcasing robust generalization to out-of-distribution (OOD) STEM tasks such as physics, chemistry, and computer science. Since demonstrations can be generated by a different, smaller model, GRAD may reduce training cost while maintaining similar accuracy. This work introduces a scalable demonstration generator model as a first step toward a new few-shot learning paradigm for constrained, limited-resource settings.
Method
We design a two-stage architecture:
- Demonstration Generator: An LLM is trained to generate concise, input-specific demonstrations for few-shot prompting.
- Answering Model: A frozen LLM receives the generated demonstrations together with the input query and produces the final answer.
This architecture decouples demonstration generation from reasoning, offering greater modularity and control.
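A minimal sketch of this pipeline, assuming Hugging Face `transformers`; `path/to/grad-generator` is a placeholder for the trained generator checkpoint, and the prompt templates are illustrative rather than the paper's exact formats:

```python
# Minimal sketch of the two-stage GRAD pipeline (model ids and prompts are illustrative).
from transformers import pipeline

# Stage 1: the trained demonstration generator (placeholder checkpoint path).
demo_generator = pipeline("text-generation", model="path/to/grad-generator")

# Stage 2: the frozen answering model; the paper evaluates with Qwen2.5-14B.
answerer = pipeline("text-generation", model="Qwen/Qwen2.5-14B-Instruct")

def answer_with_grad(question: str, demo_budget: int = 256, answer_budget: int = 512) -> str:
    # Generate an input-specific demonstration, capped at the demonstration token budget.
    demo_prompt = f"Write a concise worked example that helps solve:\n{question}\nDemonstration:"
    demo_out = demo_generator(demo_prompt, max_new_tokens=demo_budget, do_sample=False)
    demo = demo_out[0]["generated_text"][len(demo_prompt):].strip()

    # Prepend the demonstration to the query and let the frozen answerer respond.
    final_prompt = f"{demo}\n\nQuestion: {question}\nAnswer:"
    answer_out = answerer(final_prompt, max_new_tokens=answer_budget, do_sample=False)
    return answer_out[0]["generated_text"][len(final_prompt):].strip()
```

Because the answering model stays frozen, it can be swapped for any instruction-tuned LLM without retraining the generator.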
We explore three training strategies for the generator:
- SFT (Supervised Fine-Tuning): Teaches the Demonstration Generator to mimic retrieval-based RAG outputs and to format demonstrations correctly for few-shot prompting (a formatting sketch follows this list).
- Reinforcement Learning using PPO and GRPO: Directly optimizes the generation policy with task-specific reward signals (see the reward sketch below).
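One plausible way to build SFT targets, sketched here under our own assumptions (the exact data recipe is not specified in this summary), is to render top-k RAG retrievals in the few-shot layout the generator should learn to produce:

```python
# Hedged sketch: turning RAG retrievals into SFT targets for the generator.
# `retrieved` is assumed to hold (question, solution) pairs from a math corpus.

def format_sft_target(retrieved: list[tuple[str, str]], max_chars: int = 1200) -> str:
    """Render retrieved examples in the demonstration layout the generator must imitate."""
    blocks = [f"Question: {q}\nSolution: {sol}" for q, sol in retrieved]
    target = "\n\n".join(blocks)
    # Crude character-level truncation; a tokenizer-based cut is more faithful
    # to the paper's per-demonstration token budget.
    return target[:max_chars]

# Each SFT pair is then (input question, formatted demonstration string):
example_target = format_sft_target([("What is 3 * 17?", "3 * 17 = 51. Answer: 51")])
```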
Demonstrations are generated dynamically at inference time, without relying on retrieval corpora, and are optimized to fit within token budget constraints.
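For the RL stage, one natural reward shape (our hedged reconstruction, not the paper's published reward; the `answerer` callable and the containment check are illustrative) combines answer correctness from the frozen answerer with a penalty for exceeding the demonstration token budget:

```python
# Hedged sketch of a PPO/GRPO-style reward: correctness minus a budget penalty.
from typing import Callable

def grad_reward(demo: str, question: str, gold_answer: str,
                answerer: Callable[[str], str], tokenizer,
                budget: int = 256, overflow_penalty: float = 0.5) -> float:
    """Score one generated demonstration (illustrative reward terms)."""
    # Correctness: prompt the frozen answerer with the demonstration plus the query.
    prompt = f"{demo}\n\nQuestion: {question}\nAnswer:"
    prediction = answerer(prompt).strip()
    correctness = 1.0 if gold_answer in prediction else 0.0
    # Budget: penalize demonstrations that exceed the token budget.
    n_tokens = len(tokenizer.encode(demo))
    penalty = overflow_penalty if n_tokens > budget else 0.0
    return correctness - penalty
```

GRPO would compare such rewards across a group of sampled demonstrations for the same question, while PPO would use them with a learned value baseline.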
Comparison & Benchmarks
We evaluate GRAD against the following baselines:
- Retrieval-Augmented Generation (RAG): Demonstrations retrieved from a static corpus.
- Zero-Shot Prompting: No in-context examples provided.
Benchmarks include:
- In-domain reasoning tasks: e.g., GSM8K.
- Out-of-domain (OOD) STEM tasks: physics, chemistry, computer science, and symbolic logic.
GRAD consistently outperforms all baselines, particularly under strict constraints on input length and output budget.
Paper
Status: Submitted to EMNLP 2025.
Repository: A public GitHub repository will be released post-acceptance.