# Agentic RAG Aerospace: GRPO Fine-Tuned LoRA Adapter
A PEFT LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct fine-tuned with Group Relative Policy Optimization (GRPO) on aerospace research RAG tasks. This model powers the Agentic RAG Gym, an RL environment for training AI agents to research like experts.
## Quick Start: Usage with PEFT
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the GRPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "williyam/agentic-rag-aerospace-grpo")
tokenizer = AutoTokenizer.from_pretrained("williyam/agentic-rag-aerospace-grpo")

messages = [
    {"role": "system", "content": "You are an aerospace research assistant. Analyze the retrieved documents and provide a comprehensive technical answer."},
    {"role": "user", "content": "Compare scramjet and ramjet propulsion for hypersonic vehicles based on the retrieved documents."},
]

# Tokenize with the chat template, generate, and decode the completion.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
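For deployment, the adapter weights can optionally be merged into the base model so inference no longer needs the PEFT wrapper. This uses the standard PEFT `merge_and_unload()` API; the output directory name is illustrative:

```python
# Merge the LoRA weights into the base model for simpler, faster inference.
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer (path is illustrative).
merged_model.save_pretrained("agentic-rag-aerospace-merged")
tokenizer.save_pretrained("agentic-rag-aerospace-merged")
```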
## Model Description
This adapter was trained as part of Agentic RAG Gym, an open-source reinforcement learning framework where autonomous agents learn to research through multi-step retrieval, reasoning, critique, and verification, supervised by real domain graders.
The GRPO training loop uses domain-expert graders as reward signals rather than proxy rewards. Agents receive per-step feedback on retrieval relevance, reasoning quality, answer completeness, efficiency, and anti-reward-hacking penalties; the sketch below shows how such a grader plugs into the loop.
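As a minimal sketch of that wiring (assuming TRL's `GRPOTrainer`; `aerospace_grader` and `train_dataset` are illustrative placeholders, not names from this repo), a grader becomes a reward function that scores each sampled completion:

```python
from trl import GRPOConfig, GRPOTrainer

def grader_reward(prompts, completions, **kwargs):
    # Score each completion with the domain-expert grader (placeholder callable).
    # TRL expects one float per completion; higher means a better research step.
    return [aerospace_grader(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,                                   # the PEFT-wrapped base model
    reward_funcs=grader_reward,                    # grader score is the reward signal
    args=GRPOConfig(output_dir="grpo-aerospace"),  # output path is illustrative
    train_dataset=train_dataset,                   # prompts for aerospace research tasks
)
trainer.train()
```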
## Used by Agentic RAG Gym HF Space
This model is used by the Agentic RAG Gym HF Space as the default fine-tuned model for aerospace domain tasks. The Space provides:
- **Interactive Mode**: Step through research tasks manually
- **Auto Pilot**: Watch the agent research autonomously
- **Multi-Domain**: Switch between Aerospace and Legal Research
- **Real-Time Rewards**: See per-step reward breakdowns
## Training Details
### Configuration
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Training Method | GRPO (Group Relative Policy Optimization) |
| Library | TRL (Transformer Reinforcement Learning) |
| LoRA Rank (r) | 16 |
| LoRA Alpha (α) | 32 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj |
| Optimizer | AdamW (8-bit) |
| Learning Rate | 5e-6 |
| Epochs | 2 |
| Group Size | 4 |
| Max Completion Tokens | 512 |
| Training Time | ~116 minutes |
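A hedged reconstruction of this configuration in PEFT/TRL terms (hyperparameters taken from the table above; the actual training script may differ in details such as the exact optimizer string):

```python
from peft import LoraConfig
from trl import GRPOConfig

lora_config = LoraConfig(
    r=16,                                   # LoRA rank
    lora_alpha=32,                          # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

grpo_config = GRPOConfig(
    output_dir="grpo-aerospace",            # illustrative output path
    learning_rate=5e-6,
    num_train_epochs=2,
    num_generations=4,                      # GRPO group size
    max_completion_length=512,
    optim="adamw_bnb_8bit",                 # assumed bitsandbytes 8-bit AdamW variant
)
```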
### Reward Function
The composite reward combines five weighted components, each bounded to [0.01, 0.99] (a minimal weighting sketch follows the table):
| Component | Weight | Description |
|---|---|---|
| Retrieval Relevance | 25% | Relevance of retrieved documents to the query |
| Reasoning Quality | 20% | Evidence of logical analysis and coherent reasoning |
| Answer Completeness | 30% | Coverage of all required technical aspects |
| Efficiency | 15% | Task completion within reasonable steps |
| Anti-Hacking Penalty | 10% | Guards against repetition, keyword stuffing, degeneracy |
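In code, the weighted combination could look like the following minimal sketch, assuming the five component scores come from the graders (the anti-hacking component enters as a score, i.e. higher means fewer hacking signals):

```python
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hacking": 0.10,   # higher score = fewer reward-hacking signals
}

def composite_reward(scores: dict[str, float]) -> float:
    """Weighted sum of grader components, clamped to [0.01, 0.99]."""
    raw = sum(weight * scores[name] for name, weight in WEIGHTS.items())
    return min(max(raw, 0.01), 0.99)
```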
### Anti-Reward-Hacking
Built-in adversarial guards penalize the following (an illustrative detector follows the list):
- Repetitive sentences/phrases
- Keyword stuffing without coherent reasoning
- Degenerate outputs (empty, trivially short, nonsensical)
- Query manipulation designed to game relevance scores
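A minimal illustration of such a guard (the heuristics and thresholds here are chosen for illustration, not taken from the actual implementation):

```python
from collections import Counter

def hacking_penalty(text: str, min_chars: int = 40) -> float:
    """Return a penalty in [0, 1]; higher means stronger reward-hacking signals."""
    if len(text.strip()) < min_chars:
        return 1.0  # degenerate output: empty or trivially short

    # Repetition: fraction of sentences that duplicate an earlier one.
    sentences = [s.strip().lower() for s in text.split(".") if s.strip()]
    duplicates = sum(n - 1 for n in Counter(sentences).values())
    repetition = duplicates / max(len(sentences), 1)

    # Keyword-stuffing proxy: low ratio of unique tokens to total tokens.
    tokens = text.lower().split()
    stuffing = 1.0 - len(set(tokens)) / max(len(tokens), 1)

    return min(1.0, 0.6 * repetition + 0.4 * stuffing)
```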
## Evaluation Results
### Task-Level Comparison
| Task | Difficulty | Baseline | GRPO-Trained | Delta |
|---|---|---|---|---|
| Hypersonic Vehicle Design | Hard | 0.482 | 0.521 | +0.039 |
| Propulsion Comparison | Easy | 0.508 | 0.562 | +0.053 |
| Mars EDL Architecture | Medium | 0.574 | 0.568 | -0.006 |
| Life Support Design | Medium | 0.592 | 0.590 | -0.002 |
| Debris Mitigation | Easy | 0.633 | 0.689 | +0.056 |
| Mean | - | 0.558 | 0.586 | +0.028 |
The model improves consistently on easy and hard tasks (+0.053 to +0.056 on easy, +0.039 on hard), while medium-difficulty tasks stay essentially flat. This pattern suggests the model learned domain-specific retrieval and reasoning strategies rather than shallow pattern matching.
### Training Curves
Left: GRPO training loss over 80 steps, showing the characteristic sharp drops during policy updates followed by stabilization.
Right: Mean reward (grader score) over training, fluctuating between 0.3 and 0.7 with an overall upward trend.
### Baseline vs. GRPO-Trained
Side-by-side comparison across all five aerospace tasks. The GRPO-trained model (gold) matches or outperforms the baseline (dark red) on every task, with only negligible dips on the two medium tasks.
### Score Distribution
The GRPO-trained model's score distribution shifts rightward compared to baseline, indicating more consistently higher scores.
## Intended Use
This model is designed for agentic RAG tasks in aerospace research (a minimal research-step sketch follows the list). It is trained to:
- Generate effective retrieval queries for aerospace topics
- Reason over technical documents (propulsion, materials, orbital mechanics)
- Produce structured, evidence-grounded technical answers
- Self-critique and refine its research strategy
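To make that workflow concrete, here is a hedged sketch of a single retrieve-then-answer step. `search_corpus` is a hypothetical retriever and `generate` is a small helper wrapping the Quick Start generation code; neither name comes from this repo:

```python
def research_step(question: str, top_k: int = 4) -> str:
    # 1. Let the model draft a focused retrieval query for the aerospace corpus.
    query = generate([
        {"role": "system", "content": "Write one concise retrieval query for the question."},
        {"role": "user", "content": question},
    ])

    # 2. Retrieve supporting documents (search_corpus is a placeholder retriever).
    docs = search_corpus(query, top_k=top_k)

    # 3. Answer, grounded in the retrieved evidence.
    return generate([
        {"role": "system", "content": "You are an aerospace research assistant. "
            "Analyze the retrieved documents and provide a comprehensive technical answer."},
        {"role": "user", "content": f"Documents:\n{docs}\n\nQuestion: {question}"},
    ])
```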
## Links
| Resource | URL |
|---|---|
| HF Space (Live Demo) | Agentic RAG Gym |
| GitHub Repository | agentic-rag-gym |
| Training Notebook | Google Colab |
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Agentic RAG OS | Live Platform |
## Citation

```bibtex
@misc{agentic-rag-gym-2026,
  title={Agentic RAG Gym: An RL Framework for Training AI Research Agents},
  author={williyam},
  year={2026},
  url={https://github.com/williyam-m/agentic-rag-gym}
}
```
Built for the Meta × OpenEnv × Hugging Face × PyTorch Hackathon