# Agentic RAG Aerospace: GRPO Fine-Tuned LoRA Adapter
A PEFT LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct fine-tuned with Group Relative Policy Optimization (GRPO) on aerospace research RAG tasks. This model powers the Agentic RAG Gym, an RL environment for training AI agents to research like experts.
## Quick Start: Usage with PEFT
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the GRPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "williyam/agentic-rag-aerospace-grpo")
tokenizer = AutoTokenizer.from_pretrained("williyam/agentic-rag-aerospace-grpo")

messages = [
    {"role": "system", "content": "You are an aerospace research assistant. Analyze the retrieved documents and provide a comprehensive technical answer."},
    {"role": "user", "content": "Compare scramjet and ramjet propulsion for hypersonic vehicles based on the retrieved documents."},
]

# Tokenize with the chat template, generate, and decode the completion.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
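For deployment, the adapter weights can optionally be merged into the base model so inference no longer needs the PEFT wrapper. This uses the standard PEFT `merge_and_unload()` API; the output directory name is illustrative:

```python
# Merge the LoRA weights into the base model for simpler, faster inference.
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer (path is illustrative).
merged_model.save_pretrained("agentic-rag-aerospace-merged")
tokenizer.save_pretrained("agentic-rag-aerospace-merged")
```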
## Model Description
This adapter was trained as part of Agentic RAG Gym, an open-source reinforcement learning framework where autonomous agents learn to research through multi-step retrieval, reasoning, critique, and verification, supervised by real domain graders.
The GRPO training loop uses domain-expert graders as reward signals rather than proxy rewards. Agents receive per-step feedback on retrieval relevance, reasoning quality, answer completeness, efficiency, and anti-reward-hacking penalties; the sketch below shows how such a grader plugs into the loop.
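As a minimal sketch of that wiring (assuming TRL's `GRPOTrainer`; `aerospace_grader` and `train_dataset` are illustrative placeholders, not names from this repo), a grader becomes a reward function that scores each sampled completion:

```python
from trl import GRPOConfig, GRPOTrainer

def grader_reward(prompts, completions, **kwargs):
    # Score each completion with the domain-expert grader (placeholder callable).
    # TRL expects one float per completion; higher means a better research step.
    return [aerospace_grader(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,                                   # the PEFT-wrapped base model
    reward_funcs=grader_reward,                    # grader score is the reward signal
    args=GRPOConfig(output_dir="grpo-aerospace"),  # output path is illustrative
    train_dataset=train_dataset,                   # prompts for aerospace research tasks
)
trainer.train()
```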
## Used by Agentic RAG Gym HF Space
This model is used by the Agentic RAG Gym HF Space as the default fine-tuned model for aerospace domain tasks. The Space provides:
- **Interactive Mode**: Step through research tasks manually
- **Auto Pilot**: Watch the agent research autonomously
- **Multi-Domain**: Switch between Aerospace and Legal Research
- **Real-Time Rewards**: See per-step reward breakdowns
## Training Details
### Configuration
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Training Method | GRPO (Group Relative Policy Optimization) |
| Library | TRL (Transformer Reinforcement Learning) |
| LoRA Rank (r) | 16 |
| LoRA Alpha (α) | 32 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj |
| Optimizer | AdamW (8-bit) |
| Learning Rate | 5e-6 |
| Epochs | 2 |
| Group Size | 4 |
| Max Completion Tokens | 512 |
| Training Time | ~116 minutes |
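A hedged reconstruction of this configuration in PEFT/TRL terms (hyperparameters taken from the table above; the actual training script may differ in details such as the exact optimizer string):

```python
from peft import LoraConfig
from trl import GRPOConfig

lora_config = LoraConfig(
    r=16,                                   # LoRA rank
    lora_alpha=32,                          # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

grpo_config = GRPOConfig(
    output_dir="grpo-aerospace",            # illustrative output path
    learning_rate=5e-6,
    num_train_epochs=2,
    num_generations=4,                      # GRPO group size
    max_completion_length=512,
    optim="adamw_bnb_8bit",                 # assumed bitsandbytes 8-bit AdamW variant
)
```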
### Reward Function
The composite reward combines five weighted components, each bounded to [0.01, 0.99] (a minimal weighting sketch follows the table):
| Component | Weight | Description |
|---|---|---|
| Retrieval Relevance | 25% | Relevance of retrieved documents to the query |
| Reasoning Quality | 20% | Evidence of logical analysis and coherent reasoning |
| Answer Completeness | 30% | Coverage of all required technical aspects |
| Efficiency | 15% | Task completion within reasonable steps |
| Anti-Hacking Penalty | 10% | Guards against repetition, keyword stuffing, degeneracy |
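In code, the weighted combination could look like the following minimal sketch, assuming the five component scores come from the graders (the anti-hacking component enters as a score, i.e. higher means fewer hacking signals):

```python
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hacking": 0.10,   # higher score = fewer reward-hacking signals
}

def composite_reward(scores: dict[str, float]) -> float:
    """Weighted sum of grader components, clamped to [0.01, 0.99]."""
    raw = sum(weight * scores[name] for name, weight in WEIGHTS.items())
    return min(max(raw, 0.01), 0.99)
```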
### Anti-Reward-Hacking
Built-in adversarial guards penalize the following (an illustrative detector follows the list):
- Repetitive sentences/phrases
- Keyword stuffing without coherent reasoning
- Degenerate outputs (empty, trivially short, nonsensical)
- Query manipulation designed to game relevance scores
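A minimal illustration of such a guard (the heuristics and thresholds here are chosen for illustration, not taken from the actual implementation):

```python
from collections import Counter

def hacking_penalty(text: str, min_chars: int = 40) -> float:
    """Return a penalty in [0, 1]; higher means stronger reward-hacking signals."""
    if len(text.strip()) < min_chars:
        return 1.0  # degenerate output: empty or trivially short

    # Repetition: fraction of sentences that duplicate an earlier one.
    sentences = [s.strip().lower() for s in text.split(".") if s.strip()]
    duplicates = sum(n - 1 for n in Counter(sentences).values())
    repetition = duplicates / max(len(sentences), 1)

    # Keyword-stuffing proxy: low ratio of unique tokens to total tokens.
    tokens = text.lower().split()
    stuffing = 1.0 - len(set(tokens)) / max(len(tokens), 1)

    return min(1.0, 0.6 * repetition + 0.4 * stuffing)
```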
## Evaluation Results
### Task-Level Comparison
| Task | Difficulty | Baseline | GRPO-Trained | Delta |
|---|---|---|---|---|
| Hypersonic Vehicle Design | Hard | 0.482 | 0.521 | +0.039 |
| Propulsion Comparison | Easy | 0.508 | 0.562 | +0.053 |
| Mars EDL Architecture | Medium | 0.574 | 0.568 | -0.006 |
| Life Support Design | Medium | 0.592 | 0.590 | -0.002 |
| Debris Mitigation | Easy | 0.633 | 0.689 | +0.056 |
| Mean | - | 0.558 | 0.586 | +0.028 |
The model improves consistently on easy and hard tasks (+0.053 to +0.056 on easy, +0.039 on hard), while medium-difficulty tasks stay essentially flat. This pattern suggests the model learned domain-specific retrieval and reasoning strategies rather than shallow pattern matching.
### Training Curves
Left: GRPO training loss over 80 steps, showing the characteristic sharp drops during policy updates followed by stabilization.
Right: Mean reward (grader score) over training, fluctuating between 0.3 and 0.7 with an overall upward trend.
### Baseline vs. GRPO-Trained
Side-by-side comparison across all five aerospace tasks. The GRPO-trained model (gold) matches or outperforms the baseline (dark red) on every task, with only negligible dips on the two medium tasks.
### Score Distribution
The GRPO-trained model's score distribution shifts rightward compared to baseline, indicating more consistently higher scores.
## Intended Use
This model is designed for agentic RAG tasks in aerospace research (a minimal research-step sketch follows the list). It is trained to:
- Generate effective retrieval queries for aerospace topics
- Reason over technical documents (propulsion, materials, orbital mechanics)
- Produce structured, evidence-grounded technical answers
- Self-critique and refine its research strategy
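To make that workflow concrete, here is a hedged sketch of a single retrieve-then-answer step. `search_corpus` is a hypothetical retriever and `generate` is a small helper wrapping the Quick Start generation code; neither name comes from this repo:

```python
def research_step(question: str, top_k: int = 4) -> str:
    # 1. Let the model draft a focused retrieval query for the aerospace corpus.
    query = generate([
        {"role": "system", "content": "Write one concise retrieval query for the question."},
        {"role": "user", "content": question},
    ])

    # 2. Retrieve supporting documents (search_corpus is a placeholder retriever).
    docs = search_corpus(query, top_k=top_k)

    # 3. Answer, grounded in the retrieved evidence.
    return generate([
        {"role": "system", "content": "You are an aerospace research assistant. "
            "Analyze the retrieved documents and provide a comprehensive technical answer."},
        {"role": "user", "content": f"Documents:\n{docs}\n\nQuestion: {question}"},
    ])
```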
## Links
| Resource | URL |
|---|---|
| HF Space (Live Demo) | Agentic RAG Gym |
| GitHub Repository | agentic-rag-gym |
| Training Notebook | Google Colab |
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Agentic RAG OS | Live Platform |
## Citation

```bibtex
@misc{agentic-rag-gym-2026,
  title={Agentic RAG Gym: An RL Framework for Training AI Research Agents},
  author={williyam},
  year={2026},
  url={https://github.com/williyam-m/agentic-rag-gym}
}
```
Built for the Meta × OpenEnv × Hugging Face × PyTorch Hackathon