Agentic RAG Aerospace: GRPO Fine-Tuned LoRA Adapter

A PEFT LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct fine-tuned with Group Relative Policy Optimization (GRPO) on aerospace research RAG tasks. This model powers the Agentic RAG Gym, an RL environment for training AI agents to research like experts.

Quick Start: Usage with PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the GRPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "williyam/agentic-rag-aerospace-grpo")
tokenizer = AutoTokenizer.from_pretrained("williyam/agentic-rag-aerospace-grpo")

messages = [
    {"role": "system", "content": "You are an aerospace research assistant. Analyze the retrieved documents and provide a comprehensive technical answer."},
    {"role": "user", "content": "Compare scramjet and ramjet propulsion for hypersonic vehicles based on the retrieved documents."}
]

# Apply the chat template and generate a response.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
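
To serve the model without the PEFT wrapper, the LoRA weights can be merged into the base model with PEFT's standard merge API. A minimal sketch; the output path is just an example:

```python
# Fold the adapter weights into the base model for standalone serving.
merged = model.merge_and_unload()
merged.save_pretrained("agentic-rag-aerospace-merged")   # example path
tokenizer.save_pretrained("agentic-rag-aerospace-merged")
```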

Model Description

This adapter was trained as part of Agentic RAG Gym, an open-source reinforcement learning framework where autonomous agents learn to research through multi-step retrieval, reasoning, critique, and verification, supervised by real domain graders.

The GRPO training loop uses domain-expert graders as the reward signal rather than proxy rewards: agents receive per-step feedback on retrieval relevance, reasoning quality, answer completeness, efficiency, and anti-reward-hacking penalties.

Used by Agentic RAG Gym HF Space

This model is used by the Agentic RAG Gym HF Space as the default fine-tuned model for aerospace domain tasks. The Space provides:

  • Interactive Mode: Step through research tasks manually
  • Auto Pilot: Watch the agent research autonomously
  • Multi-Domain: Switch between Aerospace and Legal Research
  • Real-Time Rewards: See per-step reward breakdowns

Training Details

Configuration

| Parameter | Value |
| --- | --- |
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Training Method | GRPO (Group Relative Policy Optimization) |
| Library | TRL (Transformer Reinforcement Learning) |
| LoRA Rank (r) | 16 |
| LoRA Alpha (α) | 32 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj |
| Optimizer | AdamW (8-bit) |
| Learning Rate | 5e-6 |
| Epochs | 2 |
| Group Size | 4 |
| Max Completion Tokens | 512 |
| Training Time | ~116 minutes |
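
As a rough illustration of how these hyperparameters map onto TRL and PEFT, here is a minimal sketch, assuming a recent TRL release with GRPOTrainer. The dataset and reward function below are toy placeholders, not the actual aerospace tasks or graders:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset standing in for the aerospace RAG tasks.
train_dataset = Dataset.from_dict({
    "prompt": ["Compare scramjet and ramjet propulsion for hypersonic vehicles."],
})

def toy_reward(completions, **kwargs):
    # Placeholder for the grader-based composite reward described below.
    return [min(len(c.split()) / 100, 0.99) for c in completions]

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="grpo-aerospace",      # assumed output path
    learning_rate=5e-6,
    num_train_epochs=2,
    num_generations=4,                # GRPO group size
    max_completion_length=512,
    optim="adamw_bnb_8bit",           # 8-bit AdamW via bitsandbytes
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=toy_reward,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```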

Reward Function

The composite reward combines five components, each bounded to [0.01, 0.99] (a weighting sketch follows the table):

| Component | Weight | Description |
| --- | --- | --- |
| Retrieval Relevance | 25% | Relevance of retrieved documents to the query |
| Reasoning Quality | 20% | Evidence of logical analysis and coherent reasoning |
| Answer Completeness | 30% | Coverage of all required technical aspects |
| Efficiency | 15% | Task completion within reasonable steps |
| Anti-Hacking Penalty | 10% | Guards against repetition, keyword stuffing, degeneracy |
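
A minimal sketch of the weighting above; the per-component scores are assumed to come from the domain graders, and the function below is illustrative, not the exact implementation used in training:

```python
# Component weights from the table above.
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hacking": 0.10,
}

def composite_reward(scores: dict) -> float:
    # Clamp each grader score to [0.01, 0.99], then take the weighted sum.
    clamped = {k: min(max(v, 0.01), 0.99) for k, v in scores.items()}
    return sum(WEIGHTS[k] * clamped[k] for k in WEIGHTS)

# e.g. composite_reward({"retrieval_relevance": 0.8, "reasoning_quality": 0.7,
#                        "answer_completeness": 0.6, "efficiency": 0.9,
#                        "anti_hacking": 0.95}) == 0.75
```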

Anti-Reward-Hacking

Built-in adversarial guards (sketched after this list) penalize:

  • Repetitive sentences/phrases
  • Keyword stuffing without coherent reasoning
  • Degenerate outputs (empty, trivially short, nonsensical)
  • Query manipulation designed to game relevance scores
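
A minimal sketch of what such guards might look like; the thresholds and heuristics here are illustrative assumptions, not the trained values:

```python
import re
from collections import Counter

def anti_hacking_score(text: str) -> float:
    """Guard score in [0.01, 0.99]; lower means more gaming detected."""
    words = text.split()
    # Degenerate outputs: empty or trivially short completions hit the floor.
    if len(words) < 20:                # assumed minimum length
        return 0.01
    score = 0.99
    # Repetitive sentences: penalize duplicated sentences.
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    if sentences:
        dup_ratio = 1.0 - len(set(sentences)) / len(sentences)
        score -= 0.5 * dup_ratio
    # Keyword stuffing: penalize when a single token dominates the output.
    top_count = Counter(w.lower() for w in words).most_common(1)[0][1]
    if top_count / len(words) > 0.15:  # assumed dominance threshold
        score -= 0.3
    return min(max(score, 0.01), 0.99)
```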

Evaluation Results

Task-Level Comparison

| Task | Difficulty | Baseline | GRPO-Trained | Delta |
| --- | --- | --- | --- | --- |
| Hypersonic Vehicle Design | Hard | 0.482 | 0.521 | +0.039 |
| Propulsion Comparison | Easy | 0.508 | 0.562 | +0.053 |
| Mars EDL Architecture | Medium | 0.574 | 0.568 | -0.006 |
| Life Support Design | Medium | 0.592 | 0.590 | -0.002 |
| Debris Mitigation | Easy | 0.633 | 0.689 | +0.056 |
| Mean | – | 0.558 | 0.586 | +0.028 |

The model improves consistently on both easy tasks (+0.053 and +0.056) and the hard task (+0.039), while the two medium-difficulty tasks remain essentially flat (-0.002 and -0.006). This pattern suggests the model learned domain-specific retrieval and reasoning strategies rather than shallow pattern matching.

Training Curves

[Figure: training loss (left) and mean reward (right) over training]

Left: GRPO training loss over 80 steps, showing the characteristic sharp drops during policy updates followed by stabilization.
Right: Mean reward (grader score) over training, fluctuating between 0.3 and 0.7 with an overall upward trend.

Baseline vs. GRPO-Trained

[Figure: baseline vs. GRPO-trained scores per task]

Side-by-side comparison across all five aerospace tasks. The GRPO-trained model (gold) outperforms the baseline (dark red) on three of the five tasks and trails by at most 0.006 on the two medium tasks.

Score Distribution

[Figure: score distributions for baseline and GRPO-trained models]

The GRPO-trained model's score distribution shifts rightward compared to baseline, indicating more consistently higher scores.

Intended Use

This model is designed for agentic RAG tasks in aerospace research. It's trained to:

  • Generate effective retrieval queries for aerospace topics (see the sketch after this list)
  • Reason over technical documents (propulsion, materials, orbital mechanics)
  • Produce structured, evidence-grounded technical answers
  • Self-critique and refine its research strategy
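
For the query-generation step, the chat-template pattern from the Quick Start applies directly. This sketch reuses `model` and `tokenizer` from above; the prompt wording is illustrative, not the exact prompts used by the Gym:

```python
messages = [
    {"role": "system", "content": "You are an aerospace research assistant. Propose one focused retrieval query for the task below."},
    {"role": "user", "content": "Task: assess thermal protection system options for Mars entry vehicles."},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
output_ids = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens (the proposed retrieval query).
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```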

Links

| Resource | URL |
| --- | --- |
| HF Space (Live Demo) | Agentic RAG Gym |
| GitHub Repository | agentic-rag-gym |
| Training Notebook | Google Colab |
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Agentic RAG OS | Live Platform |

Citation

```bibtex
@misc{agentic-rag-gym-2026,
  title={Agentic RAG Gym: An RL Framework for Training AI Research Agents},
  author={williyam},
  year={2026},
  url={https://github.com/williyam-m/agentic-rag-gym}
}
```

Built for the Meta × OpenEnv × Hugging Face × PyTorch Hackathon
