# Qwen2.5-0.5B-Math-OREO

A mathematical reasoning model built on Qwen2.5-0.5B-Instruct, trained with Supervised Fine-Tuning (SFT) followed by OREO offline reinforcement learning.
## Model Description

This model is designed for grade-school math word problems with step-by-step reasoning. It uses a structured thinking format with `<think>` tags for chain-of-thought reasoning.
## Training Pipeline
- Supervised Fine-Tuning (SFT): Fine-tuned on ~60K math problems
- OREO Offline RL: Post-trained using the OREO (Offline REasoning Optimization) algorithm
## Performance
| Benchmark | Metric | Score |
|---|---|---|
| GSM8K | 8-shot (lm-harness) | 46.4% |
For comparison, the base Qwen2.5-0.5B-Instruct achieves 41.6% under the same evaluation setup.
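The score above comes from an 8-shot lm-evaluation-harness run. A command along the following lines should reproduce it; the exact flags assume lm-evaluation-harness v0.4+, and the batch size here is an arbitrary choice, not part of the reported setup:

```shell
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=golfoscar/qwen2.5-0.5b-math-oreo,trust_remote_code=True \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size 8
```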
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "golfoscar/qwen2.5-0.5b-math-oreo"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Format the prompt with the system instructions the model was trained on
question = "Sarah has 5 apples. She buys 3 more apples and gives 2 to her friend. How many apples does Sarah have now?"
messages = [
    {"role": "system", "content": """You are a math problem solver. Follow these steps:
1. Solve the problem inside <think> tags.
2. Break down the solution into numbered steps (1., 2., etc.).
3. Write out every calculation explicitly (e.g., 2 + 3 = 5).
4. Output the final numerical answer after #### following the </think> tag.
Output format: <think>reasoning</think>
#### Number"""},
    {"role": "user", "content": question}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens; keep special tokens so the
# <think>...</think> structure is visible in the output.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(response)
```
## Expected Output Format

```
<think>
Sarah starts with 5 apples.
She buys 3 more, so she has 5 + 3 = 8 apples.
Then she gives 2 apples to her friend, leaving her with 8 - 2 = 6 apples.
</think>
#### 6
```
## Special Tokens

This model uses custom special tokens for structured reasoning:

- `<think>`: start of the reasoning section
- `</think>`: end of the reasoning section

The final numerical answer appears after `####`, outside the `</think>` tag.
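Because the final answer always follows `####`, it can be extracted programmatically, e.g. for automated scoring. A minimal sketch; the helper name and regex are illustrative, not part of the model's API:

```python
import re

def extract_answer(response):
    """Pull the numeric answer that follows '####' after the </think> tag."""
    # Only inspect text after the reasoning section, if one is present.
    tail = response.split("</think>")[-1]
    match = re.search(r"####\s*(-?\d+(?:\.\d+)?)", tail)
    return match.group(1) if match else None

sample = "<think>\n5 + 3 = 8, then 8 - 2 = 6.\n</think>\n#### 6"
print(extract_answer(sample))  # 6
```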
## Training Details

### Supervised Fine-Tuning (SFT)
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B-Instruct |
| Dataset Size | ~60,000 samples |
| Dataset Composition | Orca-Math (60%), GSM8K (12%), MetaMath (28%) |
| Learning Rate | 2e-5 |
| Batch Size | 8 per GPU × 8 GPUs |
| Gradient Accumulation | 4 |
| Epochs | 3 |
| Max Sequence Length | 2048 |
| Optimizer | AdamW |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 0.03 |
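The effective batch size and rough step counts follow directly from the table above; a quick sanity check, assuming all ~60,000 samples are seen each epoch:

```python
# Effective batch size: per-GPU batch x GPU count x gradient accumulation
per_gpu, gpus, grad_accum = 8, 8, 4
effective_batch = per_gpu * gpus * grad_accum   # 8 * 8 * 4 = 256

# Approximate optimizer steps and warmup steps
samples, epochs, warmup_ratio = 60_000, 3, 0.03
steps_per_epoch = samples // effective_batch    # ~234
total_steps = steps_per_epoch * epochs          # ~702
warmup_steps = int(total_steps * warmup_ratio)  # ~21

print(effective_batch, total_steps, warmup_steps)
```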
### OREO Offline RL
| Parameter | Value |
|---|---|
| Algorithm | OREO (Offline REasoning Optimization) |
| Offline Dataset | Trajectories sampled from SFT model |
| Reward | Binary (1.0 for correct, 0.0 for incorrect) |
| β (Soft Bellman Temperature) | 1.5 |
| KL Coefficient | 0.5 |
| Policy Loss Weight | 0.05 |
| Learning Rate (Policy) | 1e-6 |
| Learning Rate (Value Head) | 1e-5 |
| Batch Size | 1 per GPU × 8 GPUs × 16 gradient accumulation |
| Effective Batch Size | 128 |
| Value Head Architecture | MLP (hidden_size → 256 → 1) |
| Epochs | 1 |
| Optimizer | AdamW |
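To make the hyperparameters above concrete, here is a highly simplified, illustrative reading of OREO's soft-Bellman objective: each step's residual compares the value estimate against the reward, the next value, and a KL-regularized log-probability ratio. This is a toy sketch, not the paper's exact loss; the function, the trajectory numbers, and the loss combination are all invented for illustration:

```python
def oreo_residuals(rewards, values, logp, ref_logp, beta=1.5):
    """Per-step soft-Bellman residuals (sketch):
    e_t = r_t + V(s_{t+1}) - V(s_t) - beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)).
    The value after the final step is taken to be 0."""
    residuals = []
    for t in range(len(rewards)):
        v_next = values[t + 1] if t + 1 < len(rewards) else 0.0
        kl_term = beta * (logp[t] - ref_logp[t])
        residuals.append(rewards[t] + v_next - values[t] - kl_term)
    return residuals

# Toy 3-step trajectory: sparse binary reward at the final step, beta = 1.5 as in the table.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]
logp = [-1.1, -0.9, -0.4]
ref_logp = [-1.2, -0.7, -0.7]

res = oreo_residuals(rewards, values, logp, ref_logp)
value_loss = sum(e * e for e in res) / len(res)  # mean squared residual trains the value head
policy_loss = 0.05 * value_loss                  # policy term scaled by the 0.05 weight (sketch)
print(value_loss)
```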
## Limitations
- Optimized for grade-school level math (GSM8K style)
- May struggle with advanced mathematics, calculus, or abstract algebra
- English only
- Small model size (0.5B parameters) limits complex reasoning
## Citation

If you use this model, please cite the OREO paper:

```bibtex
@article{wang2024oreo,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Wang, Jianhao and others},
  journal={arXiv preprint},
  year={2024}
}
```
## License
This model is released under the Apache 2.0 license, following the base Qwen2.5 model license.