Qwen2.5-0.5B-Math-OREO

A mathematical reasoning model based on Qwen2.5-0.5B-Instruct, enhanced through supervised fine-tuning (SFT) followed by OREO offline reinforcement learning.

Model Description

This model is designed for grade-school math word problems with step-by-step reasoning. It uses a structured thinking format with <think> tags for chain-of-thought reasoning.

Training Pipeline

  1. Supervised Fine-Tuning (SFT): Fine-tuned on ~60K math problems
  2. OREO Offline RL: Post-trained using the OREO (Offline REasoning Optimization) algorithm

Performance

| Benchmark | Metric | Score |
|---|---|---|
| GSM8K | 8-shot (lm-evaluation-harness) | 46.4% |

For comparison, the base Qwen2.5-0.5B-Instruct achieves 41.6% on GSM8K, so this model improves by 4.8 points.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "golfoscar/qwen2.5-0.5b-math-oreo"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Format your prompt
question = "Sarah has 5 apples. She buys 3 more apples and gives 2 to her friend. How many apples does Sarah have now?"

messages = [
    {"role": "system", "content": """You are a math problem solver. Follow these steps:
1. Solve the problem inside <think> tags.
2. Break down the solution into numbered steps (1., 2., etc.).
3. Write out every calculation explicitly (e.g., 2 + 3 = 5).
4. Output the final numerical answer after #### following the </think> tag.
Output format: <think>reasoning</think>
#### Number"""},
    {"role": "user", "content": question}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens; keep special tokens so the <think> tags stay visible
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(response)

Expected Output Format

<think>
Sarah starts with 5 apples.
She buys 3 more, so she has 5 + 3 = 8 apples.
Then she gives 2 apples to her friend, leaving her with 8 - 2 = 6 apples.
</think>
#### 6

Special Tokens

This model uses custom special tokens for structured reasoning:

  • <think> - Start of reasoning section
  • </think> - End of reasoning section

The final answer should appear after ####.

Training Details

Supervised Fine-Tuning (SFT)

| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B-Instruct |
| Dataset Size | ~60,000 samples |
| Dataset Composition | Orca-Math (60%), GSM8K (12%), MetaMath (28%) |
| Learning Rate | 2e-5 |
| Batch Size | 8 per GPU × 8 GPUs |
| Gradient Accumulation | 4 |
| Epochs | 3 |
| Max Sequence Length | 2048 |
| Optimizer | AdamW |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 0.03 |
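The dataset composition above implies roughly the following per-source sample counts. This is a back-of-the-envelope calculation assuming exactly 60,000 samples, not the exact released split:

```python
# Approximate per-source counts from the stated mixture ratios
total = 60_000
mix = {"Orca-Math": 0.60, "GSM8K": 0.12, "MetaMath": 0.28}
counts = {name: round(total * frac) for name, frac in mix.items()}
print(counts)  # {'Orca-Math': 36000, 'GSM8K': 7200, 'MetaMath': 16800}
```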

OREO Offline RL

| Parameter | Value |
|---|---|
| Algorithm | OREO (Offline REasoning Optimization) |
| Offline Dataset | Trajectories sampled from SFT model |
| Reward | Binary (1.0 for correct, 0.0 for incorrect) |
| β (Soft Bellman Temperature) | 1.5 |
| KL Coefficient | 0.5 |
| Policy Loss Weight | 0.05 |
| Learning Rate (Policy) | 1e-6 |
| Learning Rate (Value Head) | 1e-5 |
| Batch Size | 1 per GPU × 8 GPUs × 16 gradient accumulation |
| Effective Batch Size | 128 |
| Value Head Architecture | MLP (hidden_size → 256 → 1) |
| Epochs | 1 |
| Optimizer | AdamW |
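The value head described above (an MLP mapping hidden_size → 256 → 1) can be sketched in PyTorch as follows. The class name, the ReLU activation, and the use of per-token hidden states are illustrative assumptions, not the released training code; 896 is the hidden size of Qwen2.5-0.5B:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Illustrative value head MLP: hidden_size -> 256 -> 1 (activation assumed)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> per-token scalar value (batch, seq_len)
        return self.mlp(hidden_states).squeeze(-1)

head = ValueHead(hidden_size=896)
values = head(torch.randn(2, 16, 896))
print(values.shape)  # torch.Size([2, 16])
```

During OREO training such a head would be optimized at its own learning rate (1e-5 in the table) alongside the policy.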

Limitations

  • Optimized for grade-school level math (GSM8K style)
  • May struggle with advanced mathematics, calculus, or abstract algebra
  • English only
  • Small model size (0.5B parameters) limits complex reasoning

Citation

If you use this model, please cite the OREO paper:

@article{wang2024oreo,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Wang, Jianhao and others},
  journal={arXiv preprint},
  year={2024}
}

License

This model is released under the Apache 2.0 license, following the base Qwen2.5 model license.
