Qwen2.5-0.5B-Math-OREO

A mathematical reasoning model based on Qwen2.5-0.5B-Instruct, enhanced through supervised fine-tuning (SFT) followed by OREO offline reinforcement learning.

Model Description

This model is designed for grade-school math word problems with step-by-step reasoning. It uses a structured thinking format with <think> tags for chain-of-thought reasoning.

Training Pipeline

  1. Supervised Fine-Tuning (SFT): Fine-tuned on ~60K math problems
  2. OREO Offline RL: Post-trained using the OREO (Offline REasoning Optimization) algorithm

Performance

| Benchmark | Metric | Score |
|---|---|---|
| GSM8K | 8-shot (lm-evaluation-harness) | 46.4% |

For comparison, the base Qwen2.5-0.5B-Instruct achieves 41.6% on GSM8K, so this model improves by 4.8 points.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "golfoscar/qwen2.5-0.5b-math-oreo"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Format your prompt
question = "Sarah has 5 apples. She buys 3 more apples and gives 2 to her friend. How many apples does Sarah have now?"

messages = [
    {"role": "system", "content": """You are a math problem solver. Follow these steps:
1. Solve the problem inside <think> tags.
2. Break down the solution into numbered steps (1., 2., etc.).
3. Write out every calculation explicitly (e.g., 2 + 3 = 5).
4. Output the final numerical answer after #### following the </think> tag.
Output format: <think>reasoning</think>
#### Number"""},
    {"role": "user", "content": question}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens; keep special tokens so the <think> tags stay visible
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(response)

Expected Output Format

<think>
Sarah starts with 5 apples.
She buys 3 more, so she has 5 + 3 = 8 apples.
Then she gives 2 apples to her friend, leaving her with 8 - 2 = 6 apples.
</think>
#### 6

Special Tokens

This model uses custom special tokens for structured reasoning:

  • <think> - Start of reasoning section
  • </think> - End of reasoning section

The final answer should appear after ####.

Training Details

Supervised Fine-Tuning (SFT)

| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-0.5B-Instruct |
| Dataset Size | ~60,000 samples |
| Dataset Composition | Orca-Math (60%), GSM8K (12%), MetaMath (28%) |
| Learning Rate | 2e-5 |
| Batch Size | 8 per GPU × 8 GPUs |
| Gradient Accumulation | 4 |
| Epochs | 3 |
| Max Sequence Length | 2048 |
| Optimizer | AdamW |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 0.03 |
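The dataset composition above implies roughly the following per-source sample counts. This is a back-of-the-envelope calculation assuming exactly 60,000 samples, not the exact released split:

```python
# Approximate per-source counts from the stated mixture ratios
total = 60_000
mix = {"Orca-Math": 0.60, "GSM8K": 0.12, "MetaMath": 0.28}
counts = {name: round(total * frac) for name, frac in mix.items()}
print(counts)  # {'Orca-Math': 36000, 'GSM8K': 7200, 'MetaMath': 16800}
```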

OREO Offline RL

| Parameter | Value |
|---|---|
| Algorithm | OREO (Offline REasoning Optimization) |
| Offline Dataset | Trajectories sampled from SFT model |
| Reward | Binary (1.0 for correct, 0.0 for incorrect) |
| β (Soft Bellman Temperature) | 1.5 |
| KL Coefficient | 0.5 |
| Policy Loss Weight | 0.05 |
| Learning Rate (Policy) | 1e-6 |
| Learning Rate (Value Head) | 1e-5 |
| Batch Size | 1 per GPU × 8 GPUs × 16 gradient accumulation |
| Effective Batch Size | 128 |
| Value Head Architecture | MLP (hidden_size → 256 → 1) |
| Epochs | 1 |
| Optimizer | AdamW |
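The value head described above (an MLP mapping hidden_size → 256 → 1) can be sketched in PyTorch as follows. The class name, the ReLU activation, and the use of per-token hidden states are illustrative assumptions, not the released training code; 896 is the hidden size of Qwen2.5-0.5B:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Illustrative value head MLP: hidden_size -> 256 -> 1 (activation assumed)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> per-token scalar value (batch, seq_len)
        return self.mlp(hidden_states).squeeze(-1)

head = ValueHead(hidden_size=896)
values = head(torch.randn(2, 16, 896))
print(values.shape)  # torch.Size([2, 16])
```

During OREO training such a head would be optimized at its own learning rate (1e-5 in the table) alongside the policy.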

Limitations

  • Optimized for grade-school level math (GSM8K style)
  • May struggle with advanced mathematics, calculus, or abstract algebra
  • English only
  • Small model size (0.5B parameters) limits complex reasoning

Citation

If you use this model, please cite the OREO paper:

@article{wang2024oreo,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Wang, Jianhao and others},
  journal={arXiv preprint},
  year={2024}
}

License

This model is released under the Apache 2.0 license, following the base Qwen2.5 model license.
