🌟 Jackrong/Qwen3.5-9B-Neo

📢 Announcement

Neo Update: This iteration targets measurable gains in reasoning and mathematical performance while keeping general accuracy competitive.

Neo introduces an optimized reasoning scaffold designed to cut redundant internal loops and circular reasoning. Unlike standard models that simply think longer when faced with difficult tasks, Neo is built to think smarter, not longer. Evaluated on the LM Evaluation Harness leaderboard suite, it improves on BBH (+0.87 pp), MATH Hard (+0.98 pp), and MUSR (+2.91 pp), the benchmarks that most directly probe structured multi-step reasoning and logical deduction.

💡 Model Introduction

Jackrong/Qwen3.5-9B-Neo is a reasoning-focused fine-tune of the Qwen3.5-9B model. Its primary objective is to improve the quality of chain-of-thought generation, with a particular focus on harder reasoning and mathematical tasks, while remaining broadly competitive across general academic benchmarks.

The goal of this Neo model is not simply to make the model "think more," but to help it think in a more structured way: cutting unnecessary verbose over-analysis, anchoring intermediate steps, and improving multi-hop logical consistency. Based on the LM Eval Harness leaderboard evaluation (conducted by community member selimaktas), the 9B-Neo improves on three of the four sub-benchmarks (BBH, MATH Hard, and MUSR), with a notable +2.91 pp gain on MUSR (multistep soft reasoning) and +0.98 pp on MATH Hard. The fourth, GPQA, regresses by 3.10 pp.

LM Eval Harness Benchmark Results 🪐

Group / Task	Metric	Qwen3.5-9B	Qwen3.5-9B-Neo	Δ
leaderboard_bbh	acc_norm ↑	0.6190	0.6277	+0.87 pp
leaderboard_gpqa	acc_norm ↑	0.4446	0.4136	−3.10 pp
leaderboard_math_hard	exact_match ↑	0.3965	0.4063	+0.98 pp
leaderboard_musr	acc_norm ↑	0.4339	0.4630	+2.91 pp

📌 Note on IFEval: The instruction-following metrics (IFEval prompt/inst level) show a regression in this version. This is a known trade-off from the current training pipeline's emphasis on structured reasoning over format-following, and will be an area of focus in future iterations.

Evaluation conducted by community member selimaktas using the LM Evaluation Harness framework. March 2026.
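
The Δ column can be reproduced directly from the two score columns. The snippet below is a minimal sketch of that arithmetic: the scores are copied from the table above, while the names `SCORES`, `delta_pp`, and `DELTAS` are illustrative and not part of any evaluation tooling.

```python
# Scores copied from the benchmark table above as (base, neo) pairs.
SCORES = {
    "leaderboard_bbh": (0.6190, 0.6277),
    "leaderboard_gpqa": (0.4446, 0.4136),
    "leaderboard_math_hard": (0.3965, 0.4063),
    "leaderboard_musr": (0.4339, 0.4630),
}


def delta_pp(base: float, tuned: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round((tuned - base) * 100, 2)


# Hypothetical helper mapping: task name -> delta in percentage points.
DELTAS = {task: delta_pp(base, neo) for task, (base, neo) in SCORES.items()}

if __name__ == "__main__":
    for task, delta in DELTAS.items():
        print(f"{task}: {delta:+.2f} pp")
```

Note that these are absolute percentage-point differences, not relative changes with respect to the base score.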

🗺️ Training Pipeline Overview

Base Model (Qwen/Qwen3.5-9B)
 │
 ▼
Qwen3.5-9B fine-tuned with Unsloth
 │
 ▼
Supervised Fine-Tuning (SFT) + LoRA
(Response-Only Training masked on "<|im_start|>assistant\n<think>")
 │
 ▼
Jackrong/Qwen3.5-9B-Neo
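
The "Response-Only Training" step in the diagram above means that loss is computed only on assistant tokens. The following is a minimal, library-free sketch of that idea, not the actual training code: the token IDs are made up, `mask_labels` and `find_subsequence` are hypothetical helpers, and it assumes the marker tokens themselves are excluded from the loss. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_subsequence(seq, pattern, start=0):
    """Return the first index >= start where pattern occurs in seq, or -1."""
    n, m = len(seq), len(pattern)
    for i in range(start, n - m + 1):
        if seq[i:i + m] == pattern:
            return i
    return -1


def mask_labels(input_ids, response_marker, end_marker):
    """Build labels that keep only tokens after each response marker.

    Everything up to and including the marker is set to IGNORE_INDEX, so only
    the assistant's tokens (through the end-of-turn marker) contribute to loss.
    """
    if not response_marker or not end_marker:
        raise ValueError("markers must be non-empty token sequences")
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        hit = find_subsequence(input_ids, response_marker, pos)
        if hit == -1:
            break
        start = hit + len(response_marker)
        end = find_subsequence(input_ids, end_marker, start)
        # If the turn is never closed, train on everything to the end.
        stop = len(input_ids) if end == -1 else end + len(end_marker)
        labels[start:stop] = input_ids[start:stop]
        pos = stop
    return labels


if __name__ == "__main__":
    # Made-up IDs: [1, 20, 30] stands in for "<|im_start|>assistant\n<think>",
    # and [2] stands in for "<|im_end|>".
    ids = [1, 10, 11, 2, 1, 20, 30, 40, 41, 2]
    print(mask_labels(ids, response_marker=[1, 20, 30], end_marker=[2]))
```

In a real pipeline the marker would be the tokenized form of the string shown in the diagram, and the masking would normally be handled by the training framework rather than written by hand.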

🧠 Example of Learned Reasoning Scaffold

Through data cleaning and consistent formatting, the model was conditioned to structure its thought process explicitly inside <think>...</think> tags before emitting the final answer. This encourages the model to break down complex programming or logical problems methodically, without repeating itself.

<|im_start|>user
[User Query here]<|im_end|>
<|im_start|>assistant
<think>
    ...
</think>
[Final concise and accurate answer]
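
Downstream code usually needs to separate the reasoning trace from the final answer. Below is a minimal sketch of one way to do that for the layout above. The function name `split_reasoning` is hypothetical, and the handling of a missing opening or closing tag is an assumption about how an inference stack might return text, not documented behavior of this model.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) around <think> tags."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        if "<think>" in text:
            # Opened but never closed, e.g. generation was truncated:
            # treat everything after the tag as reasoning, with no answer.
            return text.split("<think>", 1)[1].strip(), ""
        # No tags at all: the whole text is the answer.
        return "", text.strip()
    # Drop the opening tag if present; some chat templates place it in the
    # prompt, so the generated text may start mid-reasoning.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>\nThe answer is 4.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Returning an empty answer for an unclosed block makes truncated generations easy to detect and retry with a larger token budget.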

📚 All Datasets Used

The training data consists of filtered, high-quality reasoning distillation data merged from the sources listed below. The pipeline dynamically sampled conversations from these sources and restructured them so that every example strictly follows the intended chat layout.

stepfun-ai/Step-3.5-Flash-SFT
Jackrong/Competitive-Programming-python-blend (A custom curated blend specifically for Python competitive programming and logic).

Detailed breakdown of the Competitive-Programming-python-blend:

Source	Role in the Blend
`nohurry/Opus-4.6-Reasoning-3000x-filtered`	Reasoning-heavy synthetic SFT data
`Jackrong/Qwen3.5-reasoning-700x`	Distilled reasoning and instruction-following data
`nvidia/Nemotron-SFT-Competitive-Programming-v2` (`competitive_coding_python`)	Primary Python competitive-programming supervision
`nvidia/Nemotron-SFT-Competitive-Programming-v2` (`competitive_coding_cpp`)	Small cross-language competitive-programming supplement
`nvidia/Nemotron-SFT-SWE-v2` (`agentless`)	Lightweight agentless SWE-style supervision
`nvidia/Nemotron-SFT-Instruction-Following-Chat-v2` (`reasoning_on`)	Small reasoning-oriented chat supplement

⚠️ Limitations & Intended Use

Instruction Following Regression: This version shows reduced IFEval scores compared to the base model, reflecting the current pipeline's trade-off toward structured reasoning over rigid format compliance. This will be addressed in future iterations.
Hallucination Risk: While reasoning is strong, the model remains an autoregressive LLM; facts it recalls during the thinking sequence, particularly about real-world events, may occasionally be hallucinated and should be verified independently.
Context Boundaries: In rare cases of extremely complex logic where the reasoning fails to converge, the model may still fall into circular thinking and hit the generation length limit, truncating the output before a final answer is produced.
Intended Scenario: Best suited for offline analytical tasks, coding, competitive programming, math, and heavy logic-dependent prompting where the user needs to transparently follow the AI's internal logic with high token efficiency.
This model is a test version intended solely for learning, demonstration, academic research, and technical exploration.

🙏 Acknowledgements

Many thanks to the Unsloth AI team for making rapid fine-tuning of large language models accessible. We also acknowledge the Qwen team and the open-source community developers producing exceptional distilled datasets. Special thanks to selimaktas for conducting the LM Eval Harness benchmark evaluation.