DSL Debug 7B — SFT Step 100

Qwen2.5-7B-Instruct fine-tuned on 1,593 debugging trajectories for the DSL Debug benchmark.

Training

  • Method: Supervised fine-tuning (verl 0.7 FSDP)
  • Data: 1,593 multi-turn trajectories with tool calls (run, inspect, read_docs, submit)
  • Base model: Qwen2.5-7B-Instruct
  • Epochs: 2 (step 100 checkpoint)
  • LR: 5e-6
  • Hardware: 2x A100-SXM4-80GB
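The trajectory schema is not documented here; as a rough sketch, a multi-turn debugging trajectory with tool calls might look like the following. Only the four tool names (run, inspect, read_docs, submit) come from this card; the roles, field names, and contents are illustrative assumptions.

```python
import json

# Hypothetical trajectory format. Roles and field names are assumptions;
# the tool names (run, inspect, read_docs, submit) are from the card.
trajectory = [
    {"role": "user", "content": "A test fails on nested blocks. Find the bug."},
    {"role": "assistant", "tool_call": {"name": "run", "args": {"cmd": "pytest -x"}}},
    {"role": "tool", "name": "run", "content": "FAILED test_nested - IndexError"},
    {"role": "assistant", "tool_call": {"name": "inspect", "args": {"symbol": "parse_expr"}}},
    {"role": "tool", "name": "inspect", "content": "def parse_expr(tokens): ..."},
    {"role": "assistant", "tool_call": {"name": "submit", "args": {"patch": "..."}}},
]

print(json.dumps(trajectory, indent=2))
```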

Results (held-out test, one-shot)

| Split | Base model | This model |
|---|---|---|
| Standard (481) | 50.5% | 56.3% |
| Nonlocal (200) | 12.0% | 40.0% |
| Intent-Mismatch (177) | 0.6% | 7.9% |
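Since the splits differ in size, a size-weighted overall accuracy is a useful single number. A quick computation from the table above:

```python
# (split size, base accuracy, this model's accuracy) from the results table
splits = [(481, 0.505, 0.563), (200, 0.120, 0.400), (177, 0.006, 0.079)]

n = sum(s for s, _, _ in splits)
base = sum(s * b for s, b, _ in splits) / n
sft = sum(s * m for s, _, m in splits) / n
print(f"overall ({n} tasks): base {base:.1%}, this model {sft:.1%}")
# → overall (858 tasks): base 31.2%, this model 42.5%
```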

Alignment Tax (general capabilities)

| Benchmark | Base | This model |
|---|---|---|
| MMLU | 74.6% | 74.6% |
| GSM8K | 84.9% | 83.9% |
| HumanEval | 65.9% | 62.2% |

Usage

This checkpoint serves primarily as the starting point for the SFT→RL (GRPO) stage, which yields the strongest results on this benchmark. See the collection for all models.

# Download the checkpoint to a local directory
from huggingface_hub import snapshot_download

snapshot_download(
    "andrewlngdn/dsl-debug-7b-sft-step100",
    local_dir="/workspace/models/sft_7b_step100",
)
Checkpoint format: Safetensors, 8B params, BF16.
