metadata
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
  - gerbil
  - scheme
  - lora
  - code
  - dpo
language:
  - en

gerbil-qwen3-coder-30b-bf16

Qwen3-Coder-30B-A3B-Instruct fine-tuned for Gerbil Scheme generation.

Training pipeline and tooling: https://github.com/ober/gerbil-lora

Training pipeline

Three-stage LoRA fine-tune (rank r=32, α=64, with the fused MoE expert layers as LoRA targets):

  1. CPT: continued pre-training on a Gerbil source corpus (lr 2e-5, 2 epochs)
  2. SFT: supervised fine-tuning on instruction/response pairs (lr 1e-4, 2 epochs)
  3. DPO: direct preference optimization on wrong→right pairs, with the corrected code as the chosen response and the wrong code as the rejected one (lr 5e-6, 3 epochs)
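For readers unfamiliar with the third stage, here is a minimal sketch of the standard per-pair DPO loss computed from summed sequence log-probabilities. It is not the code from the gerbil-lora repository. The function names and the `beta=0.1` default are illustrative assumptions, since the card does not state the DPO temperature. In standard DPO the reference model is a frozen copy of the pre-DPO policy; the card does not say which reference was used here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    # For negative x, sigmoid(x) = e^x / (1 + e^x), so log = x - log(1 + e^x).
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (hypothetical helper, beta is an assumed value).

    Each argument is the summed log-probability of a full completion:
    "chosen" is the corrected (right) code, "rejected" the wrong code.
    """
    # Implicit reward margin: how much more the policy, relative to the
    # frozen reference model, prefers the chosen completion over the rejected one.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy still matches the reference, the margin is 0 and the loss is log 2 (about 0.693). Shifting probability mass toward the chosen completion drives the loss below that value, which is the direction the `P(chosen) > P(rejected)` metrics in the eval table track.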

DPO eval (vs base Qwen3-Coder-30B-A3B-Instruct)

| Metric | Base | Trained | Δ |
|---|---|---|---|
| Holdout task score | 31 | 39 | +8 |
| Anti-idioms hit | 1 | 0 | -1 |
| Code blocks wrapped | 9 | 14 | +5 |
| tok_lean_sum (P(chosen) > P(rejected)) | -4.17 | +4.03 | +8.19 |
| wins chosen / rejected (n=66) | 47 / 19 | 52 / 13 | +5 / -6 |
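The exact definitions of `tok_lean_sum` and the win counts live in the training repository and are not given on this card. The sketch below shows one plausible reading and is not the repo's code. It assumes each preference pair is scored by the mean per-token log-probability of the chosen and the rejected completion. The lean sum adds up the per-pair margins, and a "win" goes to whichever side the model rates higher. `preference_summary` is a hypothetical name.

```python
from typing import Iterable, Tuple


def preference_summary(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, int, int]:
    """Summarize model preferences over (chosen, rejected) pairs.

    Assumption: each pair holds the mean per-token log-probability of the
    chosen completion and of the rejected completion under the model.
    Returns (sum of margins, wins for chosen, wins for rejected).
    """
    lean_sum = 0.0
    wins_chosen = wins_rejected = 0
    for lp_chosen, lp_rejected in pairs:
        margin = lp_chosen - lp_rejected
        lean_sum += margin          # positive margin = model leans toward chosen
        if margin > 0:
            wins_chosen += 1
        elif margin < 0:
            wins_rejected += 1
        # margin == 0 is a tie and is counted for neither side
    return lean_sum, wins_chosen, wins_rejected
```

Under this reading, a negative lean sum (as for the base model) means that, summed over the holdout pairs, the model still favored the wrong code.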

Weights

BF16 weights with the LoRA adapters merged into the base model, ~57 GB across 13 safetensors shards.
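As a sanity check on the download size, the sketch below multiplies the parameter count by 2 bytes per BF16 value. It assumes about 30.5B total parameters, Qwen's published figure for the 30B-A3B model. Only about 3.3B parameters are active per token, but every expert is stored on disk. It also ignores safetensors header overhead.

```python
# Assumption: ~30.5B total parameters (Qwen's published figure for the
# 30B-A3B MoE model); all experts are stored, not just the active ones.
TOTAL_PARAMS = 30.5e9
BYTES_PER_PARAM_BF16 = 2   # bfloat16 is a 16-bit format
NUM_SHARDS = 13            # shard count from the card

total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM_BF16
total_gib = total_bytes / 2**30          # binary gigabytes (GiB)
per_shard_gib = total_gib / NUM_SHARDS   # average size of one shard

print(f"{total_bytes / 1e9:.1f} GB = {total_gib:.1f} GiB, ~{per_shard_gib:.1f} GiB per shard")
```

This prints `61.0 GB = 56.8 GiB, ~4.4 GiB per shard`, which lines up with the card's ~57 GB if that figure is reported in GiB.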