Quotebound 27B
The standalone model release from Evidence-Faithful Reasoning, built on the Qwen 3.5 Opus Distilled 27B base.
Quotebound 27B is the downloadable model release for Evidence-Faithful Reasoning: a LoRA adapter that turns the reasoning-distilled 27B base model into an evidence-first reader for closed packets of source text. Every answer has to land on the right evidence units, quote them verbatim, and stop with `Insufficient evidence.` when the packet does not justify a claim.
On a fresh 36-task public holdout, Quotebound 27B improves task accuracy,
evidence F1, and quote F1 over the prior bridge model. The packet-local
quote normalizer carries the full stack to 0.9093 quote F1.
At a glance
- What it is. A LoRA adapter on top of `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`, trained to answer from closed packets of source text under a strict answer–evidence–quote–abstain contract.
- The headline number. Raw quote F1 on a fresh public holdout roughly doubles over the prior bridge model (0.3343 → 0.6815), meaning much more of the grounding behavior now lives inside the model itself instead of in a post-processing layer.
- Other deltas on the same holdout. Raw task: 0.8611 → 0.8889. Raw strict: 0.2222 → 0.4444. Raw evidence F1: 0.8815 → 0.9093. Zero invalid outputs across every reported evaluation surface.
- What it isn't. Not a general chatbot. Not a replacement for the benchmark-winning hybrid system, which is described below as a separate result.
Read next
- Technical note — full method, results, and discussion.
- Frozen benchmark progression chart
Quick start
Load the 27B base model and attach the adapter:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
```
The base is a 27B-parameter model, so load it in whichever quantization
your hardware supports (4-bit bitsandbytes works for inference).
The contract
Each task arrives with a closed packet of source text. To count as a success, the model has to clear four conditions on the same answer:
- Answer correctly — return the right answer or label for the task.
- Pick the right evidence — the cited units must be the packet locations that actually support the answer.
- Quote exact support — every quote is a verbatim substring of its cited unit. No paraphrase, no stitching, no ellipsis.
- Abstain when blocked — if the packet does not justify a claim, the answer must be exactly `Insufficient evidence.`
Correctness alone is not credited. The model has been trained to fail closed when the packet runs out, and to ground every answer it does return.
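The four gates compose into a single pass/fail check per answer. As a minimal sketch in Python (field names follow the JSON output shape described under Prompt format; the gold-record layout and this exact scoring logic are assumptions, not the release's official scorer):

```python
def passes_contract(response: dict, packet: dict, gold: dict) -> bool:
    """Check the four gates of the contract on one answer.

    Illustrative only: `packet` maps unit ids to their text, and
    `gold` is an assumed reference record with the expected answer
    and supporting unit ids.
    """
    abstain = "Insufficient evidence."
    # Gate 4: fail closed. When the packet cannot justify a claim,
    # the only correct answer is the exact abstain string.
    if gold["answer"] == abstain:
        return response["answer"] == abstain
    # Gate 1: answer correctly.
    if response["answer"] != gold["answer"]:
        return False
    # Gate 2: cited units must be the locations that support the answer.
    if set(response["evidence_ids"]) != set(gold["evidence_ids"]):
        return False
    # Gate 3: every quote is a verbatim substring of its cited unit.
    return all(
        q["quote"] in packet.get(q["unit_id"], "")
        for q in response["quotes"]
    )
```

Note that gate 3 is a plain substring test: any paraphrase, stitching, or ellipsis breaks it, which is exactly the behavior the contract demands.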
Prompt format
The model is trained for an evidence-first prompt that makes the answer subordinate to the cited text. A minimal version:
```text
You are answering from a bounded evidence packet only.

Work in this order:
1. Identify the smallest set of packet units that matters.
2. Copy exact quote(s) from those units.
3. Only then give the final answer.

Rules:
- No outside facts.
- Return valid JSON only.
- Every quote must be a verbatim substring of the cited unit.
- Do not paraphrase, ellipsize, or stitch quotes.
- If the packet is insufficient, the `answer` field must be exactly
  `Insufficient evidence.`
```
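For illustration, the rules and a packet might be assembled into one prompt string like this; the `[unit_id] text` layout and the `build_prompt` helper are assumptions, since the release does not specify how packets are rendered:

```python
# Shortened stand-in for the full rule block above.
RULES = (
    "You are answering from a bounded evidence packet only.\n"
    "Return valid JSON only. Every quote must be a verbatim "
    "substring of the cited unit."
)

def build_prompt(task_id: str, question: str, packet: dict) -> str:
    """Render a closed packet plus one task into a single prompt.

    `packet` maps unit ids to source text; each unit is emitted as
    a `[unit_id] text` line so the model can cite it by id.
    """
    units = "\n".join(f"[{uid}] {text}" for uid, text in packet.items())
    return f"{RULES}\n\nPacket:\n{units}\n\nTask {task_id}: {question}"
```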
The model then writes a JSON object with this shape:
```json
{
  "task_id": "<task id>",
  "label": "support|contradict|insufficient|null",
  "answer": "<one-sentence answer>",
  "evidence_ids": ["unit_id_1", "unit_id_2"],
  "quotes": [
    {"unit_id": "unit_id_1", "quote": "<exact quote>"}
  ],
  "justification": "<one short sentence tied to the cited evidence>"
}
```
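A consumer of this output can fail closed on malformed objects. A minimal schema check, assuming the shape above (`parse_response` is a hypothetical helper; the release reports zero invalid outputs, but downstream code still has to guard):

```python
import json

REQUIRED_KEYS = {"task_id", "label", "answer",
                 "evidence_ids", "quotes", "justification"}
VALID_LABELS = {"support", "contradict", "insufficient", "null"}

def parse_response(raw: str):
    """Parse one model output string.

    Returns the parsed dict, or None when the text is not valid JSON
    or violates the documented shape (hypothetical helper).
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    if obj["label"] not in VALID_LABELS:
        return None
    # Each quote entry must carry a unit id and the quoted text.
    if not isinstance(obj["quotes"], list) or not all(
        isinstance(q, dict) and {"unit_id", "quote"} <= q.keys()
        for q in obj["quotes"]
    ):
        return None
    return obj
```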
Evaluation
Fresh 36-task mixed public holdout
A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA tasks, drawn from public sources and de-duplicated against every training, dev, and held-out probe row.
| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
| Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
| Bridge + deterministic_v3 | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
| Quotebound + deterministic_v3 | 0.8889 | 0.5833 | 0.9093 | 0.9093 |
Quotebound 27B beats the prior bridge model on task accuracy, evidence F1, and quote F1 in both raw and normalized form, ties normalized strict, and roughly doubles raw quote F1 at the model level.
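The evidence F1 column can be read as a set-overlap score per task. One plausible per-task computation is sketched below; the release's exact metric definitions live in the technical note, so this set-based version is an assumption for illustration:

```python
def set_f1(pred: set, gold: set) -> float:
    """F1 overlap between predicted and gold id sets for one task."""
    if not pred and not gold:
        return 1.0  # both empty: vacuously perfect
    tp = len(pred & gold)
    if tp == 0:
        return 0.0  # no overlap (covers one-side-empty cases too)
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Per-task scores of this form would then be averaged over the 36 holdout tasks to produce a column like Evidence F1.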
Fixed dev triage slice (21 tasks)
| Stack | Task | Strict | Evidence F1 | Quote F1 |
|---|---|---|---|---|
| Quotebound + deterministic_v3 | 1.0000 | 0.6190 | 0.8320 | 0.7095 |
Untouched 104-task HotpotQA shadow slice
On a 104-task HotpotQA shadow slice that was never touched during
selection, Quotebound raw improved quote-faithful behavior over the prior
bridge model, and Quotebound plus deterministic_v3 matched bridge +
deterministic_v3 at the system level. The surface is reported as a
narrative parity result because the freeze memo does not publish
per-metric cells for it.
Release architecture
The project ends in two finished results that are reported separately on purpose. One is the strongest full system on the held-out benchmark; the other is the strongest standalone model — and the artifact you can actually download.
- Quotebound 27B — this page. The adapter above is the strongest version of the project's evidence-faithful behavior that moved into the model itself, evaluated across multiple surfaces beyond the held-out probe.
- The benchmark-winning hybrid system. A trained bridge checkpoint plus the `deterministic_v3` packet-local quote normalizer. That stack is the only configuration that clears every gate of the strict contract on the frozen held-out probe (`probe_v0`).
The two results do not collapse into one. The hybrid system is the
benchmark winner. Quotebound 27B is the downloadable model. Perfect
probe_v0 belongs to the hybrid system, not to the adapter on this page
alone.
Intended use
Use this release for work that has to stay inside a fixed body of text:
- bounded document QA with explicit evidence requirements,
- claim verification and grounded QA from closed packets of source text,
- policy, compliance, contract, and internal-document workflows where each answer has to be justified from the provided text,
- research on evidence-faithful reasoning and abstention behavior.
Limitations
- The download is the LoRA adapter only — the 27B base model is required.
- The `deterministic_v3` packet-local quote normalizer is not shipped here. It lives in the project repository as a separate post-processing step. Quotebound 27B alone reproduces the raw standalone gains above; normalized system-level rows require adapter + normalizer.
- Perfect `probe_v0` belongs to the benchmark-winning hybrid system, not to this adapter alone.
- Specialized for closed-packet reasoning. Behavior outside that setting — open chat, open-domain QA, free-form generation — is not characterized.
- Raw item-level contents of the held-out probe are intentionally not published with the release; the held-out gate has to stay closed to remain meaningful.
Citation and references
- Base model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
- Datasets: fever/fever, hotpotqa/hotpot_qa
- Technical note: technical_note_evidence_faithful_reasoning.md
```bibtex
@misc{quotebound_27b_2026,
  title        = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
  author       = {{darcar0}},
  year         = {2026},
  howpublished = {Hugging Face model release},
  url          = {https://huggingface.co/darcar0/quotebound-27b}
}
```