Quotebound 27B

The standalone model release from Evidence-Faithful Reasoning, built on the Qwen 3.5 Opus Distilled 27B base.

Quotebound 27B is the downloadable model release for Evidence-Faithful Reasoning: a LoRA adapter that turns its reasoning-distilled 27B base model into an evidence-first reader for closed packets of source text. Every answer has to land on the right evidence units, quote them verbatim, and stop with exactly `Insufficient evidence.` when the packet does not justify a claim.

Fresh public holdout: Quotebound 27B versus the prior bridge model

On a fresh 36-task public holdout, Quotebound 27B improves task accuracy, evidence F1, and quote F1 over the prior bridge model. The packet-local quote normalizer carries the full stack to 0.9093 quote F1.

At a glance

  • What it is. A LoRA adapter on top of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2, trained to answer from closed packets of source text under a strict answer–evidence–quote–abstain contract.
  • The headline number. Raw quote F1 on a fresh public holdout roughly doubles over the prior bridge model (0.3343 → 0.6815), meaning much more of the grounding behavior now lives inside the model itself instead of in a post-processing layer.
  • Other deltas on the same holdout. Raw task: 0.8611 → 0.8889. Raw strict: 0.2222 → 0.4444. Raw evidence F1: 0.8815 → 0.9093. Zero invalid outputs across every reported evaluation surface.
  • What it isn't. Not a general chatbot. Not a replacement for the benchmark-winning hybrid system, which is described below as a separate result.

Quick start

Load the 27B base model and attach the adapter:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

# Load tokenizer and base weights, then attach the Quotebound LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

The base is a 27B-parameter model, so load it in whichever quantization your hardware supports (4-bit bitsandbytes works for inference).
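For the 4-bit bitsandbytes path, a minimal sketch is below. It assumes `transformers` with `BitsAndBytesConfig` and a CUDA-capable GPU; the specific quantization settings (NF4, bf16 compute) are common defaults, not values published with this release.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

# NF4 4-bit weights with bf16 compute: a common inference setup for a 27B base.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
```

Quantizing the base does not change the adapter weights; PEFT attaches the LoRA deltas on top of the quantized layers at load time.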

The contract

Each task arrives with a closed packet of source text. To count as a success, the model has to clear four conditions on the same answer:

  1. Answer correctly — return the right answer or label for the task.
  2. Pick the right evidence — the cited units must be the packet locations that actually support the answer.
  3. Quote exact support — every quote is a verbatim substring of its cited unit. No paraphrase, no stitching, no ellipsis.
  4. Abstain when blocked — if the packet does not justify a claim, the answer must be exactly `Insufficient evidence.`

Correctness alone is not credited. The model has been trained to fail closed when the packet runs out, and to ground every answer it does return.

Prompt format

The model is trained for an evidence-first prompt that makes the answer subordinate to the cited text. A minimal version:

You are answering from a bounded evidence packet only.

Work in this order:
1. Identify the smallest set of packet units that matters.
2. Copy exact quote(s) from those units.
3. Only then give the final answer.

Rules:
- No outside facts.
- Return valid JSON only.
- Every quote must be a verbatim substring of the cited unit.
- Do not paraphrase, ellipsize, or stitch quotes.
- If the packet is insufficient, the `answer` field must be exactly
  `Insufficient evidence.`
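The prompt above has to be paired with the packet itself. One way to render packet units into the prompt is a small helper like the sketch below; the `[unit_id] text` serialization is an assumption for illustration, not the release's documented packet format.

```python
def build_prompt(task: str, units: list[tuple[str, str]]) -> str:
    """Render a closed evidence packet plus a task into the evidence-first prompt.

    `units` is a list of (unit_id, text) pairs. The exact packet serialization
    used in training is not published here; this layout is illustrative.
    """
    header = (
        "You are answering from a bounded evidence packet only.\n\n"
        "Work in this order:\n"
        "1. Identify the smallest set of packet units that matters.\n"
        "2. Copy exact quote(s) from those units.\n"
        "3. Only then give the final answer.\n"
    )
    packet = "\n".join(f"[{uid}] {text}" for uid, text in units)
    return f"{header}\nPacket:\n{packet}\n\nTask: {task}\n"

prompt = build_prompt(
    "Does the packet support the claim that the treaty was signed in 1648?",
    [("unit_1", "The treaty was signed in 1648 after years of negotiation.")],
)
```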

The model then writes a JSON object with this shape:

{
  "task_id": "<task id>",
  "label": "support|contradict|insufficient|null",
  "answer": "<one-sentence answer>",
  "evidence_ids": ["unit_id_1", "unit_id_2"],
  "quotes": [
    {"unit_id": "unit_id_1", "quote": "<exact quote>"}
  ],
  "justification": "<one short sentence tied to the cited evidence>"
}

Evaluation

Fresh 36-task mixed public holdout

A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA grounded-QA tasks, drawn from public sources and de-duplicated against every training, dev, and held-out probe row.

Stack                          Task     Strict   Evidence F1   Quote F1
Bridge raw                     0.8611   0.2222   0.8815        0.3343
Quotebound raw                 0.8889   0.4444   0.9093        0.6815
Bridge + deterministic_v3      0.8611   0.5833   0.8815        0.8815
Quotebound + deterministic_v3  0.8889   0.5833   0.9093        0.9093

Quotebound 27B beats the prior bridge model on task accuracy, evidence F1, and quote F1 in both raw and normalized form, ties normalized strict, and roughly doubles raw quote F1 at the model level.

Fixed dev triage slice (21 tasks)

Stack                          Task     Strict   Evidence F1   Quote F1
Quotebound + deterministic_v3  1.0000   0.6190   0.8320        0.7095

Untouched 104-task HotpotQA shadow slice

On a 104-task HotpotQA shadow slice that was never touched during selection, Quotebound raw improved quote-faithful behavior over the prior bridge model, and Quotebound plus deterministic_v3 matched bridge + deterministic_v3 at the system level. The surface is reported as a narrative parity result because the freeze memo does not publish per-metric cells for it.

Release architecture

The project ends in two finished results that are reported separately on purpose. One is the strongest full system on the held-out benchmark; the other is the strongest standalone model — and the artifact you can actually download.

  1. Quotebound 27B — this page. The adapter above is the strongest version of the project's evidence-faithful behavior that moved into the model itself, evaluated across multiple surfaces beyond the held-out probe.
  2. The benchmark-winning hybrid system. A trained bridge checkpoint plus the deterministic_v3 packet-local quote normalizer. That stack is the only configuration that clears every gate of the strict contract on the frozen held-out probe (probe_v0).

The two results do not collapse into one. The hybrid system is the benchmark winner. Quotebound 27B is the downloadable model. Perfect probe_v0 belongs to the hybrid system, not to the adapter on this page alone.

Intended use

Use this release for work that has to stay inside a fixed body of text:

  • bounded document QA with explicit evidence requirements,
  • claim verification and grounded QA from closed packets of source text,
  • policy, compliance, contract, and internal-document workflows where each answer has to be justified from the provided text,
  • research on evidence-faithful reasoning and abstention behavior.

Limitations

  • The download is the LoRA adapter only — the 27B base model is required.
  • The deterministic_v3 packet-local quote normalizer is not shipped here. It lives in the project repository as a separate post-processing step. Quotebound 27B alone reproduces the raw standalone gains above; normalized system-level rows require adapter + normalizer.
  • Perfect probe_v0 belongs to the benchmark-winning hybrid system, not to this adapter alone.
  • Specialized for closed-packet reasoning. Behavior outside that setting — open chat, open-domain QA, free-form generation — is not characterized.
  • Raw item-level contents of the held-out probe are intentionally not published with the release; the held-out gate has to stay closed to remain meaningful.
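Since deterministic_v3 itself is not shipped here, the sketch below only illustrates the general flavor of a packet-local quote normalizer, snapping a near-miss quote to the closest exact substring of its cited unit. It is not the released algorithm; the 0.8 similarity cutoff and sliding-window approach are illustrative choices.

```python
import difflib

def normalize_quote(quote: str, unit_text: str) -> str:
    """Snap a near-miss quote to the closest exact substring of its cited unit.

    Illustrative only: the project's deterministic_v3 normalizer is a separate,
    unpublished component. This just shows the packet-local idea: any rewrite
    is constrained to verbatim text from the cited unit itself.
    """
    if quote in unit_text:
        return quote  # already verbatim, nothing to do
    # Slide a window of the quote's length over the unit and keep the
    # candidate substring with the highest character-level similarity.
    n = len(quote)
    best, best_score = quote, 0.0
    for start in range(0, max(1, len(unit_text) - n + 1)):
        candidate = unit_text[start:start + n]
        score = difflib.SequenceMatcher(None, quote, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    # Only rewrite when the match is close; otherwise leave the quote alone.
    return best if best_score >= 0.8 else quote
```

Because candidates are drawn only from the cited unit, the normalizer can repair casing or small copy errors but cannot invent text the packet does not contain, which is what makes the normalized quote F1 rows a fair system-level comparison.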

Citation and references

@misc{quotebound_27b_2026,
  title        = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
  author       = {{darcar0}},
  year         = {2026},
  howpublished = {Hugging Face model release},
  url          = {https://huggingface.co/darcar0/quotebound-27b}
}