Commit 3a33e72 (verified), committed by darcar0 · 1 parent: b314f67

Polish model card structure and copy

Files changed (1)
  1. README.md +132 -113
README.md CHANGED
@@ -4,6 +4,7 @@ language:
4
  license: apache-2.0
5
  base_model:
6
  - Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
 
7
  datasets:
8
  - fever/fever
9
  - hotpotqa/hotpot_qa
@@ -12,6 +13,7 @@ pipeline_tag: text-generation
12
  tags:
13
  - reasoning
14
  - evidence-grounding
 
15
  - attribution
16
  - fever
17
  - hotpotqa
@@ -23,47 +25,57 @@ tags:
23
 
24
  # Quotebound 27B
25
 
26
- *The standalone model release from Evidence-Faithful Reasoning, built on the
27
- Qwen 3.5 Opus Distilled 27B base.*
28
 
29
- Quotebound 27B is the downloadable model release for
30
- Evidence-Faithful Reasoning: a LoRA adapter that turns its
31
- reasoning-distilled 27B base model into an evidence-first reader for
32
- closed packets of source text. Every answer has to land on the right
33
- evidence units, quote them verbatim, and stop with
34
- `Insufficient evidence.` when the packet does not justify a claim.
35
 
36
- ![Fresh public holdout: Quotebound 27B versus the prior bridge model](./standalone_holdout_comparison.svg)
37
 
38
  *On a fresh 36-task public holdout, Quotebound 27B improves task accuracy,
39
- evidence F1, and quote F1 over the prior bridge model. The packet-local
40
- quote normalizer carries the full stack to `0.9093` quote F1.*
41
-
42
- ## At a glance
43
-
44
- - **What it is.** A LoRA adapter on top of
45
- [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2),
46
- trained to answer from closed packets of source text under a strict
47
- answer–evidence–quote–abstain contract.
48
- - **The headline number.** Raw quote F1 on a fresh public holdout roughly
49
- doubles over the prior bridge model (`0.3343` → `0.6815`), meaning much
50
- more of the grounding behavior now lives inside the model itself instead
51
- of in a post-processing layer.
52
- - **Other deltas on the same holdout.** Raw task: `0.8611` → `0.8889`.
53
- Raw strict: `0.2222` → `0.4444`. Raw evidence F1: `0.8815` → `0.9093`.
54
- Zero invalid outputs across every reported evaluation surface.
55
- - **What it isn't.** Not a general chatbot. Not a replacement for the
56
- benchmark-winning hybrid system, which is described below as a separate
57
- result.
58
 
59
- ## Read next
60
 
61
- - [Technical note](./technical_note_evidence_faithful_reasoning.md): full method, results, and discussion.
62
- - [Frozen benchmark progression chart](./benchmark_progression.svg)
63
 
64
  ## Quick start
65
 
66
- Load the 27B base model and attach the adapter:
67
 
68
  ```python
69
  from peft import PeftModel
@@ -73,36 +85,37 @@ base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
73
  adapter_id = "darcar0/quotebound-27b"
74
 
75
  tokenizer = AutoTokenizer.from_pretrained(base_id)
76
- base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
77
  model = PeftModel.from_pretrained(base, adapter_id)
 
78
  ```
79
 
80
- The base is a 27B-parameter model, so load it in whichever quantization
81
- your hardware supports (4-bit `bitsandbytes` works for inference).
 
82
 
83
- ## The contract
84
 
85
- Each task arrives with a closed packet of source text. To count as a
86
- success, the model has to clear four conditions on the same answer:
87
 
88
- 1. **Answer correctly** — return the right answer or label for the task.
89
- 2. **Pick the right evidence** — the cited units must be the packet
90
- locations that actually support the answer.
91
- 3. **Quote exact support** — every quote is a verbatim substring of its
92
- cited unit. No paraphrase, no stitching, no ellipsis.
93
- 4. **Abstain when blocked** — if the packet does not justify a claim,
94
- the answer must be exactly `Insufficient evidence.`
95
-
96
- Correctness alone is not credited. The model has been trained to fail
97
- closed when the packet runs out, and to ground every answer it does
98
- return.
99
-
100
- ## Prompt format
101
 
102
  The model is trained for an evidence-first prompt that makes the answer
103
  subordinate to the cited text. A minimal version:
104
 
105
- ```
106
  You are answering from a bounded evidence packet only.
107
 
108
  Work in this order:
@@ -115,11 +128,11 @@ Rules:
115
  - Return valid JSON only.
116
  - Every quote must be a verbatim substring of the cited unit.
117
  - Do not paraphrase, ellipsize, or stitch quotes.
118
- - If the packet is insufficient, the `answer` field must be exactly
119
- `Insufficient evidence.`
120
  ```
121
 
122
- The model then writes a JSON object with this shape:
123
 
124
  ```json
125
  {
@@ -138,9 +151,9 @@ The model then writes a JSON object with this shape:
138
 
139
  ### Fresh 36-task mixed public holdout
140
 
141
- A held-out slice of 18 FEVER verify-claim tasks plus 18 HotpotQA
142
- grounded-QA tasks, drawn from public sources and de-duplicated against
143
- every training, dev, and held-out probe row.
144
 
145
  | Stack | Task | Strict | Evidence F1 | Quote F1 |
146
  |---|---:|---:|---:|---:|
@@ -149,11 +162,17 @@ every training, dev, and held-out probe row.
149
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
150
  | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |
151
 
152
- Quotebound 27B beats the prior bridge model on task accuracy, evidence F1,
153
- and quote F1 in both raw and normalized form, ties normalized strict, and
154
- roughly doubles raw quote F1 at the model level.
155
 
156
- ### Fixed dev triage slice (21 tasks)
157
 
158
  | Stack | Task | Strict | Evidence F1 | Quote F1 |
159
  |---|---:|---:|---:|---:|
@@ -161,69 +180,59 @@ roughly doubles raw quote F1 at the model level.
161
 
162
  ### Untouched 104-task HotpotQA shadow slice
163
 
164
- On a 104-task HotpotQA shadow slice that was never touched during
165
- selection, Quotebound raw improved quote-faithful behavior over the prior
166
- bridge model, and Quotebound plus `deterministic_v3` matched bridge +
167
- `deterministic_v3` at the system level. The surface is reported as a
168
- narrative parity result because the freeze memo does not publish
169
- per-metric cells for it.
170
 
171
  ## Release architecture
172
 
173
- The project ends in two finished results that are reported separately on
174
- purpose. One is the strongest full system on the held-out benchmark; the
175
- other is the strongest standalone model — and the artifact you can
176
- actually download.
177
-
178
- 1. **Quotebound 27B (this page).** The adapter above is the strongest
179
- version of the project's evidence-faithful behavior that moved into the
180
- model itself, evaluated across multiple surfaces beyond the held-out
181
- probe.
182
- 2. **The benchmark-winning hybrid system.** A trained bridge checkpoint
183
- plus the `deterministic_v3` packet-local quote normalizer. That stack
184
- is the only configuration that clears every gate of the strict
185
- contract on the frozen held-out probe (`probe_v0`).
186
-
187
- The two results do not collapse into one. The hybrid system is the
188
- benchmark winner. Quotebound 27B is the downloadable model. Perfect
189
- `probe_v0` belongs to the hybrid system, not to the adapter on this page
190
- alone.
191
 
192
  ## Intended use
193
 
194
- Use this release for work that has to stay inside a fixed body of text:
195
 
196
  - bounded document QA with explicit evidence requirements,
197
- - claim verification and grounded QA from closed packets of source text,
198
- - policy, compliance, contract, and internal-document workflows where
199
- each answer has to be justified from the provided text,
200
- - research on evidence-faithful reasoning and abstention behavior.
201
 
202
  ## Limitations
203
 
204
- - The download is the LoRA adapter only; the 27B base model is required.
205
- - The `deterministic_v3` packet-local quote normalizer is *not* shipped
206
- here. It lives in the project repository as a separate post-processing
207
- step. Quotebound 27B alone reproduces the raw standalone gains above;
208
- normalized system-level rows require adapter + normalizer.
209
- - Perfect `probe_v0` belongs to the benchmark-winning hybrid system, not
210
- to this adapter alone.
211
- - Specialized for closed-packet reasoning. Behavior outside that setting
212
- (open chat, open-domain QA, free-form generation) is not
213
- characterized.
214
- - Raw item-level contents of the held-out probe are intentionally not
215
- published with the release; the held-out gate has to stay closed to
216
- remain meaningful.
217
-
218
- ## Citation and references
219
 
220
- - Base model:
221
- [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)
222
- - Datasets:
223
- [fever/fever](https://huggingface.co/datasets/fever/fever),
224
- [hotpotqa/hotpot_qa](https://huggingface.co/datasets/hotpotqa/hotpot_qa)
225
- - Technical note:
226
- [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)
 
227
 
228
  ```bibtex
229
  @misc{quotebound_27b_2026,
@@ -234,3 +243,13 @@ Use this release for work that has to stay inside a fixed body of text:
234
  url = {https://huggingface.co/darcar0/quotebound-27b}
235
  }
236
  ```
 
4
  license: apache-2.0
5
  base_model:
6
  - Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2
7
+ base_model_relation: adapter
8
  datasets:
9
  - fever/fever
10
  - hotpotqa/hotpot_qa
 
13
  tags:
14
  - reasoning
15
  - evidence-grounding
16
+ - grounded-qa
17
  - attribution
18
  - fever
19
  - hotpotqa
 
25
 
26
  # Quotebound 27B
27
 
28
+ **A 27B LoRA adapter for evidence-faithful reasoning over closed packets of
29
+ source text.**
30
 
31
+ Quotebound 27B is the standalone model release from the
32
+ Evidence-Faithful Reasoning project. It is trained to read a bounded evidence
33
+ packet, identify the supporting units, copy exact quotes, and abstain with
34
+ `Insufficient evidence.` when the packet does not justify an answer.
35
 
36
+ The project asks a stricter question than "did the model get the answer right?"
37
+ It asks whether the answer is recoverably grounded in the supplied text.
38
+
39
+ ![Fresh public holdout: Quotebound 27B versus the prior bridge model](./assets/standalone_holdout_comparison.svg)
40
 
41
  *On a fresh 36-task public holdout, Quotebound 27B improves task accuracy,
42
+ evidence F1, and quote F1 over the prior bridge model. The largest raw gain is
43
+ quote faithfulness: `0.3343` -> `0.6815`.*
44
 
45
+ ## Result snapshot
46
+
47
+ | Question | Answer |
48
+ |---|---|
49
+ | What ships here? | A PEFT/LoRA adapter for `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`. |
50
+ | What changed inside the model? | Raw quote F1 roughly doubled on the fresh public holdout: `0.3343` -> `0.6815`. |
51
+ | Best standalone-system row on that holdout | Quotebound + `deterministic_v3`: task `0.8889`, strict `0.5833`, evidence F1 `0.9093`, quote F1 `0.9093`. |
52
+ | Output reliability | Zero invalid outputs across every reported evaluation surface. |
53
+ | Important boundary | Perfect `probe_v0` belongs to the benchmark-winning hybrid stack, not to this adapter alone. |
54
+
55
+ ## Why this model exists
56
+
57
+ Reasoning-tuned models can sound structured while grounding badly: they may
58
+ answer correctly but cite the wrong evidence, corrupt a quote, or keep going
59
+ when the packet is actually insufficient.
60
 
61
+ Quotebound 27B is trained for a narrower, auditable behavior:
62
+
63
+ 1. choose the smallest sufficient evidence units,
64
+ 2. quote those units verbatim,
65
+ 3. answer only from those units,
66
+ 4. refuse cleanly when the packet runs out.
67
+
68
+ Correctness alone is not credited. The model is meant for settings where a user
69
+ needs the answer and the support to survive inspection together.
70
 
71
  ## Quick start
72
 
73
+ Install the usual Transformers + PEFT stack, then load the base model and
74
+ attach the adapter:
75
+
76
+ ```bash
77
+ pip install -U transformers peft accelerate bitsandbytes
78
+ ```
79
 
80
  ```python
81
  from peft import PeftModel
 
85
  adapter_id = "darcar0/quotebound-27b"
86
 
87
  tokenizer = AutoTokenizer.from_pretrained(base_id)
88
+ base = AutoModelForCausalLM.from_pretrained(
89
+     base_id,
90
+     device_map="auto",
91
+     torch_dtype="auto",
92
+ )
93
  model = PeftModel.from_pretrained(base, adapter_id)
94
+ model.eval()
95
  ```
96
 
97
+ The base is a 27B-parameter model. Use the quantization and serving setup your
98
+ hardware requires; 4-bit loading with `bitsandbytes` is a practical inference
99
+ path on constrained GPUs.
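
The 4-bit path mentioned above could look roughly like the sketch below. It is not part of the commit. The function name `load_quotebound_4bit` is hypothetical, and the NF4 quantization type and bfloat16 compute dtype are assumed settings, not values documented for this release; adjust them to your hardware.

```python
def load_quotebound_4bit(
    base_id="Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2",
    adapter_id="darcar0/quotebound-27b",
):
    """Load the base model in 4-bit and attach the adapter (hypothetical helper)."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from peft import PeftModel
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Assumed settings: NF4 weights with bfloat16 compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        device_map="auto",
        quantization_config=bnb_config,
    )
    # Attach the LoRA adapter on top of the quantized base and switch to eval mode.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_quotebound_4bit()` downloads both repositories, so it needs network access and a GPU supported by `bitsandbytes`.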
100
 
101
+ ## Model details
102
 
103
+ | Field | Value |
104
+ |---|---|
105
+ | Adapter | `darcar0/quotebound-27b` |
106
+ | Base model | [`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2) |
107
+ | Artifact type | LoRA / PEFT adapter |
108
+ | Primary behavior | Closed-packet grounded QA, claim verification, exact quote attribution, and abstention |
109
+ | Output style | JSON with answer, evidence IDs, verbatim quotes, and short justification |
110
+ | Training sources | Public FEVER-style verify-claim data, public HotpotQA-style grounded-QA data, and project-local packet scaffolding derived from those sources |
111
+ | License | Apache 2.0 |
112
 
113
+ ## Prompt contract
114
 
115
  The model is trained for an evidence-first prompt that makes the answer
116
  subordinate to the cited text. A minimal version:
117
 
118
+ ```text
119
  You are answering from a bounded evidence packet only.
120
 
121
  Work in this order:
 
128
  - Return valid JSON only.
129
  - Every quote must be a verbatim substring of the cited unit.
130
  - Do not paraphrase, ellipsize, or stitch quotes.
131
+ - If the packet is insufficient, the answer field must be exactly
132
+ "Insufficient evidence."
133
  ```
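
The quote and abstention rules in the prompt can be checked mechanically on the caller's side. The sketch below is one possible checker, not the project's evaluator or the `deterministic_v3` normalizer. The field names `answer`, `quotes`, `unit_id`, and `text` are hypothetical, since the diff does not show the full output schema, and `packet` is assumed to be a mapping from unit ID to unit text.

```python
import json

ABSTAIN = "Insufficient evidence."


def check_output(raw_json, packet):
    """Return a list of contract violations; an empty list means the output passes.

    `packet` maps unit IDs to unit text. All field names are hypothetical.
    """
    try:
        out = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(out, dict):
        return ["output is not a JSON object"]

    problems = []
    answer = out.get("answer")
    if not isinstance(answer, str) or not answer:
        problems.append("missing answer")
    elif (
        answer.strip().rstrip(".").lower() == "insufficient evidence"
        and answer != ABSTAIN
    ):
        # Near-miss abstentions violate the "must be exactly" rule.
        problems.append("abstention is not the exact required string")

    quotes = out.get("quotes", [])
    if not isinstance(quotes, list):
        return problems + ["quotes is not a list"]
    for quote in quotes:
        if not isinstance(quote, dict):
            problems.append("quote entry is not an object")
            continue
        unit = packet.get(quote.get("unit_id"))
        text = quote.get("text", "")
        if unit is None:
            problems.append(f"unknown unit {quote.get('unit_id')!r}")
        elif not text or text not in unit:
            # Plain substring test: paraphrased, ellipsized, or stitched quotes fail.
            problems.append(
                f"quote is not a verbatim substring of unit {quote.get('unit_id')!r}"
            )
    return problems
```

A plain substring test is deliberately strict: a quote that joins two spans with `...` or changes a single character fails, which matches the "no paraphrase, no ellipsis, no stitching" rule.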
134
 
135
+ Expected output shape:
136
 
137
  ```json
138
  {
 
151
 
152
  ### Fresh 36-task mixed public holdout
153
 
154
+ The main standalone comparison uses a fresh 36-task public holdout: 18 FEVER
155
+ verify-claim tasks and 18 HotpotQA grounded-QA tasks. Source rows were
156
+ de-duplicated against training, dev, and `probe_v0` rows.
157
 
158
  | Stack | Task | Strict | Evidence F1 | Quote F1 |
159
  |---|---:|---:|---:|---:|
 
162
  | Bridge + `deterministic_v3` | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
163
  | **Quotebound + `deterministic_v3`** | **0.8889** | **0.5833** | **0.9093** | **0.9093** |
164
 
165
+ How to read this table:
166
+
167
+ - **Raw rows** measure the model outputs before quote repair.
168
+ - **`deterministic_v3` rows** add the packet-local quote normalizer from the
169
+ project repository.
170
+ - Quotebound improves task accuracy, evidence F1, and quote F1 in both raw and
171
+ normalized form; it also ties normalized strict success.
172
+ - The largest model-side gain is raw quote faithfulness, from `0.3343` to
173
+ `0.6815`.
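
With only 36 tasks, each task moves a per-task rate by about `0.0278`, so the task and strict columns map onto whole-task counts. The sketch below recovers those implied counts. It assumes the task and strict columns are simple success fractions over the 36 tasks, which the card does not state explicitly, and `implied_count` is a hypothetical helper.

```python
def implied_count(rate, n_tasks=36, tol=5e-5):
    """Return the task count whose ratio rounds to `rate`, or raise ValueError."""
    count = round(rate * n_tasks)
    if abs(count / n_tasks - rate) > tol:
        raise ValueError(f"{rate} is not a 4-decimal multiple of 1/{n_tasks}")
    return count


for label, rate in [
    ("Quotebound task", 0.8889),
    ("Bridge task", 0.8611),
    ("Normalized strict", 0.5833),
]:
    print(f"{label}: {implied_count(rate)} / 36")
# prints:
# Quotebound task: 32 / 36
# Bridge task: 31 / 36
# Normalized strict: 21 / 36
```

Under that assumption the task-accuracy gap between the two stacks is a single task, while F1 values such as `0.9093` are averaged scores and do not correspond to a whole count (`implied_count(0.9093)` raises).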
174
 
175
+ ### Fixed dev triage slice
176
 
177
  | Stack | Task | Strict | Evidence F1 | Quote F1 |
178
  |---|---:|---:|---:|---:|
 
180
 
181
  ### Untouched 104-task HotpotQA shadow slice
182
 
183
+ On a 104-task HotpotQA shadow slice that was never touched during selection,
184
+ Quotebound raw improved quote-faithful behavior over the prior bridge model.
185
+ Quotebound plus `deterministic_v3` matched bridge plus `deterministic_v3` at
186
+ the system level. This surface is reported as a narrative parity result because
187
+ the freeze memo does not publish per-metric cells for it.
 
188
 
189
  ## Release architecture
190
 
191
+ The project ends in two finished results that are intentionally reported
192
+ separately:
193
+
194
+ | Result | What it is | What it proves |
195
+ |---|---|---|
196
+ | **Quotebound 27B** | The downloadable LoRA adapter on this page. | More of the evidence-faithful behavior moved into the model itself, with gains across non-`probe_v0` surfaces. |
197
+ | **Benchmark-winning hybrid stack** | A trained bridge checkpoint plus the `deterministic_v3` packet-local quote normalizer. | The full system clears every gate of the strict contract on frozen held-out `probe_v0`. |
198
+
199
+ These are connected, but they are not the same claim. Quotebound 27B is the
200
+ standalone model release. The hybrid stack is the benchmark-facing winner.
201
+ Perfect `probe_v0` belongs to the hybrid stack, not to this adapter alone.
202
 
203
  ## Intended use
204
 
205
+ Use this release when answers must stay inside a fixed body of supplied text:
206
 
207
  - bounded document QA with explicit evidence requirements,
208
+ - claim verification over closed packets of source text,
209
+ - policy, compliance, contract, and internal-document review where answers
210
+ need source-text support,
211
+ - research on evidence-faithful reasoning, quote fidelity, and abstention.
212
 
213
  ## Limitations
214
 
215
+ - This is not a general chatbot. Open-domain QA, open chat, and free-form
216
+ generation outside the closed-packet setup are not characterized.
217
+ - The downloadable artifact is the LoRA adapter only; the 27B base model is
218
+ required.
219
+ - `deterministic_v3` is not shipped as part of this model repo. It is a
220
+ separate packet-local post-processing step in the project repository.
221
+ - Perfect `probe_v0` belongs to the benchmark-winning hybrid stack, not to this
222
+ adapter alone.
223
+ - Raw item-level contents of the frozen held-out probe are intentionally not
224
+ published; the held-out gate has to stay closed to remain meaningful.
225
+ - For high-stakes use, treat the model as an evidence-grounding component that
226
+ still requires human review and application-specific validation.
227
 
228
+ ## Read next
229
+
230
+ - [Technical note](./technical_note_evidence_faithful_reasoning.md) - full
231
+ method, release boundary, and result discussion.
232
+ - [Frozen benchmark progression chart](./assets/benchmark_progression.svg)
233
+ - [Fresh holdout comparison chart](./assets/standalone_holdout_comparison.svg)
234
+
235
+ ## Citation
236
 
237
  ```bibtex
238
  @misc{quotebound_27b_2026,
 
243
  url = {https://huggingface.co/darcar0/quotebound-27b}
244
  }
245
  ```
246
+
247
+ ## References
248
+
249
+ - Base model:
250
+ [Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2](https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2)
251
+ - Datasets:
252
+ [fever/fever](https://huggingface.co/datasets/fever/fever),
253
+ [hotpotqa/hotpot_qa](https://huggingface.co/datasets/hotpotqa/hotpot_qa)
254
+ - Technical note:
255
+ [technical_note_evidence_faithful_reasoning.md](./technical_note_evidence_faithful_reasoning.md)