Gradience: Measuring What Your LoRA Adapter Actually Learned
How spectral auditing reveals over-provisioned adapters—and why compressing them often improves accuracy.
The Problem Nobody Talks About
You fine-tuned a model with LoRA. You picked rank 32 because... that's what the tutorial used? Or maybe rank 64 because the task seemed hard?
After training, can you answer a simple question: did the adapter actually use the rank you gave it?
If you're like most practitioners, you can't. You have a loss curve that went down. You have eval metrics that look reasonable. But you have no visibility into whether your adapter is efficiently using its capacity—or whether you allocated a 64-seat bus for 8 passengers.
We built Gradience to make this measurable.
The Core Insight: Constrained Updates Generalize Better
This isn't a new idea. It shows up across machine learning theory under different names:
- Minimum Description Length: simpler models that fit the data are preferred
- PAC-Bayes bounds: generalization error includes a term for how far you moved from your prior
- Flat minima: solutions in wide basins of the loss landscape generalize better than sharp minima
- Information bottleneck: compression in intermediate representations improves generalization
LoRA already embodies this principle. By restricting weight updates to a low-rank subspace (ΔW = BA, where B and A are learned factors with inner dimension r, so ΔW has rank at most r), you're constraining the adaptation. The question is: how tight should that constraint be?
If the true task-relevant update lives in an 8-dimensional subspace, but you allocated rank 64, you've given the optimizer 56 extra dimensions to potentially memorize noise. The constraint is looser than it needs to be.
Gradience measures this gap.
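For a sense of what that allocation costs: LoRA adds r · (d_in + d_out) parameters per adapted matrix, so halving the rank halves the adapter. A quick calculation, using a 4096×4096 projection purely as a round, hypothetical example:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

d = 4096  # a round, hypothetical projection size; not tied to any model in this post
print(lora_params(d, d, 64))  # 524288 adapter parameters at rank 64
print(lora_params(d, d, 32))  # 262144 at rank 32: half the rank, half the adapter
```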
What Gradience Measures
After training a LoRA adapter, Gradience performs a spectral audit of the learned weight matrices. The key metrics:
Stable Rank
For a matrix M with singular values σ₁ ≥ σ₂ ≥ ... ≥ σᵣ:
stable_rank(M) = ||M||²_F / ||M||²₂ = (Σσᵢ²) / σ₁²
This measures the "effective dimensionality" of the matrix—how many singular directions carry meaningful energy. A rank-64 matrix where most energy concentrates in 8 directions has stable rank ≈ 8.
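For concreteness, here is a minimal NumPy sketch (not Gradience's implementation) that builds a synthetic update with allocated rank 64 but only 8 strong directions, then computes its stable rank. The dimensions and singular values are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 768, 768, 64  # example dimensions, chosen only for illustration

# Synthetic update with allocated rank 64 but energy concentrated in 8 directions.
U, _ = np.linalg.qr(rng.normal(size=(d, r)))                 # orthonormal left directions
V, _ = np.linalg.qr(rng.normal(size=(k, r)))                 # orthonormal right directions
sigma = np.concatenate([np.ones(8), np.full(r - 8, 0.05)])   # 8 strong, 56 weak directions
delta_w = (U * sigma) @ V.T                                  # ΔW with these singular values

def stable_rank(m: np.ndarray) -> float:
    """||M||_F^2 / ||M||_2^2 = (sum of σ_i^2) / σ_1^2."""
    s = np.linalg.svd(m, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

print(f"allocated rank: {r}, stable rank: {stable_rank(delta_w):.1f}")
# allocated rank: 64, stable rank: 8.1 -- nominally rank 64, effectively ~8-dimensional
```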
Energy Rank (k@90%)
The number of singular values needed to capture 90% of the matrix's energy. This gives a concrete compression target: if k@90% = 12, you could likely reduce to rank 16 without losing much.
Utilization
utilization = stable_rank / allocated_rank
If you trained at rank 64 and stable rank is 16, utilization is 0.25. You're using a quarter of your allocated capacity.
Low utilization = compression candidate.
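Continuing the same sketch, energy rank and utilization fall straight out of the singular values. The helper names here are made up for the example; only the definitions match the text above:

```python
import numpy as np

def energy_rank(m: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest k such that the top-k singular values carry `threshold` of the energy."""
    s = np.linalg.svd(m, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, threshold) + 1)

def utilization(m: np.ndarray, allocated_rank: int) -> float:
    """stable_rank / allocated_rank: the fraction of allocated capacity actually used."""
    s = np.linalg.svd(m, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2 / allocated_rank)

# With delta_w from the stable-rank sketch above (allocated rank 64, ~8 strong directions):
#   energy_rank(delta_w)     -> 8      (so rank 16 would be a comfortable target)
#   utilization(delta_w, 64) -> ~0.13  (low utilization: a compression candidate)
```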
Gain Metrics (v0.7.0)
We recently added magnitude analysis inspired by the mHC paper on training stability:
- ||ΔW||₂: Operator norm (maximum singular value of the update)
- ||ΔW||_F / ||W||_F: Relative perturbation compared to base weights
- Energy concentration (HHI): A single number capturing whether adaptation is spread evenly across layers or concentrated in a few. High concentration (HHI > 0.25) suggests a few layers are doing most of the work—worth inspecting if something goes wrong.
These catch a different failure mode: an adapter might have reasonable utilization but take disproportionately large steps in a few layers.
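Concretely, under the definitions above these metrics reduce to a few norms. The sketch below assumes HHI is the usual sum of squared shares, taken over per-layer update energy (||ΔW_l||_F²); the per-layer dictionary layout is just for illustration:

```python
import numpy as np

def gain_metrics(delta_w: np.ndarray, base_w: np.ndarray) -> dict:
    """Magnitude diagnostics for one layer's update against its base weight."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return {
        "operator_norm": float(s[0]),                         # ||ΔW||_2
        "relative_perturbation": float(
            np.linalg.norm(delta_w) / np.linalg.norm(base_w)  # ||ΔW||_F / ||W||_F
        ),
    }

def energy_concentration(per_layer_deltas: dict) -> float:
    """HHI over per-layer update energy: the sum of squared energy shares."""
    energies = np.array([np.linalg.norm(d) ** 2 for d in per_layer_deltas.values()])
    shares = energies / energies.sum()
    return float((shares ** 2).sum())

# Spread evenly across L layers, the HHI is about 1/L; a single dominant layer pushes it
# toward 1. Values above ~0.25 are the "worth inspecting" zone mentioned above.
```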
The Bench Protocol: From Hypothesis to Evidence
Audit metrics are hypotheses, not conclusions. Low utilization suggests compression is possible—it doesn't guarantee it.
Gradience Bench provides a validation protocol:
1. Train probe adapter (generous rank, e.g., r=64)
2. Audit → get rank suggestions (median, p90, per-layer)
3. Retrain compressed variants at suggested ranks
4. Evaluate on held-out data
5. Aggregate across multiple seeds
6. Apply safety policy (e.g., worst-seed Δ ≥ -2.5%)
The output is an artifact you can attach to a PR: "Compression from r=64 to r=32 validated across 3 seeds with worst-case accuracy drop of -1.5%, within policy threshold."
This is what separates Gradience from "I tried a smaller rank and it seemed fine." You get reproducible evidence, not vibes.
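The policy check itself is simple arithmetic over per-seed deltas. The sketch below is illustrative only (the function and field names are not Gradience Bench's API), but it implements the worst-seed rule described above, using the GSM8K numbers reported in the next section:

```python
def passes_policy(probe_acc: dict, variant_acc: dict, min_worst_delta: float = -2.5) -> dict:
    """Compare a compressed variant to the probe, seed by seed, under a worst-seed policy."""
    deltas = {seed: variant_acc[seed] - probe_acc[seed] for seed in probe_acc}
    worst = min(deltas.values())
    return {
        "per_seed_delta": deltas,
        "worst_seed_delta": worst,
        "mean_delta": sum(deltas.values()) / len(deltas),
        "passes": worst >= min_worst_delta,  # e.g. worst-seed Δ ≥ -2.5 points
    }

probe = {42: 27.0, 123: 29.5, 456: 29.5}        # accuracy (%) per seed
uniform_p90 = {42: 36.0, 123: 32.5, 456: 32.5}
print(passes_policy(probe, uniform_p90))
# worst-seed Δ = +3.0, mean Δ = +5.0 -> passes comfortably
```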
Main Results: Mistral-7B + GSM8K
We ran a full validation on Mistral-7B fine-tuned for mathematical reasoning (GSM8K dataset, exact-match evaluation).
Setup
- Probe: r=64, 1200 training steps
- Audit suggestion: Utilization indicated r=32 was sufficient
- Validation: 3 seeds (42, 123, 456)
- Policy: worst-seed accuracy Δ ≥ -2.5 points (no seed may lose more than 2.5 points)
Results
| Variant | Seed 42 | Seed 123 | Seed 456 | Mean Accuracy | Compression |
|---|---|---|---|---|---|
| probe (r=64) | 27.0% | 29.5% | 29.5% | 28.7% | — |
| uniform_median (r=32) | 31.0% | 28.0% | 27.5% | 28.8% | 50% |
| uniform_p90 (r=32) | 36.0% | 32.5% | 32.5% | 33.7% | 50% |
| per_layer | 34.0% | 32.0% | 29.5% | 31.8% | ~3% |
Key Findings
1. The probe was over-provisioned. Every compressed variant matched or exceeded the probe's mean accuracy, and the best variant beat it on all three seeds. More capacity didn't help; it hurt.
2. uniform_p90 won decisively. The conservative audit suggestion (p90 rather than median) paired with uniform allocation across layers yielded:
- 50% parameter reduction
- +5 points mean accuracy over probe
- Higher accuracy than the probe on all three seeds
3. Compression acted as regularization. This wasn't just "maintained accuracy with fewer parameters." Accuracy improved. The tighter constraint prevented the model from fitting noise in the fine-tuning data.
4. The median suggestion was too aggressive. It passed policy on all seeds but showed higher variance, including accuracy drops on 2/3 seeds. The p90 suggestion provided a better margin.
The Ablation: Does Placement Matter?
We ran an additional experiment to understand whether where you allocate rank matters, or just how much total rank you allocate.
Design
Three per-layer configurations with the same total parameter budget:
| Variant | Description |
|---|---|
| per_layer | Audit-guided rank allocation (more rank where utilization is higher) |
| per_layer_shuffled | Same rank distribution, randomly assigned to layers |
| uniform | Same rank everywhere |
If guided placement helps, per_layer should beat shuffled. If only total capacity matters, they should be similar.
Results
| Variant | Seed 42 | Seed 123 | Seed 456 | Mean Δ vs Probe | Std |
|---|---|---|---|---|---|
| per_layer | +7.0 | +2.5 | +0.0 | +3.2 | 3.5 |
| per_layer_shuffled | +8.0 | +0.0 | -0.5 | +2.5 | 4.7 |
Interpretation
Guided placement shows a small advantage (+0.7 mean) and lower variance (std 3.5 vs 4.7). But with only 3 seeds, this isn't statistically conclusive.
The practical upshot: uniform allocation is good enough for most cases. Per-layer allocation may reduce variance, but it's not a clear win on mean accuracy. Use uniform_p90 unless you have specific reasons to allocate per-layer.
Cross-Scale Validation
To ensure this isn't a Mistral-specific or GSM8K-specific finding, we validated on a different scale and task:
| Model | Task | Compression | Accuracy Impact |
|---|---|---|---|
| DistilBERT (66M) | SST-2 (sentiment) | 61% | Within tolerance |
| Mistral-7B (7B) | GSM8K (math) | 50% | +5 points mean |
The protocol transfers across:
- Two orders of magnitude in model size
- Encoder vs. decoder architecture
- Classification vs. generation task
Comparison to Standard Guidance
The common advice for LoRA rank is: "Start with small values (4-8) and scale up if needed."
This is reasonable when you have no visibility into adapter efficiency. But it has problems:
"Start small" is unfalsifiable from below. If you train at r=8 and get 80% accuracy, you don't know if r=4 would have worked or if r=8 is leaving performance on the table.
"Scale up if needed" provides no direction. You're searching blindly.
Training loss ≠ structural efficiency. A rank-64 and rank-8 adapter can both achieve low training loss on the same task. Loss tells you optimization succeeded; it doesn't tell you if you over-allocated.
Gradience inverts the workflow:
| Standard Approach | Gradience Approach |
|---|---|
| Start small, scale up if it fails | Start generous, measure what you used, compress |
| Search blindly | Audit → targeted experiments |
| Hope you stop at the right rank | Measure utilization directly |
| No structural insight | Spectral analysis explains behavior |
An over-provisioned adapter reveals its inefficiency through low utilization. You can see that you're running a 64-seat bus with 16 passengers. Then you test the compression and verify it holds.
Connection to Broader Research
Gradience fits into a family of methods exploring capacity under constraint.
The recent mHC paper from DeepSeek addresses a related problem for residual connections: unconstrained learnable mixing matrices can cause training instability at scale. Their solution is to project the mixing matrices onto a stability-preserving manifold (doubly stochastic matrices).
The parallel:
| Project | Domain | Pathology | Solution |
|---|---|---|---|
| mHC | Residual connections | Training instability (exploding/vanishing signals) | Manifold projection during training |
| Gradience | LoRA adapters | Poor generalization (memorization, overfitting) | Post-hoc audit + compression |
Both projects share a core insight: more capacity without appropriate constraint is not a win. mHC enforces constraints during training; Gradience measures them after training and feeds that information back into the next run.
The gain metrics we added in v0.7.0 (operator norm, energy concentration) are directly inspired by mHC's stability diagnostics.
Practical Recommendations
Based on our validation, here's what we recommend:
For most fine-tuning tasks:
- Train a probe adapter at generous rank (r=32 or r=64)
- Run the audit: gradience audit --peft-dir ./adapter
- Look at utilization: if it's below 0.4, you likely have compression headroom
- Use the p90 rank suggestion with uniform allocation
- Validate on held-out data before deploying
When to use per-layer allocation:
- If you have strong prior belief that certain layers need more capacity
- If you're optimizing for variance reduction rather than mean accuracy
- If you're in a research context exploring adapter structure
Red flags in the audit (a small triage sketch follows this list):
- Utilization < 0.2: Severely over-provisioned; likely compressible by 50%+
- High energy concentration (HHI > 0.25): A few layers dominate; inspect those layers
- Large relative perturbation in specific layers: Potential instability; consider regularization
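As promised above, here is a small triage sketch that applies these thresholds to audit output. The input layout and function name are hypothetical, not Gradience's CLI or API; only the cutoffs come from this section:

```python
def triage(metrics: dict, perturbation_threshold: float | None = None) -> list[str]:
    """Flag audit results against the rules of thumb above; `metrics` layout is hypothetical."""
    flags = []
    if metrics["utilization"] < 0.2:
        flags.append("severely over-provisioned: likely compressible by 50%+")
    elif metrics["utilization"] < 0.4:
        flags.append("compression headroom: try the p90 suggestion and validate")
    if metrics["energy_concentration_hhi"] > 0.25:
        flags.append("a few layers dominate the update energy: inspect those layers")
    # No numeric cutoff is given above for "large relative perturbation", so the caller
    # must supply one to enable that check.
    if perturbation_threshold is not None and metrics["max_relative_perturbation"] > perturbation_threshold:
        flags.append("large relative perturbation in some layer: consider regularization")
    return flags

# Example: a rank-64 adapter with stable rank ~10 and concentrated energy
print(triage({"utilization": 10 / 64, "energy_concentration_hhi": 0.31}))
```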
Installation and Usage
pip install gradience
Basic audit:
gradience audit --peft-dir ./your-adapter
HuggingFace Trainer integration:
from transformers import Trainer
from gradience import GradienceCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[GradienceCallback()],  # attach Gradience alongside your usual Trainer setup
)
Full Bench protocol:
python -m gradience.bench.run_bench --config your_config.yaml
python -m gradience.bench.aggregate --run-dir ./bench_runs/your_experiment
What Gradience Doesn't Do
It's not AutoML. It won't tune your learning rate or training schedule.
It's not an oracle. Audit suggestions are hypotheses. Evaluation is the only arbiter.
It's not a replacement for held-out evaluation. It accelerates the hypothesis → test loop. It doesn't skip it.
It can't rescue a fundamentally broken setup. If your task is misspecified or your data is bad, compression won't save you.
Limitations and Caveats
QLoRA complicates interpretation. Under quantization, adapters may compensate for quantization error, inflating utilization. We flag quantized runs and recommend cautious interpretation.
Task-specific compression ratios. The 50% we achieved on GSM8K isn't universal. Some tasks may require higher rank; some may compress more. Always validate.
Seed variance is real. Our Mistral results show meaningful variance across seeds (e.g., uniform_median went from +4 to -2 points). Always run multiple seeds for important decisions.
Per-layer allocation isn't magic. Our ablation suggests uniform allocation is usually sufficient. Don't add complexity without evidence it helps.
What's Next
We're continuing to extend validation coverage:
- Instruction tuning: Testing on Alpaca-style data (the dominant LoRA use case)
- Code generation: HumanEval validation (different task structure)
- Failure boundary mapping: Finding where compression clearly fails (how aggressive is too aggressive?)
- Guard experiments: Testing whether norm constraints during training reduce pathological runs
The goal is a tool that's useful across the range of fine-tuning scenarios practitioners actually encounter—not just the benchmarks that are easy to publish.
Links
- Code: github.com/johntnanney/gradience
- Documentation: See THEORY.md and METRICS_GUIDE.md in the repo
- License: Apache 2.0
If you're fine-tuning with LoRA and wondering whether your rank is right, Gradience gives you a way to check—and a protocol to validate compression before deploying it.
Not vibes. Evidence.
Citation
@software{gradience2025,
  title  = {Gradience: Spectral Auditing for LoRA Compression},
  author = {Nanney, John T.},
  year   = {2025},
  url    = {https://github.com/johntnanney/gradience}
}
Acknowledgments
The theoretical framing draws on work in MDL, PAC-Bayes generalization bounds, and the information bottleneck principle. The gain metrics were inspired by DeepSeek's mHC paper on training stability. Thanks to the HuggingFace PEFT team for making LoRA accessible to practitioners.
