Gradience: Measuring What Your LoRA Adapter Actually Learned
How spectral auditing reveals over-provisioned adapters—and why compressing them often improves accuracy.
The Problem Nobody Talks About
You fine-tuned a model with LoRA. You picked rank 32 because... that's what the tutorial used? Or maybe rank 64 because the task seemed hard?
After training, can you answer a simple question: did the adapter actually use the rank you gave it?
If you're like most practitioners, you can't. You have a loss curve that went down. You have eval metrics that look reasonable. But you have no visibility into whether your adapter is efficiently using its capacity—or whether you allocated a 64-seat bus for 8 passengers.
We built Gradience to make this measurable.
The Core Insight: Constrained Updates Generalize Better
This isn't a new idea. It shows up across machine learning theory under different names:
- Minimum Description Length: simpler models that fit the data are preferred
- PAC-Bayes bounds: generalization error includes a term for how far you moved from your prior
- Flat minima: solutions in wide basins of the loss landscape generalize better than sharp minima
- Information bottleneck: compression in intermediate representations improves generalization
LoRA already embodies this principle. By restricting weight updates to a low-rank subspace (ΔW = BA, where B and A are learned factors with inner dimension r, so ΔW has rank at most r), you're constraining the adaptation. The question is: how tight should that constraint be?
If the true task-relevant update lives in an 8-dimensional subspace, but you allocated rank 64, you've given the optimizer 56 extra dimensions to potentially memorize noise. The constraint is looser than it needs to be.
Gradience measures this gap.
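For a sense of what that allocation costs: LoRA adds r · (d_in + d_out) parameters per adapted matrix, so halving the rank halves the adapter. A quick calculation, using a 4096×4096 projection purely as a round, hypothetical example:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

d = 4096  # a round, hypothetical projection size; not tied to any model in this post
print(lora_params(d, d, 64))  # 524288 adapter parameters at rank 64
print(lora_params(d, d, 32))  # 262144 at rank 32: half the rank, half the adapter
```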
What Gradience Measures
After training a LoRA adapter, Gradience performs a spectral audit of the learned weight matrices. The key metrics:
Stable Rank
For a matrix M with singular values σ₁ ≥ σ₂ ≥ ... ≥ σᵣ:
stable_rank(M) = ||M||²_F / ||M||²₂ = (Σσᵢ²) / σ₁²
This measures the "effective dimensionality" of the matrix—how many singular directions carry meaningful energy. A rank-64 matrix where most energy concentrates in 8 directions has stable rank ≈ 8.
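For concreteness, here is a minimal NumPy sketch (not Gradience's implementation) that builds a synthetic update with allocated rank 64 but only 8 strong directions, then computes its stable rank. The dimensions and singular values are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 768, 768, 64  # example dimensions, chosen only for illustration

# Synthetic update with allocated rank 64 but energy concentrated in 8 directions.
U, _ = np.linalg.qr(rng.normal(size=(d, r)))                 # orthonormal left directions
V, _ = np.linalg.qr(rng.normal(size=(k, r)))                 # orthonormal right directions
sigma = np.concatenate([np.ones(8), np.full(r - 8, 0.05)])   # 8 strong, 56 weak directions
delta_w = (U * sigma) @ V.T                                  # ΔW with these singular values

def stable_rank(m: np.ndarray) -> float:
    """||M||_F^2 / ||M||_2^2 = (sum of σ_i^2) / σ_1^2."""
    s = np.linalg.svd(m, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

print(f"allocated rank: {r}, stable rank: {stable_rank(delta_w):.1f}")
# allocated rank: 64, stable rank: 8.1 -- nominally rank 64, effectively ~8-dimensional
```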
Energy Rank (k@90%)
The number of singular values needed to capture 90% of the matrix's energy. This gives a concrete compression target: if k@90% = 12, you could likely reduce to rank 16 without losing much.
Utilization
utilization = stable_rank / allocated_rank
If you trained at rank 64 and stable rank is 16, utilization is 0.25. You're using a quarter of your allocated capacity.
Low utilization = compression candidate.
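Continuing the same sketch, energy rank and utilization fall straight out of the singular values. The helper names here are made up for the example; only the definitions match the text above:

```python
import numpy as np

def energy_rank(m: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest k such that the top-k singular values carry `threshold` of the energy."""
    s = np.linalg.svd(m, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, threshold) + 1)

def utilization(m: np.ndarray, allocated_rank: int) -> float:
    """stable_rank / allocated_rank: the fraction of allocated capacity actually used."""
    s = np.linalg.svd(m, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2 / allocated_rank)

# With delta_w from the stable-rank sketch above (allocated rank 64, ~8 strong directions):
#   energy_rank(delta_w)     -> 8      (so rank 16 would be a comfortable target)
#   utilization(delta_w, 64) -> ~0.13  (low utilization: a compression candidate)
```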
Gain Metrics (v0.7.0)
We recently added magnitude analysis inspired by the mHC paper on training stability:
- ||ΔW||₂: Operator norm (maximum singular value of the update)
- ||ΔW||_F / ||W||_F: Relative perturbation compared to base weights
- Energy concentration (HHI): A single number capturing whether adaptation is spread evenly across layers or concentrated in a few. High concentration (HHI > 0.25) suggests a few layers are doing most of the work—worth inspecting if something goes wrong.
These catch a different failure mode: an adapter might have reasonable utilization but take disproportionately large steps in a few layers.
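Concretely, under the definitions above these metrics reduce to a few norms. The sketch below assumes HHI is the usual sum of squared shares, taken over per-layer update energy (||ΔW_l||_F²); the per-layer dictionary layout is just for illustration:

```python
import numpy as np

def gain_metrics(delta_w: np.ndarray, base_w: np.ndarray) -> dict:
    """Magnitude diagnostics for one layer's update against its base weight."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return {
        "operator_norm": float(s[0]),                         # ||ΔW||_2
        "relative_perturbation": float(
            np.linalg.norm(delta_w) / np.linalg.norm(base_w)  # ||ΔW||_F / ||W||_F
        ),
    }

def energy_concentration(per_layer_deltas: dict) -> float:
    """HHI over per-layer update energy: the sum of squared energy shares."""
    energies = np.array([np.linalg.norm(d) ** 2 for d in per_layer_deltas.values()])
    shares = energies / energies.sum()
    return float((shares ** 2).sum())

# Spread evenly across L layers, the HHI is about 1/L; a single dominant layer pushes it
# toward 1. Values above ~0.25 are the "worth inspecting" zone mentioned above.
```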
The Bench Protocol: From Hypothesis to Evidence
Audit metrics are hypotheses, not conclusions. Low utilization suggests compression is possible—it doesn't guarantee it.
Gradience Bench provides a validation protocol:
1. Train probe adapter (generous rank, e.g., r=64)
2. Audit → get rank suggestions (median, p90, per-layer)
3. Retrain compressed variants at suggested ranks
4. Evaluate on held-out data
5. Aggregate across multiple seeds
6. Apply safety policy (e.g., worst-seed Δ ≥ -2.5%)
The output is an artifact you can attach to a PR: "Compression from r=64 to r=32 validated across 3 seeds with worst-case accuracy drop of -1.5%, within policy threshold."
This is what separates Gradience from "I tried a smaller rank and it seemed fine." You get reproducible evidence, not vibes.
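The policy check itself is simple arithmetic over per-seed deltas. The sketch below is illustrative only (the function and field names are not Gradience Bench's API), but it implements the worst-seed rule described above, using the GSM8K numbers reported in the next section:

```python
def passes_policy(probe_acc: dict, variant_acc: dict, min_worst_delta: float = -2.5) -> dict:
    """Compare a compressed variant to the probe, seed by seed, under a worst-seed policy."""
    deltas = {seed: variant_acc[seed] - probe_acc[seed] for seed in probe_acc}
    worst = min(deltas.values())
    return {
        "per_seed_delta": deltas,
        "worst_seed_delta": worst,
        "mean_delta": sum(deltas.values()) / len(deltas),
        "passes": worst >= min_worst_delta,  # e.g. worst-seed Δ ≥ -2.5 points
    }

probe = {42: 27.0, 123: 29.5, 456: 29.5}        # accuracy (%) per seed
uniform_p90 = {42: 36.0, 123: 32.5, 456: 32.5}
print(passes_policy(probe, uniform_p90))
# worst-seed Δ = +3.0, mean Δ = +5.0 -> passes comfortably
```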
Main Results: Mistral-7B + GSM8K
We ran a full validation on Mistral-7B fine-tuned for mathematical reasoning (GSM8K dataset, exact-match evaluation).
Setup
- Probe: r=64, 1200 training steps
- Audit suggestion: Utilization indicated r=32 was sufficient
- Validation: 3 seeds (42, 123, 456)
- Policy: worst-seed accuracy Δ ≥ -2.5 points (no seed may lose more than 2.5 points)
Results
| Variant | Seed 42 | Seed 123 | Seed 456 | Mean Accuracy | Compression |
|---|---|---|---|---|---|
| probe (r=64) | 27.0% | 29.5% | 29.5% | 28.7% | — |
| uniform_median (r=32) | 31.0% | 28.0% | 27.5% | 28.8% | 50% |
| uniform_p90 (r=32) | 36.0% | 32.5% | 32.5% | 33.7% | 50% |
| per_layer | 34.0% | 32.0% | 29.5% | 31.8% | ~3% |
Key Findings
1. The probe was over-provisioned. Every compressed variant matched or exceeded the probe's mean accuracy, and the best variant beat it on all three seeds. More capacity didn't help; it hurt.
2. uniform_p90 won decisively. The conservative audit suggestion (p90 rather than median) paired with uniform allocation across layers yielded:
- 50% parameter reduction
- +5 points mean accuracy over probe
- Higher accuracy than the probe on all three seeds
3. Compression acted as regularization. This wasn't just "maintained accuracy with fewer parameters." Accuracy improved. The tighter constraint prevented the model from fitting noise in the fine-tuning data.
4. The median suggestion was too aggressive. It passed policy on all seeds but showed higher variance, including accuracy drops on 2/3 seeds. The p90 suggestion provided a better margin.
The Ablation: Does Placement Matter?
We ran an additional experiment to understand whether where you allocate rank matters, or just how much total rank you allocate.
Design
Three per-layer configurations with the same total parameter budget:
| Variant | Description |
|---|---|
| per_layer | Audit-guided rank allocation (more rank where utilization is higher) |
| per_layer_shuffled | Same rank distribution, randomly assigned to layers |
| uniform | Same rank everywhere |
If guided placement helps, per_layer should beat shuffled. If only total capacity matters, they should be similar.
Results
| Variant | Seed 42 | Seed 123 | Seed 456 | Mean Δ vs Probe | Std |
|---|---|---|---|---|---|
| per_layer | +7.0 | +2.5 | +0.0 | +3.2 | 3.5 |
| per_layer_shuffled | +8.0 | +0.0 | -0.5 | +2.5 | 4.7 |
Interpretation
Guided placement shows a small advantage (+0.7 mean) and lower variance (std 3.5 vs 4.7). But with only 3 seeds, this isn't statistically conclusive.
The practical upshot: uniform allocation is good enough for most cases. Per-layer allocation may reduce variance, but it's not a clear win on mean accuracy. Use uniform_p90 unless you have specific reasons to allocate per-layer.
Cross-Scale Validation
To ensure this isn't a Mistral-specific or GSM8K-specific finding, we validated on a different scale and task:
| Model | Task | Compression | Accuracy Impact |
|---|---|---|---|
| DistilBERT (66M) | SST-2 (sentiment) | 61% | Within tolerance |
| Mistral-7B (7B) | GSM8K (math) | 50% | +5 points mean |
The protocol transfers across:
- Two orders of magnitude in model size
- Encoder vs. decoder architecture
- Classification vs. generation task
Comparison to Standard Guidance
The common advice for LoRA rank is: "Start with small values (4-8) and scale up if needed."
This is reasonable when you have no visibility into adapter efficiency. But it has problems:
"Start small" is unfalsifiable from below. If you train at r=8 and get 80% accuracy, you don't know if r=4 would have worked or if r=8 is leaving performance on the table.
"Scale up if needed" provides no direction. You're searching blindly.
Training loss ≠ structural efficiency. A rank-64 and rank-8 adapter can both achieve low training loss on the same task. Loss tells you optimization succeeded; it doesn't tell you if you over-allocated.
Gradience inverts the workflow:
| Standard Approach | Gradience Approach |
|---|---|
| Start small, scale up if it fails | Start generous, measure what you used, compress |
| Search blindly | Audit → targeted experiments |
| Hope you stop at the right rank | Measure utilization directly |
| No structural insight | Spectral analysis explains behavior |
An over-provisioned adapter reveals its inefficiency through low utilization. You can see that you're running a 64-seat bus with 16 passengers. Then you test the compression and verify it holds.
Connection to Broader Research
Gradience fits into a family of methods exploring capacity under constraint.
The recent mHC paper from DeepSeek addresses a related problem for residual connections: unconstrained learnable mixing matrices can cause training instability at scale. Their solution is to project the mixing matrices onto a stability-preserving manifold (doubly stochastic matrices).
The parallel:
| Project | Domain | Pathology | Solution |
|---|---|---|---|
| mHC | Residual connections | Training instability (exploding/vanishing signals) | Manifold projection during training |
| Gradience | LoRA adapters | Poor generalization (memorization, overfitting) | Post-hoc audit + compression |
Both projects share a core insight: more capacity without appropriate constraint is not a win. mHC enforces constraints during training; Gradience measures them after training and feeds that information back into the next run.
The gain metrics we added in v0.7.0 (operator norm, energy concentration) are directly inspired by mHC's stability diagnostics.
Practical Recommendations
Based on our validation, here's what we recommend:
For most fine-tuning tasks:
- Train a probe adapter at generous rank (r=32 or r=64)
- Run the audit: gradience audit --peft-dir ./adapter
- Look at utilization: if it's below 0.4, you likely have compression headroom
- Use the p90 rank suggestion with uniform allocation
- Validate on held-out data before deploying
When to use per-layer allocation:
- If you have strong prior belief that certain layers need more capacity
- If you're optimizing for variance reduction rather than mean accuracy
- If you're in a research context exploring adapter structure
Red flags in the audit (a small triage sketch follows this list):
- Utilization < 0.2: Severely over-provisioned; likely compressible by 50%+
- High energy concentration (HHI > 0.25): A few layers dominate; inspect those layers
- Large relative perturbation in specific layers: Potential instability; consider regularization
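As promised above, here is a small triage sketch that applies these thresholds to audit output. The input layout and function name are hypothetical, not Gradience's CLI or API; only the cutoffs come from this section:

```python
def triage(metrics: dict, perturbation_threshold: float | None = None) -> list[str]:
    """Flag audit results against the rules of thumb above; `metrics` layout is hypothetical."""
    flags = []
    if metrics["utilization"] < 0.2:
        flags.append("severely over-provisioned: likely compressible by 50%+")
    elif metrics["utilization"] < 0.4:
        flags.append("compression headroom: try the p90 suggestion and validate")
    if metrics["energy_concentration_hhi"] > 0.25:
        flags.append("a few layers dominate the update energy: inspect those layers")
    # No numeric cutoff is given above for "large relative perturbation", so the caller
    # must supply one to enable that check.
    if perturbation_threshold is not None and metrics["max_relative_perturbation"] > perturbation_threshold:
        flags.append("large relative perturbation in some layer: consider regularization")
    return flags

# Example: a rank-64 adapter with stable rank ~10 and concentrated energy
print(triage({"utilization": 10 / 64, "energy_concentration_hhi": 0.31}))
```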
Installation and Usage
pip install gradience
Basic audit:
gradience audit --peft-dir ./your-adapter
HuggingFace Trainer integration:
from transformers import Trainer
from gradience import GradienceCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[GradienceCallback()],  # attach Gradience alongside your usual Trainer setup
)
Full Bench protocol:
python -m gradience.bench.run_bench --config your_config.yaml
python -m gradience.bench.aggregate --run-dir ./bench_runs/your_experiment
What Gradience Doesn't Do
It's not AutoML. It won't tune your learning rate or training schedule.
It's not an oracle. Audit suggestions are hypotheses. Evaluation is the only arbiter.
It's not a replacement for held-out evaluation. It accelerates the hypothesis → test loop. It doesn't skip it.
It can't rescue a fundamentally broken setup. If your task is misspecified or your data is bad, compression won't save you.
Limitations and Caveats
QLoRA complicates interpretation. Under quantization, adapters may compensate for quantization error, inflating utilization. We flag quantized runs and recommend cautious interpretation.
Task-specific compression ratios. The 50% we achieved on GSM8K isn't universal. Some tasks may require higher rank; some may compress more. Always validate.
Seed variance is real. Our Mistral results show meaningful variance across seeds (e.g., uniform_median went from +4 to -2 points). Always run multiple seeds for important decisions.
Per-layer allocation isn't magic. Our ablation suggests uniform allocation is usually sufficient. Don't add complexity without evidence it helps.
What's Next
We're continuing to extend validation coverage:
- Instruction tuning: Testing on Alpaca-style data (the dominant LoRA use case)
- Code generation: HumanEval validation (different task structure)
- Failure boundary mapping: Finding where compression clearly fails (how aggressive is too aggressive?)
- Guard experiments: Testing whether norm constraints during training reduce pathological runs
The goal is a tool that's useful across the range of fine-tuning scenarios practitioners actually encounter—not just the benchmarks that are easy to publish.
Links
- Code: github.com/johntnanney/gradience
- Documentation: See THEORY.md and METRICS_GUIDE.md in the repo
- License: Apache 2.0
If you're fine-tuning with LoRA and wondering whether your rank is right, Gradience gives you a way to check—and a protocol to validate compression before deploying it.
Not vibes. Evidence.
Citation
@software{gradience2025,
  title  = {Gradience: Spectral Auditing for LoRA Compression},
  author = {Nanney, John T.},
  year   = {2025},
  url    = {https://github.com/johntnanney/gradience}
}
Acknowledgments
The theoretical framing draws on work in MDL, PAC-Bayes generalization bounds, and the information bottleneck principle. The gain metrics were inspired by DeepSeek's mHC paper on training stability. Thanks to the HuggingFace PEFT team for making LoRA accessible to practitioners.
