🧬 Midicoth: diffusion-based lossless compression. No neural net, no GPU, no training data
What if reverse diffusion could compress text without a neural network? Midicoth brings score-based denoising into classical compression. It treats prior smoothing as forward noise and reverses it with Tweedie's formula on a binary tree: 3 denoising steps, James-Stein shrinkage, applied after all model blending. ~2,000 lines of C, single CPU core.
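One way to read "prior smoothing as forward noise" is sketched below, assuming the smoothing is plain additive (Laplace-style) smoothing of symbol counts; the post does not say which prior Midicoth uses, and the function names and counts here are hypothetical. Additive smoothing is exactly a convex mix of the empirical frequencies with the uniform distribution, so it pulls every estimate toward uniform by a known weight. This is not Midicoth's denoiser; it only shows the shrink that a denoising step would try to undo.

```python
def additive_smoothing(counts, alpha):
    """Smoothed estimate (c_i + alpha) / (N + k*alpha) for each symbol."""
    n, k = sum(counts), len(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]


def shrink_weight(n, k, alpha):
    """Weight placed on the uniform distribution by additive smoothing."""
    return k * alpha / (n + k * alpha)


def as_mixture(counts, alpha):
    """Same estimate, written as (1 - lam) * empirical + lam * uniform.

    Requires at least one observation (sum(counts) > 0).
    """
    n, k = sum(counts), len(counts)
    lam = shrink_weight(n, k, alpha)
    return [(1 - lam) * c / n + lam / k for c in counts]


# Hypothetical counts at one binary node: 6 zeros, 2 ones.
counts = [6, 2]
smoothed = additive_smoothing(counts, alpha=1.0)
mixture = as_mixture(counts, alpha=1.0)
lam = shrink_weight(sum(counts), len(counts), alpha=1.0)
```

With few observations the weight `lam` is large and the estimate sits close to uniform; as counts grow, `lam` goes to zero and the smoothing fades out.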
Beats every dictionary compressor we tested:
- enwik8 (100 MB) → 1.753 bpb (-11.9% vs xz, -15% vs Brotli, -24.5% vs bzip2)
- alice29.txt → 2.119 bpb (-16.9% vs xz)
- Outperforms xz, zstd, Brotli, bzip2, gzip on all inputs
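For readers who want to turn the bpb figures into file sizes, here is a small sketch. It assumes bpb means compressed bits per original byte, that enwik8 is exactly 10^8 bytes, and that a percentage like "-11.9% vs xz" is a relative reduction in bpb; the helper names are made up for illustration.

```python
ENWIK8_BYTES = 100_000_000  # assumption: enwik8 is exactly 10^8 bytes


def bits_per_byte(compressed_bytes, original_bytes):
    """Compressed bits spent per byte of original input."""
    return 8.0 * compressed_bytes / original_bytes


def compressed_size(bpb, original_bytes):
    """Compressed size in bytes implied by a bpb figure."""
    return bpb * original_bytes / 8.0


def implied_baseline_bpb(bpb, relative_change):
    """Baseline bpb implied by a relative change, e.g. -0.119 for -11.9%."""
    return bpb / (1.0 + relative_change)


midicoth_size = compressed_size(1.753, ENWIK8_BYTES)   # about 21.9 MB
xz_bpb = implied_baseline_bpb(1.753, -0.119)           # about 1.99 bpb
print(f"Midicoth on enwik8: {midicoth_size / 1e6:.2f} MB")
print(f"Implied xz baseline: {xz_bpb:.3f} bpb")
```

The xz number here is only what the post's own percentages imply, not an independent measurement.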
PAQ/CMIX still win with hundreds of models + LSTMs. LLM compressors win with pre-trained knowledge. Midicoth narrows the gap with pure statistics: no mixer, no gradient descent, just counting.
The Tweedie denoising layer improves compression by 2.3 to 2.7% on every file tested, making it the most consistent component in the ablation. Adding SSE or logistic mixers made things worse. In the online setting, count-based beats gradient-based.
No external dependencies. Fully deterministic. Bit-exact encode/decode. ~60 KB/s throughput.
Code: https://github.com/robtacconelli/midicoth
Paper: Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation (2603.08771)
Space: robtacconelli/midicoth
If you ever wondered whether diffusion ideas belong in data compression, here's proof they do. ⭐ appreciated!
Nacrith: a 135M model that out-compresses everything on natural language
What if a tiny LM could compress English text better than _every_ compressor out there, classical or neural, small or large?
Nacrith pairs SmolLM2-135M with an ensemble of online predictors and high-precision arithmetic coding.
What's inside
The standard LLM+arithmetic coding approach wastes ~75% of CDF precision on large vocabularies. Our CDF-24 fix alone recovers 0.5 bpb. On top: a token N-gram that skips the GPU on predictable tokens, an adaptive bias head, llama.cpp backend (7× faster than PyTorch), multi-GPU parallel compression, and a binary file format (NC06), the first LLM-based binary compressor we know of.
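A rough sketch of where a figure like ~75% can come from, under assumptions the post does not spell out: a 16-bit integer CDF total, a vocabulary of 49,152 tokens (SmolLM2's vocabulary size as far as I know), and a minimum count of 1 per token so that every token stays encodable. The function names are hypothetical and this is not Nacrith's actual code; it only shows how the per-token floor eats the range and how 24 bits makes that cost negligible.

```python
def floor_overhead(vocab_size, precision_bits):
    """Fraction of the integer CDF range used up by a minimum count of 1 per token."""
    total = 1 << precision_bits
    if vocab_size > total:
        raise ValueError("vocabulary does not fit in the CDF range")
    return vocab_size / total


def quantize(probs, precision_bits):
    """Integer frequencies, each >= 1, summing to 2**precision_bits.

    Assumes probs are non-negative and sum to (about) 1.
    """
    total = 1 << precision_bits
    free = total - len(probs)            # range left after the per-token floor
    freqs = [1 + int(p * free) for p in probs]
    # Rounding down leaves a small remainder; give it to the likeliest token.
    top = max(range(len(probs)), key=probs.__getitem__)
    freqs[top] += total - sum(freqs)
    return freqs


VOCAB = 49_152  # assumption: SmolLM2 vocabulary size
print(f"16-bit CDF: {floor_overhead(VOCAB, 16):.1%} of the range goes to floors")
print(f"24-bit CDF: {floor_overhead(VOCAB, 24):.2%} of the range goes to floors")
```

Under these assumptions the 16-bit case leaves only a quarter of the range to carry the model's actual probabilities, which is why confident predictions cost more bits than they should.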
Runs on a GTX 1050 Ti. ~500 MB weights, ~1.2 GB VRAM per worker.
Try it, break it, share your results: all feedback welcome. ⭐ on the repo appreciated!
Results across all systems we tested:
- alice29.txt → 0.918 bpb (-44% vs CMIX, -20% vs ts_zip), below the 2nd-order Shannon entropy bound
- enwik8 (100 MB) → 0.9389 bpb (-8% vs FineZip/LLMZip's 8B model, -15% vs ts_zip)
- Unseen text → 0.723 bpb on a doc published after training cutoff, so no memorization; 26% better than FineZip/LLMZip on the same model
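To check a claim like "below the 2nd-order Shannon entropy bound" on your own files, the sketch below computes an empirical order-k conditional entropy in bits per byte. It assumes the phrase refers to the entropy of each byte given the two preceding bytes, estimated from the file itself; the post does not define it, and the function name is made up here.

```python
import math
from collections import Counter


def order_k_entropy(data: bytes, k: int) -> float:
    """Empirical entropy of each byte given the previous k bytes, in bits per byte."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(k, len(data)):
        context = data[i - k:i]
        context_counts[context] += 1
        joint_counts[(context, data[i])] += 1
    n = len(data) - k
    if n <= 0:
        return 0.0
    return -sum(
        count / n * math.log2(count / context_counts[context])
        for (context, _), count in joint_counts.items()
    )


sample = b"abababab"
print(order_k_entropy(sample, 0))  # each byte on its own
print(order_k_entropy(sample, 1))  # each byte given the previous one
```

Going below this number is possible because it only bounds static models that look at exactly k bytes of context; a model with longer context or pre-trained knowledge is not limited by it.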