robtacconelli (Roberto Tacconelli)

liked a Space 3 days ago

Midicoth - Micro-Diffusion Compression

🗜

1

Lossless data compression via Binary Tree Tweedie Denoising

reacted to their post with 🤯🚀 4 days ago

Post

3653

🧬 Midicoth: diffusion-based lossless compression — no neural net, no GPU, no training data

What if reverse diffusion could compress text — without a neural network?
Midicoth brings score-based denoising into classical compression. It treats prior smoothing as forward noise and reverses it with Tweedie's formula on a binary tree — 3 denoising steps, James-Stein shrinkage, applied after all model blending. ~2,000 lines of C, single CPU core.
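For readers unfamiliar with the two named ingredients, their textbook forms are given below. This is background only: it assumes the classical Gaussian setting, and the way Midicoth adapts these to probabilities on a binary tree is described in the paper, not reproduced here.

```latex
% Tweedie's formula: observation x = \theta + \varepsilon,
% \varepsilon \sim \mathcal{N}(0, \sigma^2), with p(x) the marginal density of x.
\mathbb{E}[\theta \mid x] = x + \sigma^2 \, \frac{d}{dx} \log p(x)

% James-Stein shrinkage toward zero for x \in \mathbb{R}^d, d \ge 3,
% x \sim \mathcal{N}(\theta, \sigma^2 I):
\hat{\theta}_{\mathrm{JS}} = \left( 1 - \frac{(d-2)\,\sigma^2}{\lVert x \rVert^2} \right) x
```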

Beats every dictionary compressor we tested:
enwik8 (100 MB) → 1.753 bpb (−11.9% vs xz, −15% vs Brotli, −24.5% vs bzip2)
alice29.txt → 2.119 bpb (−16.9% vs xz)
Outperforms xz, zstd, Brotli, bzip2, and gzip on every input we tested
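
The bpb figures convert to file sizes and baseline numbers with simple arithmetic. A minimal sketch, assuming the percentages are relative reductions in bpb against each baseline (the post does not say so explicitly) and that enwik8 is exactly 10^8 bytes; the function names are illustrative and not from the Midicoth code:

```python
def bits_per_byte(compressed_bytes: float, original_bytes: float) -> float:
    """Compressed bits spent per original byte."""
    return 8.0 * compressed_bytes / original_bytes


def compressed_size(bpb: float, original_bytes: float) -> float:
    """Compressed size in bytes implied by a bpb figure."""
    return bpb * original_bytes / 8.0


def baseline_bpb(bpb: float, relative_change: float) -> float:
    """Baseline bpb implied by a relative change (ASSUMED to be relative
    to the baseline), e.g. -0.119 for "-11.9% vs xz"."""
    return bpb / (1.0 + relative_change)


ENWIK8_BYTES = 100_000_000

# Roughly 21.9 million bytes for enwik8 at 1.753 bpb.
midicoth_size = compressed_size(1.753, ENWIK8_BYTES)

# Roughly 1.99 bpb for xz, under the relative-reduction assumption.
xz_bpb = baseline_bpb(1.753, -0.119)
```

Under that reading, 1.753 bpb on enwik8 corresponds to about 21.9 MB of output, and the xz baseline sits near 1.99 bpb.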

PAQ/CMIX still win with hundreds of models + LSTMs. LLM compressors win with pre-trained knowledge. Midicoth closes the gap with pure statistics — no mixer, no gradient descent, just counting.
The Tweedie denoising layer improves compression by 2.3–2.7% on every file tested, making it the most consistent component in the ablation. Adding SSE or logistic mixers made things worse. In the online setting, count-based beats gradient-based.
No external dependencies. Fully deterministic. Bit-exact encode/decode. ~60 KB/s throughput.
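
To make "just counting" concrete, here is a minimal sketch of a count-based adaptive bit predictor and the code length it implies under an ideal arithmetic coder. It uses the Krichevsky-Trofimov estimator, a standard textbook choice swapped in purely for illustration. It is not Midicoth's model, and the function name is hypothetical.

```python
import math


def kt_code_length_bits(bits) -> float:
    """Ideal code length (in bits) of a 0/1 sequence under an adaptive
    Krichevsky-Trofimov estimator. ILLUSTRATIVE ONLY, not Midicoth's model."""
    n0 = 0  # zeros seen so far
    n1 = 0  # ones seen so far
    total = 0.0
    for b in bits:
        # The +0.5 prior keeps either outcome possible before any data
        # has been seen (a simple form of prior smoothing).
        p1 = (n1 + 0.5) / (n0 + n1 + 1.0)
        p = p1 if b else 1.0 - p1
        total += -math.log2(p)  # cost of coding this bit
        # Update the counts after coding, so a decoder can mirror the state.
        if b:
            n1 += 1
        else:
            n0 += 1
    return total


skewed = kt_code_length_bits([0] * 100)       # highly predictable input
balanced = kt_code_length_bits([0, 1] * 50)   # order-0 counts cannot help here
```

The predictable sequence costs only a few bits, while the alternating one costs roughly one bit per symbol, because an order-0 counter cannot see the pattern. No gradient step is involved: the model state is just two counters updated identically by encoder and decoder.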
💻 Code: https://github.com/robtacconelli/midicoth
📄 Paper: Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation (2603.08771)
⭐ Space: robtacconelli/midicoth

If you ever wondered whether diffusion ideas belong in data compression — here's proof they do. ⭐ appreciated!

posted an update 4 days ago

Post

3653

🧬 Midicoth: diffusion-based lossless compression — no neural net, no GPU, no training data


updated a Space 4 days ago

Midicoth - Micro-Diffusion Compression

🗜

1

Lossless data compression via Binary Tree Tweedie Denoising

published a Space 4 days ago

Midicoth - Micro-Diffusion Compression

🗜

1

Lossless data compression via Binary Tree Tweedie Denoising

liked a model 6 days ago

mikerubini/boneage

Updated 7 days ago • 1

authored a paper 8 days ago

Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation

Paper • 2603.08771 • Published 11 days ago

submitted a paper to Daily Papers 9 days ago

Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation

Paper • 2603.08771 • Published 11 days ago

New activity in robtacconelli/smollm2-135M-GGUF 24 days ago

Improve dataset card: add metadata, links and usage information

1

#1 opened 24 days ago by nielsr

updated a dataset 24 days ago

robtacconelli/smollm2-135M-GGUF

Updated 24 days ago • 35

reacted to their post with 🔥 24 days ago

Post

3636

🏆 Nacrith: a 135M model that out-compresses everything on natural language

What if a tiny LM could compress English text better than _every_ compressor out there, classical or neural, small or large?

Nacrith pairs SmolLM2-135M with an ensemble of online predictors and high-precision arithmetic coding.

What's inside

The standard LLM+arithmetic coding approach wastes ~75% of CDF precision on large vocabularies. Our CDF-24 fix alone recovers 0.5 bpb. On top: a token N-gram that skips the GPU on predictable tokens, an adaptive bias head, llama.cpp backend (7× faster than PyTorch), multi-GPU parallel compression, and a binary file format (NC06) — the first LLM-based binary compressor we know of.
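
One way to read the ~75% figure: an arithmetic coder needs every token to keep a nonzero integer slot in the CDF, and those reserved slots come out of a fixed budget. The sketch below assumes a 16-bit CDF, a floor of one count per token, and SmolLM2's 49,152-token vocabulary. The 16-bit width and the floor are assumptions made for illustration, not details confirmed by the post, and the function names are hypothetical.

```python
def reserved_fraction(vocab_size: int, precision_bits: int, min_count: int = 1) -> float:
    """Share of the integer CDF range used up by per-symbol floors."""
    return vocab_size * min_count / (1 << precision_bits)


def quantize_cdf(probs, precision_bits: int, min_count: int = 1):
    """Turn float probabilities into an integer CDF whose total is
    2**precision_bits, giving every symbol at least `min_count`."""
    total = 1 << precision_bits
    n = len(probs)
    free = total - n * min_count  # range left after the floors
    if free < 0:
        raise ValueError("precision too low for this vocabulary size")
    counts = [min_count + int(p * free) for p in probs]
    # Hand the rounding leftover to the most probable symbol.
    top = max(range(n), key=probs.__getitem__)
    counts[top] += total - sum(counts)
    cdf = [0]
    for c in counts:
        cdf.append(cdf[-1] + c)
    return cdf


VOCAB = 49_152  # SmolLM2 vocabulary size
frac16 = reserved_fraction(VOCAB, 16)  # 0.75
frac24 = reserved_fraction(VOCAB, 24)  # about 0.003
small_cdf = quantize_cdf([0.5, 0.25, 0.25], 8)
```

Under these assumptions, 75% of a 16-bit range goes to floors and only 25% follows the model's probabilities, whereas a 24-bit range loses about 0.3%.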

Runs on a GTX 1050 Ti. ~500 MB weights, ~1.2 GB VRAM per worker.

💻 Code: https://github.com/robtacconelli/Nacrith-GPU
⭐ Space: robtacconelli/Nacrith-GPU
📄 Paper: Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding (2602.19626)

Try it, break it, share your results — all feedback welcome. ⭐ on the repo appreciated!

Results across all systems we tested:
- alice29.txt → 0.918 bpb (−44% vs CMIX, −20% vs ts_zip) — below the 2nd-order Shannon entropy bound
- enwik8 (100 MB) → 0.9389 bpb (−8% vs FineZip/LLMZip's 8B model, −15% vs ts_zip)
- Unseen text → 0.723 bpb on a doc published after training cutoff — no memorization, 26% better than FineZip/LLMZip on the same model
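
For context on the entropy comparison, an empirical order-k entropy can be computed from byte counts alone. This is a minimal sketch assuming the common definition, the conditional entropy of each byte given the k bytes before it. The post does not state which definition its bound uses, and the function name is illustrative.

```python
from collections import Counter
import math


def empirical_entropy(data: bytes, order: int) -> float:
    """Empirical conditional entropy in bits per byte of each byte given
    the previous `order` bytes, estimated from the counts in `data`."""
    n = len(data) - order
    if n <= 0:
        return 0.0
    ctx_counts = Counter()   # how often each context occurs
    pair_counts = Counter()  # how often each (context, next byte) occurs
    for i in range(order, len(data)):
        ctx = data[i - order:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, data[i])] += 1
    h = 0.0
    for (ctx, _), c in pair_counts.items():
        h -= c * math.log2(c / ctx_counts[ctx])
    return h / n


h0 = empirical_entropy(b"abababab", 0)  # two equally likely symbols
h1 = empirical_entropy(b"abababab", 1)  # fully determined by the previous byte
```

A compressor can land below such a figure because it is not limited to k bytes of context, which is how a language model with long context and pre-trained knowledge can beat a low-order bound.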

SmolLM2-135M by HuggingFaceTB

upvoted an article 24 days ago

Article

Nacrith: How a 135M Language Model Became the Best Text Compressor We've Tested

24 days ago

•

1

published an article 24 days ago

Article

Nacrith: How a 135M Language Model Became the Best Text Compressor We've Tested

24 days ago

•

1

reacted to their post with 🚀 24 days ago

Post

3636

🏆 Nacrith: a 135M model that out-compresses everything on natural language


New activity in HuggingFaceTB/SmolLM2-135M 24 days ago

Natural language lossless compressor using SmolLM2-135M

👍 1

#9 opened 24 days ago by robtacconelli

liked a model 24 days ago

HuggingFaceTB/SmolLM2-135M

Text Generation • 0.1B • Updated Feb 6, 2025 • 934k • 171

posted an update 24 days ago

Post

3636

🏆 Nacrith: a 135M model that out-compresses everything on natural language


upvoted a paper 24 days ago

Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

Paper • 2602.19626 • Published 25 days ago • 3

submitted a paper to Daily Papers 24 days ago

Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

Paper • 2602.19626 • Published 25 days ago • 3

