# llamita.cpp
Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.
llamita.cpp is a patched fork of PrismML's llama.cpp that enables Bonsai 1-bit models (Q1_0_g128) to compile and run with CUDA 10.2 on the NVIDIA Jetson Nano (SM 5.3 Maxwell, 4 GB RAM).
## Results
| Model | Size on disk | RAM used | Prompt | Generation | Board |
|---|---|---|---|---|---|
| Bonsai-8B | 1.1 GB | 2.5 GB | 2.1 tok/s | 1.1 tok/s | Jetson Nano 4GB |
| Bonsai-4B | 546 MB | ~1.5 GB | 3.6 tok/s | 1.6 tok/s | Jetson Nano 4GB |
An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.
## What Was Changed
27 files modified, ~3,200 lines of patches across 7 categories:
- C++17 → C++14: `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
- CUDA 10.2 API stubs: `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
- SM 5.3 Maxwell: warp size macros, MMQ params, flash attention disabled with stubs
- ARM NEON GCC 8: custom struct types for broken `vld1q_*_x*` intrinsics
- Linker: `-lstdc++fs` for `std::filesystem`
- Critical correctness fix: `binbcast.cu` fold expression silently computing nothing
- Build system: `CUDA_STANDARD 14`, flash attention template exclusion
## The Bug That Broke Everything
During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference, and produced complete garbage. The fix was one line.
## Links
- GitHub: coverblew/llamita.cpp
- Blog post: An 8B Model on a $99 Board
- Patch documentation: PATCHES.md
- Build guide: BUILD-JETSON.md
- Benchmarks: jetson-nano-4gb.md
## Credits

- ggml-org/llama.cpp: original llama.cpp (MIT)
- PrismML-Eng/llama.cpp: Q1_0_g128 support (MIT)
- PrismML Bonsai models: 1-bit LLMs (Apache 2.0)