llamita.cpp

Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

llamita.cpp is a patched fork of PrismML's llama.cpp that enables Bonsai 1-bit models (Q1_0_g128) to compile and run with CUDA 10.2 on the NVIDIA Jetson Nano (SM 5.3 Maxwell, 4 GB RAM).

Results

Model      Size on disk  RAM used  Prompt eval  Generation  Board
Bonsai-8B  1.1 GB        2.5 GB    2.1 tok/s    1.1 tok/s   Jetson Nano 4GB
Bonsai-4B  546 MB        ~1.5 GB   3.6 tok/s    1.6 tok/s   Jetson Nano 4GB

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

  1. C++17 to C++14: if constexpr, std::is_same_v, structured bindings, fold expressions
  2. CUDA 10.2 API stubs: nv_bfloat16 type stub, cooperative_groups/reduce.h, CUDA_R_16BF
  3. SM 5.3 Maxwell: warp-size macros, MMQ params, flash attention disabled with stubs
  4. ARM NEON on GCC 8: custom struct types for the broken vld1q_*_x* intrinsics
  5. Linker: -lstdc++fs for std::filesystem
  6. Critical correctness fix: a binbcast.cu fold expression silently computing nothing
  7. Build system: CUDA_STANDARD 14, flash attention template exclusion
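The linker change (item 5) is needed because GCC 8 ships std::filesystem in a separate archive; without it, the link fails with undefined references to std::filesystem symbols. A minimal CMake sketch, assuming a target named llama (the target name is illustrative):

```cmake
# GCC < 9 requires linking stdc++fs explicitly for std::filesystem.
if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" AND
    CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
    target_link_libraries(llama PRIVATE stdc++fs)
endif()
```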

The Bug That Broke Everything

During the C++14 port, a fold expression in binbcast.cu was replaced with (void)0. This silently broke all binary operations (add, multiply, subtract, divide): the model loaded, allocated memory, and ran inference, yet produced complete garbage. The fix was one line.

Links

Credits
