llamita.cpp

Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

llamita.cpp is a patched fork of PrismML's llama.cpp that enables Bonsai 1-bit models (Q1_0_g128) to compile and run with CUDA 10.2 on the NVIDIA Jetson Nano (SM 5.3 Maxwell, 4 GB RAM).

Results

Model      Size on disk  RAM used  Prompt eval  Generation  Board
Bonsai-8B  1.1 GB        2.5 GB    2.1 tok/s    1.1 tok/s   Jetson Nano 4GB
Bonsai-4B  546 MB        ~1.5 GB   3.6 tok/s    1.6 tok/s   Jetson Nano 4GB

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

  1. C++17 to C++14: if constexpr, std::is_same_v, structured bindings, fold expressions
  2. CUDA 10.2 API stubs: nv_bfloat16 type stub, cooperative_groups/reduce.h, CUDA_R_16BF
  3. SM 5.3 Maxwell: warp-size macros, MMQ params, flash attention disabled with stubs
  4. ARM NEON on GCC 8: custom struct types for the broken vld1q_*_x* intrinsics
  5. Linker: -lstdc++fs for std::filesystem
  6. Critical correctness fix: a binbcast.cu fold expression silently computing nothing
  7. Build system: CUDA_STANDARD 14, flash attention template exclusion
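The linker change (item 5) is needed because GCC 8 ships std::filesystem in a separate archive; without it, the link fails with undefined references to std::filesystem symbols. A minimal CMake sketch, assuming a target named llama (the target name is illustrative):

```cmake
# GCC < 9 requires linking stdc++fs explicitly for std::filesystem.
if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" AND
    CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
    target_link_libraries(llama PRIVATE stdc++fs)
endif()
```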

The Bug That Broke Everything

During the C++14 port, a fold expression in binbcast.cu was replaced with (void)0. This silently broke all binary operations (add, multiply, subtract, divide): the model loaded, allocated memory, and ran inference, yet produced complete garbage. The fix was one line.

Links

Credits
