This repository contains 2-bit quantized LLaMA-v1 models in GGUF format for use with llama.cpp.
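These files can be used directly with llama.cpp's example programs. A minimal sketch is shown below; the GGUF file name is an assumption, so substitute the file you actually downloaded:

```bash
# Generate 128 tokens with one of the 2-bit models.
# The model file name below is illustrative.
./main -m llama-7b-q2k.gguf -p "The meaning of life is" -n 128
```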
All tensors are quantized with Q2_K, except for output.weight, which is Q6_K, and, in the case of LLaMA-v2-70B (not included in this repository), attn_v, which is Q4_K.
The quantized models differ from the standard llama.cpp 2-bit quantization in two ways:
- These are actual 2-bit quantized models, rather than the mostly 3-bit quantization produced by the standard llama.cpp `Q2_K` quantization method (the standard workflow is sketched below for comparison).
- The models were prepared with a refined (but not yet published) k-quants quantization approach.
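For comparison, this is roughly what the stock `Q2_K` workflow looks like with llama.cpp's own tools. It yields the standard (mostly 3-bit) `Q2_K` mix, not the refined mix used for the models here, and the paths and file names are assumptions:

```bash
# Convert the original LLaMA weights to a 16-bit GGUF, then apply the
# stock Q2_K mix. This is the *standard* llama.cpp quantization, not the
# refined (unpublished) k-quants variant used for the models in this repo.
python convert.py /path/to/llama-7b --outfile llama-7b-f16.gguf
./quantize llama-7b-f16.gguf llama-7b-q2k.gguf Q2_K
```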
The table below shows Wikitext perplexities for a context length of 2048 tokens, computed with these models using llama.cpp:
| Model | Perplexity |
|---|---|
| 7B | 6.4023 |
| 13B | 5.3967 |
| 30B | 4.5065 |
| 65B | 3.9136 |
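Figures of this kind can be reproduced with llama.cpp's perplexity tool. A minimal sketch, assuming the standard Wikitext-2 test file and an illustrative model file name:

```bash
# Compute Wikitext perplexity at a context length of 2048 tokens.
./perplexity -m llama-7b-q2k.gguf -f wiki.test.raw -c 2048
```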