# Granite-Docling-258M-GGUF

GGUF format of ibm-granite/granite-docling-258M, a multimodal document OCR model that converts document images to Docling format.

Converted with llama.cpp.

## Files

| File | Quant | Size | Note |
|------|-------|------|------|
| granite-docling-258M-f16.gguf | F16 | 317 MB | Full precision |
| granite-docling-258M-q8_0.gguf | Q8_0 | 170 MB | Recommended |
| mmproj-granite-docling-258M-f16.gguf | F16 | 182 MB | Vision encoder (required) |

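To fetch the quantized weights plus the required vision projector in one step, one option is `huggingface-cli` (the repo name here is assumed from this card's URL):

```bash
# Download the Q8_0 weights and the mmproj file into the current directory.
# Repo name assumed from this model card; adjust if you host a fork.
huggingface-cli download padeoe/granite-docling-258M-GGUF \
  granite-docling-258M-q8_0.gguf \
  mmproj-granite-docling-258M-f16.gguf \
  --local-dir .
```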
## Usage

### CLI

```bash
llama-mtmd-cli \
    --model granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --image document.png \
    --n-predict 4096 --ctx-size 8192 --temp 0.0 \
    -p "Convert this page to docling."
```

### Server (OpenAI-compatible API)

```bash
llama-server \
    -m granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --ctx-size 8192 --special --jinja \
    --host 0.0.0.0 --port 8080
```
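With the server running, pages can be submitted through the OpenAI-compatible chat completions endpoint. A minimal sketch, assuming the server is reachable at `localhost:8080` and `document.png` is a local page image (images are passed as base64 data URLs):

```bash
# Encode the page image and post it to the chat completions endpoint.
# Assumes llama-server is running as shown above.
IMG_B64=$(base64 -w0 document.png)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Convert this page to docling." },
        { "type": "image_url",
          "image_url": { "url": "data:image/png;base64,${IMG_B64}" } }
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.0
}
EOF
```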

## Benchmark (CPU-only, Q8_0)

| CPU | Config | Long text (4096 tok) | Short text (50 tok) |
|-----|--------|----------------------|---------------------|
| EPYC 9654 (96C) | 192 inst × 1t | 1.73 img/s | 29.4 img/s |
| EPYC 9654 (16C) | 16 inst × 1t | 0.67 img/s | 8.68 img/s |

For a model this small, many single-threaded instances (one per core) yield better aggregate throughput than fewer multi-threaded instances.
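The one-thread-per-core setup from the table can be reproduced with a simple launcher. A hypothetical sketch, assuming one `llama-server` per core pinned with `taskset` and listening on consecutive ports (core count, port base, and file names are illustrative):

```bash
# Launch NCORES single-threaded llama-server instances,
# each pinned to its own core on its own port.
NCORES=16
for i in $(seq 0 $((NCORES - 1))); do
  taskset -c "$i" llama-server \
    -m granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --ctx-size 8192 --threads 1 \
    --host 127.0.0.1 --port $((8080 + i)) &
done
wait
```

A load balancer (or a client that round-robins across ports 8080 to 8080+NCORES−1) then spreads requests over the instances.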
