# Granite-Docling-258M-GGUF

GGUF format of ibm-granite/granite-docling-258M, a multimodal document OCR model that converts document images to Docling format.

Converted with llama.cpp.

## Files

| File | Quant | Size | Note |
|------|-------|------|------|
| granite-docling-258M-f16.gguf | F16 | 317 MB | Full precision |
| granite-docling-258M-q8_0.gguf | Q8_0 | 170 MB | Recommended |
| mmproj-granite-docling-258M-f16.gguf | F16 | 182 MB | Vision encoder (required) |

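To fetch the quantized weights plus the required vision projector in one step, one option is `huggingface-cli` (the repo name here is assumed from this card's URL):

```bash
# Download the Q8_0 weights and the mmproj file into the current directory.
# Repo name assumed from this model card; adjust if you host a fork.
huggingface-cli download padeoe/granite-docling-258M-GGUF \
  granite-docling-258M-q8_0.gguf \
  mmproj-granite-docling-258M-f16.gguf \
  --local-dir .
```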
## Usage

### CLI

```bash
llama-mtmd-cli \
    --model granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --image document.png \
    --n-predict 4096 --ctx-size 8192 --temp 0.0 \
    -p "Convert this page to docling."
```

### Server (OpenAI-compatible API)

```bash
llama-server \
    -m granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --ctx-size 8192 --special --jinja \
    --host 0.0.0.0 --port 8080
```
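With the server running, pages can be submitted through the OpenAI-compatible chat completions endpoint. A minimal sketch, assuming the server is reachable at `localhost:8080` and `document.png` is a local page image (images are passed as base64 data URLs):

```bash
# Encode the page image and post it to the chat completions endpoint.
# Assumes llama-server is running as shown above.
IMG_B64=$(base64 -w0 document.png)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Convert this page to docling." },
        { "type": "image_url",
          "image_url": { "url": "data:image/png;base64,${IMG_B64}" } }
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.0
}
EOF
```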

## Benchmark (CPU-only, Q8_0)

| CPU | Config | Long text (4096 tok) | Short text (50 tok) |
|-----|--------|----------------------|---------------------|
| EPYC 9654 (96C) | 192 inst × 1t | 1.73 img/s | 29.4 img/s |
| EPYC 9654 (16C) | 16 inst × 1t | 0.67 img/s | 8.68 img/s |

For a model this small, many single-threaded instances (one per core) yield better aggregate throughput than fewer multi-threaded instances.
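The one-thread-per-core setup from the table can be reproduced with a simple launcher. A hypothetical sketch, assuming one `llama-server` per core pinned with `taskset` and listening on consecutive ports (core count, port base, and file names are illustrative):

```bash
# Launch NCORES single-threaded llama-server instances,
# each pinned to its own core on its own port.
NCORES=16
for i in $(seq 0 $((NCORES - 1))); do
  taskset -c "$i" llama-server \
    -m granite-docling-258M-q8_0.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --ctx-size 8192 --threads 1 \
    --host 127.0.0.1 --port $((8080 + i)) &
done
wait
```

A load balancer (or a client that round-robins across ports 8080 to 8080+NCORES−1) then spreads requests over the instances.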
