# Qwen2.5-VL-7B HQQ 4-bit Clean Model
This is a clean HQQ 4-bit quantized version of Qwen2.5-VL-7B-Instruct with no meta tensor issues.
## Model Overview
- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
- Quantization Method: HQQ (Half-Quadratic Quantization)
- Precision: 4-bit
- Compatible Backends: gemlite, torchao_int4, bitblas, marlin
## Quantization Configuration
```python
HqqConfig(
    nbits=4,        # 4-bit precision
    group_size=64,  # quantization group size
    axis=1          # quantization axis (1 = along the input dimension)
)
```
**Quantization Method:**

- Quantized during model loading via the `quantization_config` parameter
- Uses the transformers-native HQQ integration for maximum compatibility
- Meta tensor prevention with `_fast_init=False`
- Safe serialization format for reliable loading
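For a rough sense of what this configuration buys, here is a back-of-envelope size estimate. The per-group metadata layout (an fp16 scale plus an 8-bit zero-point per group of 64) is an assumption for illustration; HQQ's actual packing may differ.

```python
# Rough weight-footprint estimate; metadata layout is assumed, not read from hqq.
nbits, group_size = 4, 64
meta_bits = 16 + 8                      # assumed fp16 scale + 8-bit zero per group
bits_per_weight = nbits + meta_bits / group_size
params = 7e9                            # ~7B weights in the quantized linears
print(f"~{bits_per_weight:.3f} bits/weight -> ~{params * bits_per_weight / 8 / 1e9:.1f} GB")
# ~4.375 bits/weight -> ~3.8 GB, versus ~14 GB for the same weights at fp16
```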
## About HQQ (Half-Quadratic Quantization)
HQQ is a fast, calibration-free quantization method that offers several advantages:
### Key Features
- Fast: Quantizes large models in minutes (reported to be ~50x faster than GPTQ)
- No Calibration Data Required: Works without calibration datasets
- Weight-Focused: Minimizes the error between original and dequantized weights
- Universal: Compatible with any model modality (LLMs, vision, multimodal)
- Optimized Inference: Supports fused kernels (torchao, marlin) for faster inference
### Technical Highlights
- Robust Optimization: Uses a half-quadratic solver with a non-convex ℓp norm (p < 1) to find optimal quantization parameters; see the sketch after this list
- Linear Dequantization: Compatible with optimized CUDA/Triton kernels
- Quality Preservation: Maintains model performance while drastically reducing size
- PEFT Compatible: Supports fine-tuning with Parameter-Efficient Fine-Tuning methods
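For intuition, below is a conceptual sketch of the half-quadratic update the method describes, paraphrased from the published approach rather than taken from the `hqq` library's actual code. Function names, the eps clamp in the shrinkage operator, and the iteration count are illustrative assumptions.

```python
import torch

def shrink_lp(x: torch.Tensor, beta: float, p: float = 0.7) -> torch.Tensor:
    # Generalized soft-thresholding: proximal step for the non-convex
    # l_p norm (p < 1); the clamp only guards pow() at zero.
    return torch.sign(x) * torch.relu(
        x.abs() - (1.0 / beta) * x.abs().clamp_min(1e-8).pow(p - 1)
    )

def refine_zero_point(W, scale, zero, nbits=4, iters=20, beta=1e4):
    # Alternate between (1) quantizing with the current zero-point and
    # (2) a closed-form zero-point update against the shrunken error.
    qmax = 2**nbits - 1
    for _ in range(iters):
        Wq = torch.clamp(torch.round(W / scale + zero), 0, qmax)  # quantize
        We = shrink_lp(W - scale * (Wq - zero), beta)             # error estimate
        zero = torch.mean(Wq - (W - We) / scale, dim=1, keepdim=True)
    return zero
```

Because the objective only involves the weights themselves, no calibration data is needed, which is what makes HQQ fast.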
## Usage
### Basic Setup
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from hqq.utils.patching import prepare_for_inference

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# Apply HQQ patching with the desired backend
prepare_for_inference(model, backend="gemlite", verbose=True)

# Load processor
processor = AutoProcessor.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    trust_remote_code=True,
)
```
### Inference Example
```python
# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens before decoding
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])
```
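The same pipeline handles video through `qwen_vl_utils`; a minimal sketch follows, where the file path and `fps` value are placeholders.

```python
# Video variant of the example above; path and fps are illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
```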
## Quantization Process
This model was quantized using the following approach:

- Load with Quantization Config: the model is quantized during loading using transformers' native HQQ integration
- Meta Tensor Prevention: `_fast_init=False` is used to avoid meta tensor issues
- Safe Serialization: saved with `safe_serialization=True` for reliable model loading
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    quantization_config=quant_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    _fast_init=False,
).eval()

model.save_pretrained("output_dir", safe_serialization=True)
```
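Since the point of this repo is avoiding meta tensor issues, a reasonable sanity check after saving is to reload the checkpoint and assert that nothing stayed on the meta device; the snippet below is a hedged sketch of that check.

```python
# Reload the saved checkpoint and verify no parameter is a meta tensor.
reloaded = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "output_dir",
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
assert not any(p.device.type == "meta" for p in reloaded.parameters()), "meta tensors found"
```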
## Supported Backends
- `gemlite`: Recommended for best performance on consumer GPUs
- `torchao_int4`: PyTorch-native implementation
- `bitblas`: Optimized for specific hardware
- `marlin`: High-performance kernel for inference
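Backend availability depends on which packages are installed and on the GPU architecture. A small fallback helper such as the sketch below can pick the first usable kernel; the try/except pattern is our assumption, not part of the `hqq` API.

```python
from hqq.utils.patching import prepare_for_inference

def patch_with_fallback(model, backends=("gemlite", "torchao_int4", "bitblas", "marlin")):
    # Try each kernel in order; a missing package or unsupported GPU
    # typically surfaces as an exception from prepare_for_inference.
    for backend in backends:
        try:
            prepare_for_inference(model, backend=backend, verbose=True)
            return backend
        except Exception as err:
            print(f"{backend} unavailable: {err}")
    raise RuntimeError("no HQQ inference backend available")
```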
## Applications
This quantized model excels at:
- Document Understanding: Extract structured data from documents
- OCR and Text Extraction: Read and understand text in images
- Invoice Processing: Parse invoices, receipts, and financial documents
- Visual Question Answering: Answer questions about image content
- Multimodal Tasks: General vision-language understanding
## Related Resources
- Base Model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- HQQ Library: [mobiusml/hqq](https://github.com/mobiusml/hqq)
- Qwen VL Utils: [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
## License
Apache 2.0 (inherited from base model)
## Acknowledgments
- Qwen team for the excellent base model
- MobiusML for the HQQ quantization method
- Hugging Face for transformers integration