Qwen2.5-VL-7B HQQ 4-bit Clean Model

This is a clean HQQ 4-bit quantized version of Qwen2.5-VL-7B-Instruct with no meta tensor issues.

🎯 Model Overview

  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Quantization Method: HQQ (Half-Quadratic Quantization)
  • Precision: 4-bit
  • Compatible Backends: gemlite, torchao_int4, bitblas, marlin

📊 Quantization Configuration

HqqConfig(
    nbits=4,           # 4-bit precision
    group_size=64,     # Quantization group size
    axis=1             # Quantization axis
)
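
The nbits / group_size choice determines the effective footprint per weight. A rough back-of-the-envelope estimate (assuming one fp16 scale and one fp16 zero-point per group of 64 weights; packing and metadata overheads ignored):

nbits, group_size = 4, 64
meta_bits = 16 + 16                      # one fp16 scale + one fp16 zero-point per group (assumption)
effective_bits = nbits + meta_bits / group_size
print(effective_bits)                    # 4.5 bits per weight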

Quantization Method:

  • Quantized during model loading via quantization_config parameter
  • Uses transformers-native HQQ integration for maximum compatibility
  • Meta tensor prevention with _fast_init=False
  • Safe serialization format for reliable loading

🔬 About HQQ (Half-Quadratic Quantization)

HQQ is a fast, calibration-free quantization method that offers several advantages:

Key Features

  • ⚡ Fast: Quantizes large models in minutes (50x faster than GPTQ)
  • 🎯 No Calibration Data Required: Works without calibration datasets
  • 🔧 Weight-Focused: Minimizes errors between original and dequantized weights
  • 🌐 Universal: Compatible with any model modality (LLMs, Vision, Multimodal)
  • 🚀 Optimized Inference: Supports fused kernels (torchao, marlin) for faster inference

Technical Highlights

  1. Robust Optimization: Uses a Half-Quadratic solver with a non-convex ℓp norm (p < 1) to find optimal quantization parameters
  2. Linear Dequantization: Compatible with optimized CUDA/Triton kernels (see the sketch after this list)
  3. Quality Preservation: Maintains model performance while drastically reducing size
  4. PEFT Compatible: Supports fine-tuning with Parameter-Efficient Fine-Tuning methods
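
To make the linear-dequantization point concrete, here is a minimal illustrative sketch of the affine dequantization step. Shapes and grouping are simplified; real HQQ packs the 4-bit codes and stores per-group scales and zero-points.

import torch

# Toy tensors standing in for one quantized weight matrix (simplified: per-row instead of per-group)
W_q   = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)   # 4-bit codes stored in uint8
scale = torch.rand(4096, 1, dtype=torch.float16)                 # scale
zero  = torch.rand(4096, 1, dtype=torch.float16)                 # zero-point

# Dequantization is a single affine transform, which is why fused CUDA/Triton kernels can accelerate it
W_approx = (W_q.to(torch.float16) - zero) * scale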

💻 Usage

Basic Setup

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from hqq.utils.patching import prepare_for_inference

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True
)

# Apply HQQ patching with backend
prepare_for_inference(model, backend='gemlite', verbose=True)

# Load processor
processor = AutoProcessor.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    trust_remote_code=True
)

Inference Example

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])

🔧 Quantization Process

This model was quantized using the following approach:

  1. Load with Quantization Config: Model quantized during loading using transformers' native HQQ integration
  2. Meta Tensor Prevention: Used _fast_init=False to avoid meta tensor issues
  3. Safe Serialization: Saved using safe_serialization=True for reliable model loading

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map='cuda:0',
    quantization_config=quant_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    _fast_init=False,
).eval()

model.save_pretrained("output_dir", safe_serialization=True)
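
As a quick sanity check for the "no meta tensor issues" claim, the saved checkpoint can be reloaded and inspected (illustrative; "output_dir" is the directory used above):

# Reload the quantized checkpoint and confirm no parameter was left on the meta device
reloaded = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "output_dir",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)
assert not any(p.is_meta for p in reloaded.parameters()), "meta tensors detected"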

Supported Backends

  • gemlite: Recommended for best performance on consumer GPUs
  • torchao_int4: PyTorch native implementation
  • bitblas: Optimized for specific hardware
  • marlin: High-performance kernel for inference
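
Switching backends only changes the patching call; the matching kernel package (e.g. gemlite or bitblas) must be installed for the chosen backend:

# Same call as in Basic Setup, with a different backend selected
prepare_for_inference(model, backend="torchao_int4", verbose=True)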

🎯 Applications

This quantized model excels at:

  • Document Understanding: Extract structured data from documents
  • OCR and Text Extraction: Read and understand text in images
  • Invoice Processing: Parse invoices, receipts, and financial documents (see the prompt example after this list)
  • Visual Question Answering: Answer questions about image content
  • Multimodal Tasks: General vision-language understanding
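
For the document and invoice use cases, only the prompt in the inference example above changes; the image path and requested fields below are placeholders:

# Illustrative document-extraction prompt; feed it through the same processing/generation steps as above
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/invoice.png"},
            {"type": "text", "text": "Extract the vendor, invoice number, date, and total amount as JSON."},
        ],
    }
]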

📜 License

Apache 2.0 (inherited from base model)

πŸ™ Acknowledgments

  • Qwen team for the excellent base model
  • MobiusML for the HQQ quantization method
  • Hugging Face for transformers integration