Gemma 4 E2B - 4-Bit Quantized (Multimodal)

This repository contains native 4-bit bitsandbytes quantized weights for Google's Gemma 4 E2B (Instruct).

These weights were quantized with the Hugging Face transformers and bitsandbytes libraries. Unlike GGUF and other third-party formats, which often require the vision and audio projectors to be stripped out or shipped separately, this quantization keeps the model's multimodal components in a single checkpoint. Text, image, and audio inputs can all be processed directly in PyTorch on the GPU.

The "E" stands for "effective" parameters: the model uses Per-Layer Embeddings (PLE) to reach the reasoning depth of a larger model while using only 2.3B effective parameters (out of 5.1B total) during generation.

📦 Quantization Format

| Model Directory | Bit-Rate | Quantization Type | Description |
| --- | --- | --- | --- |
| gemma-4-E2B-it-q4 | 4-bit | NF4 (NormalFloat4) | Recommended. Maximum VRAM savings while maintaining high reasoning and full multimodal (vision/audio) capabilities. |
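
To give a rough sense of what "4-bit" means for weight storage, the sketch below works through the arithmetic for blockwise NF4 as bitsandbytes implements it: two 4-bit codes packed per byte, plus one scale per block of parameters. The block size of 64 and the double-quantization block size of 256 are the bitsandbytes defaults; whether this repository was saved with double quantization is not stated, so both cases are shown. The figure for "all parameters in NF4" is only a lower bound, because bitsandbytes quantizes linear layers and leaves embeddings and some other tensors in 16- or 32-bit precision.

```python
def nf4_bits_per_param(block_size: int = 64, double_quant: bool = False,
                       dq_block_size: int = 256) -> float:
    """Average storage bits per quantized parameter for blockwise NF4."""
    if double_quant:
        # Scales are themselves stored as 8-bit values, with one fp32
        # constant per group of dq_block_size scales.
        scale_bits = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 absmax scale per block of parameters.
        scale_bits = 32 / block_size
    return 4 + scale_bits


def weights_gib(num_params: float, bits_per_param: float) -> float:
    """Convert a parameter count and a bit width into GiB of storage."""
    return num_params * bits_per_param / 8 / 2**30


TOTAL_PARAMS = 5.1e9  # total parameter count quoted in this README

fp16 = weights_gib(TOTAL_PARAMS, 16)
nf4 = weights_gib(TOTAL_PARAMS, nf4_bits_per_param())
print(f"fp16, all parameters: {fp16:.2f} GiB")
print(f"NF4, all parameters:  {nf4:.2f} GiB (lower bound, see note above)")
```

The real footprint of this checkpoint falls between the two printed values, depending on how many parameters sit in layers that bitsandbytes leaves unquantized.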

Note: The KV cache needs VRAM on top of the model weights during generation. The amount grows with context length (e.g., 8K versus the maximum 128K) and with large multimodal inputs.
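
As a rough way to budget for this, the sketch below estimates KV cache size for a plain full-attention decoder: one key and one value tensor per layer, each of shape (batch, kv_heads, seq_len, head_dim). The layer count, KV head count, and head dimension used here are hypothetical placeholders, not values read from this model's config; substitute the numbers from the checkpoint's config.json. Architectures that use sliding-window attention or share KV tensors across layers will need less than this estimate, so treat it as an upper bound.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Upper-bound KV cache size for a full-attention decoder.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to a float16 or bfloat16 cache.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical config values, for illustration only.
EXAMPLE = dict(num_layers=30, num_kv_heads=1, head_dim=256)

for context in (8 * 1024, 128 * 1024):
    size_mib = kv_cache_bytes(context, **EXAMPLE) / 2**20
    print(f"{context:>7} tokens: {size_mib:,.0f} MiB")
```

Because the estimate is linear in sequence length, going from an 8K to a 128K context multiplies the cache size by 16 under these assumptions.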

🚀 How to Use

Because these weights are stored in the Hugging Face transformers format, you do not need an external framework such as llama.cpp. You can load and run them directly with the transformers library, which keeps the multimodal pipelines intact.

🐍 Python (Hugging Face Transformers)

First, ensure you have the required libraries installed:

pip install transformers accelerate bitsandbytes

Then, you can load the model dynamically into your VRAM:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the local path or repo ID
model_id = "./gemma-4-E2B-it-q4" 

# Load the tokenizer (Note: For multimodal tasks, you would also load the AutoProcessor here)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the pre-quantized 4-bit model
# device_map="auto" will automatically dispatch layers to your GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# Format the prompt using the chat template
messages = [
    {"role": "system", "content": "You are a helpful expert assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]

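# return_dict=True returns both input_ids and attention_mask,
# so generate() does not have to infer the mask
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0
)

# Decode only the newly generated tokens and print them
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)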
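
The text-only example above notes that multimodal tasks use an AutoProcessor. The following is a minimal sketch of what an image prompt could look like, assuming this checkpoint loads through AutoModelForImageTextToText and that its processor accepts the generic transformers multimodal chat format (content parts with "type": "image" and "type": "text"). Neither assumption is confirmed by this README, and the helper names build_image_messages and describe_image are hypothetical.

```python
from typing import Any


def build_image_messages(image_path: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn chat containing one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def describe_image(model_id: str, image_path: str, question: str) -> str:
    """Run one image + text prompt through the model (assumed API, see above)."""
    # Imported lazily so the message helper can be used without a GPU stack.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.float16
    )

    # The processor tokenizes the text and preprocesses the image together.
    inputs = processor.apply_chat_template(
        build_image_messages(image_path, question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=torch.float16)

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256)

    # Decode only the tokens generated after the prompt.
    prompt_length = inputs["input_ids"].shape[1]
    return processor.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

A call would look like `describe_image("./gemma-4-E2B-it-q4", "photo.jpg", "Describe this image.")`, where photo.jpg stands in for any local image file.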

⚖️ License & Acknowledgements

These weights are derivative works of Google's Gemma 4 E2B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.
