Gemma 4 E2B - 4-Bit Quantized (Multimodal)
This repository contains native 4-bit bitsandbytes quantized weights for Google's Gemma 4 E2B (Instruct).
The weights were compressed using the Hugging Face transformers and bitsandbytes libraries. Unlike GGUF and other third-party formats, which often require stripping out or separating the vision and audio projectors, this native Hugging Face quantization retains the model's multimodal capabilities: you can process text, images, and audio directly in PyTorch on the GPU.
The "E" stands for "effective" parameters: the model uses Per-Layer Embeddings (PLE) to approach the reasoning depth of a larger model while using only 2.3B effective parameters (out of 5.1B total) during generation.
📦 Quantization Format
| Model Directory | Bit-Rate | Quantization Type | Description |
|---|---|---|---|
| gemma-4-E2B-it-q4 | 4-bit | NF4 (NormalFloat4) | Recommended. Maximum VRAM savings while maintaining high reasoning and full multimodal (vision/audio) capabilities. |
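For reference, an NF4 checkpoint of this kind is typically produced by loading a full-precision model with a `BitsAndBytesConfig` and saving the result. The sketch below shows one plausible configuration. Only the NF4 quantization type comes from the table above; the compute dtype, double quantization, and the helper name `quantize_and_save` are assumptions for illustration, not the settings actually used for this repository.

```python
# Assumed settings: only bnb_4bit_quant_type="nf4" is stated in this README.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",  # assumption, not confirmed by the README
    "bnb_4bit_use_double_quant": True,     # assumption, not confirmed by the README
}


def quantize_and_save(source_id: str, output_dir: str) -> None:
    """Load a full-precision checkpoint in 4-bit and save it (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    config = BitsAndBytesConfig(**NF4_KWARGS)
    model = AutoModelForImageTextToText.from_pretrained(
        source_id,
        quantization_config=config,
        device_map="auto",
    )
    # Save the quantized weights together with the processor files.
    model.save_pretrained(output_dir)
    AutoProcessor.from_pretrained(source_id).save_pretrained(output_dir)
```

The caller supplies `source_id` (the original full-precision repository) and `output_dir`, for example `"./gemma-4-E2B-it-q4"`.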
Note: Long context windows (e.g., 8K or the maximum 128K tokens) and large multimodal inputs require additional VRAM for the KV cache during generation, on top of the quantized weights.
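To get a feel for the numbers behind this note, the sketch below estimates weight storage and KV cache size. The 5.1B total parameter count comes from this README. The layer count, KV head count, and head dimension are hypothetical placeholders, so substitute the values from the model's `config.json`. The KV formula is a simple upper bound: architectures that use sliding-window attention or share KV tensors across layers need less.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: float, bits_per_param: float = 4.0) -> float:
    """Approximate weight storage, ignoring quantization scales and unquantized layers."""
    return num_params * bits_per_param / 8


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Upper bound: one key and one value vector per layer, KV head, and token."""
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# 5.1B total parameters (from this README) stored at 4 bits each.
print(f"weights: {weight_bytes(5.1e9) / GIB:.2f} GiB")

# Hypothetical architecture values, for illustration only.
for tokens in (8 * 1024, 128 * 1024):
    size = kv_cache_bytes(tokens, num_layers=30, num_kv_heads=1, head_dim=256)
    print(f"KV cache at {tokens} tokens: {size / GIB:.2f} GiB")
```

Because the cache grows linearly with sequence length, a 128K-token context needs 16 times the cache of an 8K-token one under the same settings.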
🚀 How to Use
Because the weights are stored in the native Hugging Face format, you do not need an external runtime such as llama.cpp. You can load and run them directly with the transformers library, with all multimodal pipelines intact.
🐍 Python (Hugging Face Transformers)
First, ensure you have the required libraries installed:
pip install transformers accelerate bitsandbytes
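Before loading the model, it can help to confirm that the three packages are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example.

```python
from importlib.util import find_spec

REQUIRED = ("transformers", "accelerate", "bitsandbytes")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```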
Then load the model onto your GPU:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the local path or repo ID
model_id = "./gemma-4-E2B-it-q4"

# Load the tokenizer (Note: For multimodal tasks, you would also load the AutoProcessor here)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the pre-quantized 4-bit model.
# The quantization settings are read from the saved config, and
# device_map="auto" dispatches the layers to your GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Format the prompt using the chat template.
# return_dict=True returns input_ids and attention_mask together.
messages = [
    {"role": "system", "content": "You are a helpful expert assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
)

# Decode only the newly generated tokens and print them
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)
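For multimodal input, the same flow applies with `AutoProcessor` and `AutoModelForImageTextToText` in place of the tokenizer and causal-LM classes. The sketch below follows the standard Transformers image-text-to-text pattern; the helper names `build_image_messages` and `ask_about_image` are made up for this example, and the heavy imports are deferred so the message structure can be inspected without loading the model.

```python
def build_image_messages(image_url: str, question: str) -> list:
    """Build a single-turn chat containing one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


def ask_about_image(model_id: str, image_url: str, question: str,
                    max_new_tokens: int = 40) -> str:
    """Run one image question through the model (hypothetical helper)."""
    # Imported lazily: these require transformers, torch, and the model weights.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    # The processor fetches the image and tokenizes the chat in one step.
    inputs = processor.apply_chat_template(
        build_image_messages(image_url, question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Skip the prompt tokens so only the answer is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```

A call might look like `ask_about_image("ruygar/gemma-4-E2B-it-BB", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", "What animal is on the candy?")`.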
⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 E2B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.