Instructions to use mistral-experimental/pixtral-12b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mistral-experimental/pixtral-12b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="mistral-experimental/pixtral-12b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("mistral-experimental/pixtral-12b")
model = AutoModelForImageTextToText.from_pretrained("mistral-experimental/pixtral-12b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use mistral-experimental/pixtral-12b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mistral-experimental/pixtral-12b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistral-experimental/pixtral-12b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/mistral-experimental/pixtral-12b

SGLang

How to use mistral-experimental/pixtral-12b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mistral-experimental/pixtral-12b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistral-experimental/pixtral-12b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mistral-experimental/pixtral-12b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistral-experimental/pixtral-12b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'


Ask for guidance on batch inference

#15

by nguyen-brat - opened Oct 18, 2024

Discussion

nguyen-brat

Oct 18, 2024

I read the README and tried to run the model on batched input using an approach similar to the Qwen2-VL one, but it did not work. Can you please show me how to do it?

nguyen-brat

Oct 19, 2024

Anyone looking for batch inference can take a look at my solution:

from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    # attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Register a pad token so prompts of different lengths can be padded into one batch
processor.tokenizer.add_special_tokens({'pad_token': '<pad>'})
img = "../data/warn_up/warmup-images/bc24654fb4fba69b41b6b4dce15295fc4acc8ebce9b9bff452ef6a8890e04e72.jpg"
img = Image.open(img)

chat = [
    {
      "role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "content": "based on the text in this image explain why this image contain sarcasm meaning ? based on the text in this image explain why this image contain sarcasm meaning ?"}, 
      ]
    }
]

chat_2 = [
    {
      "role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "content": "based "}, 
      ]
    }
]

# Render each chat into a prompt string
prompt = processor.apply_chat_template(chat)
prompt_2 = processor.apply_chat_template(chat_2)
# One list of images per prompt; pad on the left so generation continues from the end of each prompt
inputs = processor(
    text=[prompt, prompt_2],
    images=[[img], [img]],
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to(model.device)

with torch.no_grad():
    generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output)

I have tested it and it works.
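
A side note on the output: `generate` returns the prompt tokens followed by the newly generated ones, so the decoded strings include the prompts. Since the batch is left-padded, every row has the same prompt length and the prompt can be sliced off before decoding. A minimal sketch of that slicing on stand-in data; `strip_prompt` is a hypothetical helper name, not part of Transformers, and the token ids are made up:

```python
def strip_prompt(generated_rows, prompt_len):
    """Drop the first `prompt_len` tokens from every generated row.

    Works on nested lists and on 2-D tensors alike, because it only
    relies on row iteration and slicing.
    """
    return [row[prompt_len:] for row in generated_rows]


# Stand-in data: two left-padded prompts of length 4 (0 plays the pad id),
# each followed by two newly generated tokens.
generated = [
    [0, 0, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]
new_tokens = strip_prompt(generated, prompt_len=4)
print(new_tokens)
```

With the real tensors, the equivalent slice would be `generate_ids[:, inputs["input_ids"].shape[1]:]`, passed to `processor.batch_decode` in place of `generate_ids`.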

manan05

Oct 27, 2024

@nguyen-brat Hey, it doesn't work. I tried the same code and I get this error every single time:
ValueError: Image features and image tokens do not match: tokens: 645, features 1290

or variations of it. What's happening here?

nguyen-brat

Oct 29, 2024

@manan05 you can try my Anaconda environment to see whether that error still pops up, because it runs fine on my computer:

name: pixtral
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - asttokens=2.4.1=pyhd8ed1ab_0
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.9.24=h06a4308_0
  - comm=0.2.2=pyhd8ed1ab_0
  - debugpy=1.8.7=py310hf71b8c6_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - exceptiongroup=1.2.2=pyhd8ed1ab_0
  - executing=2.1.0=pyhd8ed1ab_0
  - importlib-metadata=8.5.0=pyha770c72_0
  - ipykernel=6.29.5=pyh3099207_0
  - ipython=8.28.0=pyh707e725_0
  - jedi=0.19.1=pyhd8ed1ab_0
  - jupyter_client=8.6.3=pyhd8ed1ab_0
  - jupyter_core=5.7.2=pyh31011fe_1
  - krb5=1.21.3=h143b758_0
  - ld_impl_linux-64=2.40=h12ee557_0
  - libedit=3.1.20230828=h5eee18b_0
  - libffi=3.4.2=h7f98852_5
  - libgcc=14.2.0=h77fa898_1
  - libgcc-ng=14.2.0=h69a702a_1
  - libgomp=14.2.0=h77fa898_1
  - libnsl=2.0.1=hd590300_0
  - libsodium=1.0.20=h4ab18f5_0
  - libsqlite=3.46.1=hadc24fc_0
  - libstdcxx=14.2.0=hc0a3c3a_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=2.38.1=h0b41bf4_0
  - libxcrypt=4.4.36=hd590300_1
  - libzlib=1.3.1=hb9d3cd8_2
  - matplotlib-inline=0.1.7=pyhd8ed1ab_0
  - ncurses=6.5=he02047a_1
  - nest-asyncio=1.6.0=pyhd8ed1ab_0
  - openssl=3.3.2=hb9d3cd8_0
  - packaging=24.1=pyhd8ed1ab_0
  - parso=0.8.4=pyhd8ed1ab_0
  - pexpect=4.9.0=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=24.2=py310h06a4308_0
  - platformdirs=4.3.6=pyhd8ed1ab_0
  - prompt-toolkit=3.0.48=pyha770c72_0
  - psutil=6.0.0=py310ha75aee5_2
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.3=pyhd8ed1ab_0
  - pygments=2.18.0=pyhd8ed1ab_0
  - python=3.10.15=h4a871b0_2_cpython
  - python_abi=3.10=5_cp310
  - pyzmq=26.2.0=py310h71f11fc_3
  - readline=8.2=h5eee18b_0
  - setuptools=75.1.0=py310h06a4308_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.46.1=h9eae976_0
  - stack_data=0.6.2=pyhd8ed1ab_0
  - tk=8.6.13=noxft_h4845f30_101
  - tornado=6.4.1=py310ha75aee5_1
  - traitlets=5.14.3=pyhd8ed1ab_0
  - typing_extensions=4.12.2=pyha770c72_0
  - wcwidth=0.2.13=pyhd8ed1ab_0
  - wheel=0.44.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zeromq=4.3.5=h3b0a872_6
  - zipp=3.20.2=pyhd8ed1ab_0
  - zlib=1.3.1=hb9d3cd8_2
  - pip:
      - accelerate==1.0.1
      - aiohappyeyeballs==2.4.3
      - aiohttp==3.10.10
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anyio==4.6.2.post1
      - async-timeout==4.0.3
      - attrs==24.2.0
      - av==13.1.0
      - certifi==2024.8.30
      - charset-normalizer==3.4.0
      - click==8.1.7
      - cloudpickle==3.1.0
      - contourpy==1.3.0
      - cycler==0.12.1
      - datasets==3.0.1
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - docstring-parser==0.16
      - einops==0.8.0
      - fastapi==0.115.2
      - filelock==3.16.1
      - fire==0.7.0
      - flash-attn==2.6.3
      - fonttools==4.54.1
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - gguf==0.10.0
      - h11==0.14.0
      - httpcore==1.0.6
      - httptools==0.6.4
      - httpx==0.27.2
      - huggingface-hub==0.25.2
      - idna==3.10
      - interegular==0.3.3
      - jinja2==3.1.4
      - jiter==0.6.1
      - jsonschema==4.23.0
      - jsonschema-specifications==2024.10.1
      - kiwisolver==1.4.7
      - lark==1.2.2
      - llvmlite==0.43.0
      - lm-format-enforcer==0.10.6
      - markupsafe==3.0.1
      - matplotlib==3.9.2
      - mistral-common==1.4.4
      - mistral-inference==1.5.0
      - mpmath==1.3.0
      - msgpack==1.1.0
      - msgspec==0.18.6
      - multidict==6.1.0
      - multiprocess==0.70.16
      - networkx==3.4.1
      - numba==0.60.0
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==9.1.0.70
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-ml-py==12.560.30
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.6.77
      - nvidia-nvtx-cu12==12.1.105
      - openai==1.51.2
      - opencv-python-headless==4.10.0.84
      - outlines==0.0.46
      - pandas==2.2.3
      - partial-json-parser==0.2.1.1.post4
      - pillow==10.4.0
      - prometheus-client==0.21.0
      - prometheus-fastapi-instrumentator==7.0.0
      - propcache==0.2.0
      - protobuf==5.28.2
      - py-cpuinfo==9.0.0
      - pyairports==2.1.1
      - pyarrow==17.0.0
      - pycountry==24.6.1
      - pydantic==2.9.2
      - pydantic-core==2.23.4
      - pyparsing==3.2.0
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - pytz==2024.2
      - pyyaml==6.0.2
      - qwen-vl-utils==0.0.8
      - ray==2.37.0
      - referencing==0.35.1
      - regex==2024.9.11
      - requests==2.32.3
      - rpds-py==0.20.0
      - safetensors==0.4.5
      - sentencepiece==0.2.0
      - simple-parsing==0.1.6
      - sniffio==1.3.1
      - starlette==0.40.0
      - sympy==1.13.3
      - termcolor==2.5.0
      - tiktoken==0.7.0
      - tokenizers==0.20.1
      - torch==2.4.0
      - torchvision==0.19.0
      - tqdm==4.66.5
      - transformers==4.45.2
      - triton==3.0.0
      - tzdata==2024.2
      - urllib3==2.2.3
      - uvicorn==0.32.0
      - uvloop==0.21.0
      - vllm==0.6.3
      - watchfiles==0.24.0
      - websockets==13.1
      - xformers==0.0.27.post2
      - xxhash==3.5.0
      - yarl==1.15.4
prefix: /home/nguyen/miniconda3/envs/pixtral

manan05

Oct 29, 2024

@nguyen-brat Hey, can you send the image you are using so we can pinpoint the problem exactly?

lucadiliello

Nov 4, 2024

The error appears only when the examples in the batch use images of different sizes, since the model is not designed to process multiple prompts at the same time.
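
If that diagnosis holds, one possible workaround is to batch only examples whose images share the same size. A minimal sketch, assuming each image exposes a PIL-style `.size` attribute of `(width, height)`; `group_by_image_size` is a hypothetical helper, and `FakeImage` is a stand-in used only so the example runs without image files:

```python
from collections import defaultdict, namedtuple


def group_by_image_size(images):
    """Map each (width, height) to the indices of the images with that size."""
    groups = defaultdict(list)
    for index, image in enumerate(images):
        groups[tuple(image.size)].append(index)
    return dict(groups)


# Stand-in for PIL.Image.Image, which also exposes `.size` as (width, height).
FakeImage = namedtuple("FakeImage", ["size"])

images = [FakeImage((640, 480)), FakeImage((1024, 768)), FakeImage((640, 480))]
batches = group_by_image_size(images)
print(batches)
```

Each index group can then be sent through the processor and `generate` as its own batch. Resizing every image to one common size would be another option, at the cost of distorting or downscaling some inputs.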
