vllm

#2
by win10 - opened

@cpatonn Can you help me?
"""
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] EngineCore failed to start.
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] Traceback (most recent call last):
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 858, in run_engine_core
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 634, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] super().__init__(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self._init_executor()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.driver_worker.init_device()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 273, in init_device
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 564, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] MultiModalBudget(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 42, in __init__
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] max_tokens_by_modality = mm_registry.get_max_tokens_per_item_by_modality(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 167, in get_max_tokens_per_item_by_modality
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return profiler.get_mm_max_contiguous_tokens(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 369, in get_mm_max_contiguous_tokens
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return self._get_mm_max_tokens(seq_len, mm_counts, mm_embeddings_only=False)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 351, in _get_mm_max_tokens
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 263, in _get_dummy_mm_inputs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] processor_inputs = factory.get_dummy_processor_inputs(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/profiling.py", line 120, in get_dummy_processor_inputs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] dummy_text = self.get_dummy_text(mm_counts)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_1v.py", line 1176, in get_dummy_text
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] hf_processor = self.info.get_hf_processor()
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1186, in get_hf_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return self.ctx.get_hf_processor(**kwargs)
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/processing.py", line 1049, in get_hf_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return cached_processor_from_config(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 251, in cached_processor_from_config
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] return cached_get_processor_without_dynamic_kwargs(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 210, in cached_get_processor_without_dynamic_kwargs
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] processor = cached_get_processor(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/processor.py", line 155, in get_processor
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] raise TypeError(
(EngineCore_DP0 pid=62) ERROR 12-10 08:22:07 [core.py:867] TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
(EngineCore_DP0 pid=62) Process EngineCore_DP0:
(EngineCore_DP0 pid=62) Traceback (most recent call last):
(EngineCore_DP0 pid=62)   [same traceback as above, re-raised from run_engine_core]
(EngineCore_DP0 pid=62) TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
[rank0]:[W1210 08:22:08.174477121 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
"""
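For context on what the TypeError means: vLLM's processor loader enforces an isinstance check, and when the installed transformers release lacks the model's processor class, AutoProcessor silently falls back to returning a plain tokenizer, which that check rejects. A minimal sketch of the guard with stand-in classes (not vLLM's actual code):

```python
# Stand-ins for the real transformers classes; only the type relationship matters here.
class ProcessorMixin:           # multimodal processor base class
    pass

class PreTrainedTokenizerFast:  # what AutoProcessor falls back to
    pass

def validate_processor(processor):
    """Mimics the type guard in vllm/transformers_utils/processor.py."""
    if not isinstance(processor, ProcessorMixin):
        raise TypeError(
            "Invalid type of HuggingFace processor. Expected ProcessorMixin, "
            f"found {type(processor).__name__}"
        )
    return processor

# A tokenizer is not a ProcessorMixin, so the engine fails at startup:
try:
    validate_processor(PreTrainedTokenizerFast())
except TypeError as e:
    print(e)
```

So the fix belongs on the transformers side: the processor class the model registers has to actually exist in the installed release, which is why installing transformers from source resolves it.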

Got this far:

#!/bin/bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly # add variant subdirectory here if needed
uv pip install --upgrade git+https://github.com/huggingface/transformers.git
uv pip install "numpy<2.3"

But I end up with a Triton symbol lookup error:

(EngineCore_DP0 pid=247846) ImportError: /home/<user>/.triton/cache/KJGBFCDPODDSTMI2D3ESJLJ6JDTKTCDGYNBY5NRKEHPZNNN6VV2Q/cuda_utils.cpython-312-x86_64-linux-gnu.so: undefined symbol: cuModuleGetFunction

Getting the runaround from both GPT-5.1 and Gemini-3-pro on this one. Seems like there's some instability at the cutting edge of this stuff (as always seems to be the case with vLLM...).

It fails for me on CUDA 12.8, 12.9, and 13.0, even after blowing away ~/.triton/cache (or the entire ~/.triton) on each attempt.
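An undefined `cuModuleGetFunction` from a cached Triton extension usually means the JIT artifact was built against one CUDA stack and loaded against another. Before rebuilding anything heavier, it can help to compare the driver-side and toolkit-side versions and clear the stale cache (a diagnostic sketch; the guards are there so it degrades gracefully on machines without a GPU):

```shell
# Driver-side CUDA support (what libcuda.so actually provides at runtime)
command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
    || echo "nvidia-smi not found"

# Toolkit-side CUDA version the installed torch wheel was compiled against
python3 -c "import torch; print(torch.version.cuda)" 2>/dev/null \
    || echo "torch not importable"

# Drop Triton's kernel cache so nothing compiled against the old stack survives
rm -rf "${HOME}/.triton/cache"
```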

cyankiwi org

@win10 I think there is something wrong with your transformers version. Could you try installing transformers from source using

pip install --upgrade git+https://github.com/huggingface/transformers.git
cyankiwi org

@jmckenzie-dev I think there is a mismatch between your CUDA, PyTorch, and vLLM. Could you try installing the PyTorch version suitable for your CUDA following pytorch.org, and then building vLLM from source against your existing PyTorch?

git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .

and then install transformers from source

uv pip install --upgrade git+https://github.com/huggingface/transformers.git

It might take time to build vLLM from source, and you might need to install some additional libraries in the process. Or maybe try Docker?

Oh, it's very possible that's the case. I just wanted to avoid going back into that hell-hole that is building vllm from source again. :)
The recommended instructions to update transformers also end up breaking the numpy version, so that has to be downgraded.

I'll see about doing the custom full-stack vllm from HEAD build locally later today once I'm done doing an exl3 quant of Devstral-2 and seeing how that goes.

Thanks for the ping.

I've also faced countless errors with various combinations of component versions. It ended up serving with the below:

uv venv venv-glm4.6v --python 3.13 --seed
source venv-glm4.6v/bin/activate
uv pip install "vllm>=0.12.0"
uv pip install --upgrade git+https://github.com/huggingface/transformers.git
uv pip install numpy==2.2.6
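After those three installs, a quick probe confirms the pins actually landed together in the active venv (a sketch; works for any pip-installed package):

```shell
# Print the installed version of each package, or a warning if it is missing
for pkg in vllm transformers numpy; do
    python3 -c "import importlib.metadata as m; print('$pkg', m.version('$pkg'))" \
        2>/dev/null || echo "$pkg: not installed"
done
```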

I also faced an issue where vLLM would get stuck and not show any logs. It turned out to be a silent download of the model: https://github.com/vllm-project/vllm/issues/17676

I use the below command for serving on 4x3090:

vllm serve cyankiwi/GLM-4.6V-AWQ-4bit --max-model-len 20000 --tensor-parallel-size 1 --pipeline-parallel-size 4 --enable-expert-parallel --host 0.0.0.0 --port 8001

I did finally get things working with a similar approach (upgraded transformers, pinned a downgraded numpy). Generation seemed to work well enough, though local inference was intermittently missing the opening block and not finishing its generation. Not sure if it's a chat template thing on my end or what, but chat template errors seem to be incredibly common for new models across inference environments. Thanks for the assist @cpatonn.

Thank you for the quant @cpatonn

Sharing my dockerfile for easier build

FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    UV_SYSTEM_PYTHON=1

# Install system dependencies and add PPA for newer Python
RUN apt-get update && apt-get install -y \
    software-properties-common \
    git \
    curl \
    && add-apt-repository ppa:deadsnakes/ppa -y \
    && apt-get update \
    && apt-get install -y \
    python3.13 \
    python3.13-dev \
    python3.13-venv \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install uv
RUN curl -LsSf https://astral.sh/uv/install.sh | sh

ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"

# Set Python 3.13 as default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.13 1 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.13 1

# Install Python packages
RUN uv pip install --system "vllm>=0.12.0"
RUN uv pip install --system --upgrade git+https://github.com/huggingface/transformers.git
RUN uv pip install --system numpy==2.2.6


WORKDIR /workspace
EXPOSE 8000

ENTRYPOINT ["vllm", "serve"]

I built vLLM 0.16.0rc2 from source (not even on PyPI yet) and was able to fix it this way.

vLLM's requirements are set to constrain transformers to <5.0, but the glm46v processors land in >=5.0. I built with and configured transformers 4.57.6, dated Jan 16 2026.

https://github.com/vllm-project/vllm/blob/91ac5d9bfda99745ece40f5258f17a4c0585db40/requirements/common.txt#L10
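That pin is why a from-source transformers (which resolves to a 5.x dev version) conflicts at install time while still being what the model needs at runtime. The comparison the resolver performs can be illustrated with a stdlib-only sketch (not pip's actual resolver logic):

```python
def satisfies_lt_5(version: str) -> bool:
    """True when a release version sorts before 5.0 (numeric major/minor only)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) < (5, 0)

print(satisfies_lt_5("4.57.6"))  # True: fits vllm's transformers<5.0 pin
print(satisfies_lt_5("5.0.0"))   # False: the line carrying the glm46v processors
```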

A bit of Clauding later: the workaround is to edit the preprocessor and video_preprocessor configs to use the older "4v" in the type and class names.

Still not perfect:

#!/usr/bin/env bash
set -euo pipefail

source /mnt/sdb/vllm/bin/activate
source /etc/vllm/env

CONTEXT_LENGTH=131072

vllm serve \
    cyankiwi/GLM-4.6V-AWQ-4bit \
    --served-model-name glm46v \
    --swap-space 4 \
    --max-num-seqs 20 \
    --max-model-len $CONTEXT_LENGTH \
    --gpu-memory-utilization 0.95 \
    --limit-mm-per-prompt '{"video":0,"image":1}' \
    --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 850000}' \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port "${VLLM_PORT}" \
    --attention-backend FLASHINFER

...
(APIServer pid=200942) Keyword argument min_pixels is not a valid argument for this processor and will be ignored.
(APIServer pid=200942) Keyword argument max_pixels is not a valid argument for this processor and will be ignored.
...
Perhaps one can edit preprocessor_config.json manually and set size.shortest_edge and size.longest_edge instead.
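A small script can make that edit reproducibly. The field names below follow GLM-4V-style image processor configs and are assumptions; diff the file afterwards before serving:

```python
import json
from pathlib import Path

def patch_size_limits(cfg_path: Path, shortest: int, longest: int) -> dict:
    """Set size.shortest_edge / size.longest_edge in a preprocessor config file."""
    cfg = json.loads(cfg_path.read_text())
    size = cfg.setdefault("size", {})
    size["shortest_edge"] = shortest
    size["longest_edge"] = longest
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg

# Example: mirror the ignored min_pixels/max_pixels kwargs from the serve command
# patch_size_limits(Path("preprocessor_config.json"), 3136, 850000)
```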

The workaround just uses the old GLM4v code to pass parameters into the already functional code in vLLM.

Perhaps you can simply install transformers >=5.0 now and it just works anyway despite the major version change; I didn't try yet, but one might imagine there are reasons vLLM set the constraint to <5.0. It might break other things even if this model works?
