Instructions to use mconcat/Qwopus3.5-27B-v3-NVFP4 with libraries and local apps.

Libraries

How to use mconcat/Qwopus3.5-27B-v3-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="mconcat/Qwopus3.5-27B-v3-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("mconcat/Qwopus3.5-27B-v3-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("mconcat/Qwopus3.5-27B-v3-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use mconcat/Qwopus3.5-27B-v3-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mconcat/Qwopus3.5-27B-v3-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mconcat/Qwopus3.5-27B-v3-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/mconcat/Qwopus3.5-27B-v3-NVFP4

SGLang

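How to use mconcat/Qwopus3.5-27B-v3-NVFP4 with SGLang (note that the Compatibility table in the model card lists SGLang as unsupported for this checkpoint, because compressed-tensors NVFP4 is not supported):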

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mconcat/Qwopus3.5-27B-v3-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mconcat/Qwopus3.5-27B-v3-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mconcat/Qwopus3.5-27B-v3-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mconcat/Qwopus3.5-27B-v3-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mconcat/Qwopus3.5-27B-v3-NVFP4 with Docker Model Runner:
```
docker model run hf.co/mconcat/Qwopus3.5-27B-v3-NVFP4
```

Qwopus3.5-27B-v3-NVFP4

Mixed-precision (NVFP4/FP8/BF16) quantized version of Jackrong/Qwopus3.5-27B-v3.

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, applying a layerwise mixed-precision recipe that balances compression with output quality.

Verified Inference

Local inference was verified on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

vllm==0.17.1
transformers==5.3.0
llm-compressor==0.14.1.dev24

What was verified:

The mixed-precision export completed successfully via llm-compressor
MTP weights are included in the main safetensors file
The checkpoint loads and generates correct output in vLLM

Blackwell GPU Notes

On Blackwell GPUs (RTX 5090, RTX PRO 6000), two patches may be required:

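TMA patch: In vllm/model_executor/layers/fla/ops/utils.py, change the compute-capability check from >= 9 to 9 <= x < 12, so that it no longer matches GPUs whose major compute capability is 12: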

UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && \
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
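
The same substitution can be sketched in Python, which may help where `sed -i` is unavailable or when you want to check whether the file has already been patched. The helper name `patch_tma_check` is hypothetical and not part of vLLM. The search string is taken from the sed pattern above, and the function works on source text only, leaving file I/O to the caller.

```python
# Sketch: Python equivalent of the sed substitution above.
# `patch_tma_check` is a hypothetical helper, not a vLLM API.

OLD_CHECK = "is_nvidia and torch.cuda.get_device_capability(0)[0] >= 9"
NEW_CHECK = "is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12"


def patch_tma_check(source: str) -> tuple[str, bool]:
    """Return (patched_source, changed).

    The replacement is idempotent: already-patched text no longer contains
    OLD_CHECK, so a second call returns it unchanged.
    """
    if OLD_CHECK not in source:
        return source, False
    return source.replace(OLD_CHECK, NEW_CHECK), True
```

To apply it, read the file at `$UTILS_FILE`, pass its text through `patch_tma_check`, and write the result back only when `changed` is true.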

NVFP4 GEMM backend: FlashInfer's FP4 GEMM kernel has a known bug on SM120 (consumer Blackwell). Use the CUTLASS backend instead:

export VLLM_NVFP4_GEMM_BACKEND=cutlass
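
When the same launch script runs on several GPU generations, the backend can be selected conditionally. The sketch below assumes that SM120 devices report a major compute capability of 12; `nvfp4_env` is a hypothetical helper name, and only the environment variable itself comes from the note above.

```python
import os
from typing import Optional


def nvfp4_env(capability_major: int,
              base_env: Optional[dict] = None) -> dict:
    """Return a copy of the environment, forcing CUTLASS on SM120-class GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: SM120 corresponds to major compute capability 12.
    if capability_major == 12:
        env["VLLM_NVFP4_GEMM_BACKEND"] = "cutlass"
    return env
```

The major version can be read with `torch.cuda.get_device_capability(0)[0]`, and the returned dictionary can be passed as `env=` to `subprocess.run` when starting `vllm serve`.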

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor:

| Precision | Layers |
|---|---|
| FP8 W8A8 | DeltaNet `in_proj_qkv`, `in_proj_z`, `out_proj`; softmax `q_proj`/`k_proj`/`v_proj`; MLP `down_proj` |
| NVFP4 W4A4 | softmax `o_proj`; MLP `gate_proj`/`up_proj` |
| BF16 | `lm_head`, `embed_tokens`, DeltaNet `in_proj_a`/`in_proj_b`, norms, visual encoder, MTP sidecar |
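
The recipe in the table can be read as a lookup from a module's leaf name to its precision. The sketch below is an illustration of that reading, not the llm-compressor recipe used to produce the checkpoint. The full module paths in the usage example and the `visual` / `mtp` path fragments are assumptions about the checkpoint's naming.

```python
# Leaf names taken from the table above.
FP8_W8A8 = {"in_proj_qkv", "in_proj_z", "out_proj",
            "q_proj", "k_proj", "v_proj", "down_proj"}
NVFP4_W4A4 = {"o_proj", "gate_proj", "up_proj"}

# Assumption: path components starting with these strings mark the
# vision encoder and the MTP sidecar, which stay in BF16 as a whole.
BF16_SUBTREES = ("visual", "mtp")


def precision_for(module_name: str) -> str:
    """Map a dotted module path to the precision listed in the table."""
    parts = module_name.split(".")
    if any(p.startswith(BF16_SUBTREES) for p in parts):
        return "BF16"
    leaf = parts[-1]
    if leaf in FP8_W8A8:
        return "FP8 W8A8"
    if leaf in NVFP4_W4A4:
        return "NVFP4 W4A4"
    # lm_head, embed_tokens, in_proj_a/in_proj_b, norms, and anything else.
    return "BF16"
```

For example, `precision_for("model.layers.3.self_attn.o_proj")` yields `"NVFP4 W4A4"`, while a hypothetical `"model.layers.0.linear_attn.in_proj_a"` falls through to `"BF16"`.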

Architecture match with the BF16 source:

model_type=qwen3_5
64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
mtp_num_hidden_layers=1
max_position_embeddings=262144
hidden_size=5120, intermediate_size=17408
vocab_size=248320
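
To make the hybrid layout concrete, the following sketch expands `full_attention_interval=4` over the 64 text layers. It assumes that every fourth layer (counting from 1) uses softmax attention and the rest use DeltaNet; the exact placement is an assumption, but the resulting counts hold for any fixed offset.

```python
NUM_TEXT_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4


def layer_types(num_layers: int, interval: int) -> list:
    """Label each layer as softmax ('full_attention') or DeltaNet ('linear_attention')."""
    # Assumption: layers interval, 2*interval, ... (1-based) use full attention.
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_TEXT_LAYERS, FULL_ATTENTION_INTERVAL)
print(types.count("full_attention"), types.count("linear_attention"))  # 16 48
```

Under this layout only 16 of the 64 layers keep a conventional KV cache.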

Usage

vLLM

# Quote the requirements so the shell does not treat ">" as a redirect
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"

Standard serving:

vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP speculative decoding:

vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

On Blackwell GPUs (see Blackwell GPU Notes), prefix the command with the CUTLASS backend setting:

VLLM_NVFP4_GEMM_BACKEND=cutlass vllm serve mconcat/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
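
Once one of the commands above is running, the server can be queried from Python through its OpenAI-compatible API. This is a minimal sketch using only the standard library. It assumes the server listens on `http://localhost:8000`, as in the curl example earlier on this page. The helper names `build_chat_request` and `ask` and the `max_tokens` default are illustrative choices.

```python
import json
import urllib.request

MODEL_ID = "mconcat/Qwopus3.5-27B-v3-NVFP4"


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs):
    """Send the prompt and return the assistant message content."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # With a reasoning parser enabled, content may be empty if generation
    # stops before the final answer; callers should handle that case.
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` mirrors the curl request shown earlier.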

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)
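
When generating with Transformers directly, there is no equivalent of vLLM's `--reasoning-parser`, so the decoded text contains the reasoning and the answer together. The sketch below assumes the model delimits its reasoning with a closing `</think>` tag, as Qwen3-style chat templates do; that format is an assumption and should be checked against the tokenizer's chat template. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple:
    """Split decoded output into (reasoning, answer).

    If the closing tag is absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may be part of the prompt rather than the output,
    # so remove it only if present.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()
```

It would be applied to the string returned by `tokenizer.decode(...)` on the newly generated tokens.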

Compatibility

| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Verified with `vllm==0.17.1`; MTP works; Blackwell requires CUTLASS backend |
| transformers >= 5.3.0 | Yes | Direct loading with `device_map="auto"` |
| SGLang | No | compressed-tensors NVFP4 not supported |

Notes

This is the smallest quantized variant (~24 GB) and fits comfortably on a single 32 GB GPU in eager mode.
MTP weights are embedded in the main model.safetensors file (no separate model.mtp.safetensors).
The model includes a vision encoder (loaded but unused for text-only inference). Use --skip-mm-profiling with vLLM to skip vision encoder profiling.
KV cache: Do not use --kv-cache-dtype fp8_e4m3 with this model family — the checkpoint lacks calibrated KV scales and will produce degraded output. Use the default BF16 KV cache.
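
As a rough illustration of the memory budget implied by these notes, the sketch below subtracts the approximate weight size from the share of GPU memory that vLLM is allowed to use. It is back-of-the-envelope arithmetic only: it ignores activation memory, CUDA context overhead, and the difference between GB and GiB. `kv_budget_gb` is a hypothetical helper.

```python
def kv_budget_gb(gpu_gb: float, utilization: float, weights_gb: float) -> float:
    """Approximate memory left for KV cache and activations, in GB."""
    return gpu_gb * utilization - weights_gb


# 32 GB GPU, --gpu-memory-utilization 0.85, ~24 GB of weights
print(round(kv_budget_gb(32, 0.85, 24), 1))  # 3.2
```

A margin of roughly 3 GB is consistent with the conservative `--max-model-len 32768` and `--max-num-seqs 1` settings in the serving commands above.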