Instructions to use saricles/MiniMax-M2.7-NVFP4-GB10 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use saricles/MiniMax-M2.7-NVFP4-GB10 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="saricles/MiniMax-M2.7-NVFP4-GB10", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("saricles/MiniMax-M2.7-NVFP4-GB10", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("saricles/MiniMax-M2.7-NVFP4-GB10", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use saricles/MiniMax-M2.7-NVFP4-GB10 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "saricles/MiniMax-M2.7-NVFP4-GB10"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "saricles/MiniMax-M2.7-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/saricles/MiniMax-M2.7-NVFP4-GB10

SGLang

How to use saricles/MiniMax-M2.7-NVFP4-GB10 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "saricles/MiniMax-M2.7-NVFP4-GB10" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "saricles/MiniMax-M2.7-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
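
The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are made up for illustration, and the default port 30000 simply mirrors the SGLang command above (the vLLM server above listens on 8000 instead).

```python
import json
import urllib.request


def build_chat_request(prompt, model, base_url="http://localhost:30000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model, base_url="http://localhost:30000", timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?", "saricles/MiniMax-M2.7-NVFP4-GB10")` should return the reply as a string. If the server was started with `--served-model-name`, pass that name as `model` instead.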

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "saricles/MiniMax-M2.7-NVFP4-GB10" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use saricles/MiniMax-M2.7-NVFP4-GB10 with Docker Model Runner:
```
docker model run hf.co/saricles/MiniMax-M2.7-NVFP4-GB10
```

Ran the model on SGLang and got the following error

by mootuckoo - opened Apr 21

Discussion

mootuckoo

Apr 21

docker image: scitrera/dgx-spark-sglang:0.5.9-dev2-acab24a7-t5
docker command (master):

SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --served-model-name minimax \
    --context-length 65536 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --load-format safetensors \
    --tp-size 2 \
    --host 0.0.0.0 \
    --port 30000 \
    --enable-metrics  \
    --attention-backend triton \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --mem-fraction-static 0.90 \
    --max-running-requests 2 \
    --kv-cache-dtype auto \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph  \
    --cuda-graph-max-bs 2 \
    --trust-remote-code \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank 0

docker command (worker):

SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --served-model-name minimax \
    --context-length 65536 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --load-format safetensors \
    --tp-size 2 \
    --host 0.0.0.0 \
    --port 30000 \
    --enable-metrics  \
    --attention-backend triton \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --mem-fraction-static 0.90 \
    --max-running-requests 2 \
    --kv-cache-dtype auto \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph  \
    --cuda-graph-max-bs 2 \
    --trust-remote-code \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank 1
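
The two commands above differ only in `--node-rank`. As a rough sketch (not something the poster ran), both can be generated from one shared flag list so the nodes cannot drift apart while flags are being changed. The names `SHARED_ARGS`, `node_command` and `render` are hypothetical helpers; the flag values are copied from the commands above.

```python
import shlex

# Flags common to both nodes, copied from the commands above.
SHARED_ARGS = [
    "--model-path", "/data/hf/saricles_MiniMax-M2.7-NVFP4-GB10",
    "--served-model-name", "minimax",
    "--context-length", "65536",
    "--model-loader-extra-config", '{"enable_multithread_load": true}',
    "--load-format", "safetensors",
    "--tp-size", "2",
    "--host", "0.0.0.0",
    "--port", "30000",
    "--enable-metrics",
    "--attention-backend", "triton",
    "--tool-call-parser", "minimax-m2",
    "--reasoning-parser", "minimax-append-think",
    "--mem-fraction-static", "0.90",
    "--max-running-requests", "2",
    "--kv-cache-dtype", "auto",
    "--quantization", "modelopt_fp4",
    "--disable-piecewise-cuda-graph",
    "--cuda-graph-max-bs", "2",
    "--trust-remote-code",
]


def node_command(node_rank, nnodes=2, dist_init_addr="192.168.100.11:50000"):
    """Return the launch argv for one node; only --node-rank varies per node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "python3", "-m", "sglang.launch_server",
        *SHARED_ARGS,
        "--dist-init-addr", dist_init_addr,
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
    ]


def render(argv, env_prefix="SGLANG_ENABLE_SPEC_V2=true"):
    """Render argv as a single shell line with the environment prefix used above."""
    return f"{env_prefix} {shlex.join(argv)}"
```

`render(node_command(0))` gives the master line and `render(node_command(1))` the worker line.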

Log:

NCCL INFO Channel 05/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 0
spark1:154:154 [0] NCCL INFO Connected all trees
spark1:154:154 [0] NCCL INFO Connected binomial trees
spark1:154:154 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
spark1:154:154 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
spark1:154:154 [0] NCCL INFO Symmetric memory is not supported. cuMemEnable 0, ginSupport 1, globalNicFused 0
spark1:154:154 [0] NCCL INFO CC Off, workFifoBytes 1048576
spark1:154:154 [0] NCCL INFO ncclCommInitRankConfig comm 0x6b360870 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f01000 commId 0x7c1d9031e3ce4268 - Init COMPLETE
spark1:154:154 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.34 (kernels 0.04, alloc 0.00, bootstrap 0.13, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.17, rest 0.00)
[2026-04-21 12:03:15 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-04-21 12:03:16] INFO:     127.0.0.1:52312 - "POST /generate HTTP/1.1" 200 OK
[2026-04-21 12:03:16] The server is fired up and ready to roll!
[2026-04-21 12:03:22] INFO:     100.109.56.33:36534 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:33] INFO:     100.109.56.33:40062 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:43] INFO:     100.109.56.33:34192 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:53] INFO:     100.109.56.33:33024 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:03] INFO:     100.109.56.33:54642 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:13] INFO:     100.109.56.33:58456 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:23] INFO:     100.109.56.33:54762 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:33] INFO:     100.109.56.33:33686 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:43] INFO:     100.109.56.33:34000 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:53] INFO:     100.109.56.33:48168 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:04] INFO:     100.109.56.33:40320 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:14] INFO:     100.109.56.33:53680 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:17 TP0] Prefill batch, #new-seq: 1, #new-token: 40, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.05, cuda graph: False
[2026-04-21 12:05:18 TP0] Decode batch, #running-req: 1, #token: 73, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.06, #queue-req: 0
[2026-04-21 12:05:19 TP0] Decode batch, #running-req: 1, #token: 113, token usage: 0.00, cuda graph: True, gen throughput (token/s): 34.78, #queue-req: 0
[2026-04-21 12:05:20 TP0] Decode batch, #running-req: 1, #token: 153, token usage: 0.00, cuda graph: True, gen throughput (token/s): 34.71, #queue-req: 0
[2026-04-21 12:05:21] INFO:     100.109.56.33:53680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:05:24] INFO:     100.109.56.33:53680 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:34] INFO:     100.109.56.33:46724 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:44] INFO:     100.109.56.33:47678 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:54] INFO:     100.109.56.33:37898 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:04] INFO:     100.109.56.33:51120 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:11 TP0] Prefill batch, #new-seq: 1, #new-token: 22, #cached-token: 38, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.75, cuda graph: False
[2026-04-21 12:06:11 TP0] Decode batch, #running-req: 1, #token: 75, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.79, #queue-req: 0
[2026-04-21 12:06:12] INFO:     100.109.56.33:52870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:06:14] INFO:     100.109.56.33:52870 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:25] INFO:     100.109.56.33:46572 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:35 TP0] Prefill batch, #new-seq: 1, #new-token: 30, #cached-token: 58, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.90, cuda graph: False
[2026-04-21 12:06:35] INFO:     100.109.56.33:40296 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:36 TP0] Decode batch, #running-req: 1, #token: 110, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.62, #queue-req: 0
[2026-04-21 12:06:36] INFO:     100.109.56.33:46586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:06:46] INFO:     100.109.56.33:57402 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:56] INFO:     100.109.56.33:48940 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:07:06 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3370, in run_scheduler_process
    scheduler.run_event_loop()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1241, in run_event_loop
    dispatch_event_loop(self)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3246, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1311, in event_loop_overlap
    pop_and_process()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1281, in pop_and_process
    self.process_batch_result(tmp_batch, tmp_result)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 2614, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler_output_processor_mixin.py", line 130, in process_batch_result_prefill
    result.copy_done.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 245, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[2026-04-21 12:07:06] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from currentStreamCaptureStatusMayInitCtx at /build/pytorch/c10/cuda/CUDAGraphsC10Utils.h:71 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xe8 (0xe6e0043a2d48 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x290 (0xe6e0044a33b0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x109bee4 (0xe6e0051abee4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x480650 (0xe6e004380650 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #4: c10::TensorImpl::~TensorImpl() + 0x14 (0xe6e004333504 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x5f8558 (0xe6e019798558 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc19bbc (0xe6e019db9bbc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: sglang::scheduler_TP0() [0x524e74]
frame #8: sglang::scheduler_TP0() [0x4fad7c]
frame #9: sglang::scheduler_TP0() [0x524cf0]
frame #10: sglang::scheduler_TP0() [0x4d67d0]
frame #11: sglang::scheduler_TP0() [0x5a67ec]
frame #12: sglang::scheduler_TP0() [0x5a67f4]
frame #13: sglang::scheduler_TP0() [0x5a67f4]
frame #14: sglang::scheduler_TP0() [0x5a67f4]
frame #15: sglang::scheduler_TP0() [0x5a67f4]
frame #16: sglang::scheduler_TP0() [0x5a67f4]
frame #17: sglang::scheduler_TP0() [0x4cf898]
frame #18: sglang::scheduler_TP0() [0x524e74]
frame #19: _PyEval_EvalFrameDefault + 0x420c (0x5688d0 in sglang::scheduler_TP0)
frame #20: PyEval_EvalCode + 0x130 (0x5632b4 in sglang::scheduler_TP0)
frame #21: PyRun_StringFlags + 0xe0 (0x59c330 in sglang::scheduler_TP0)
frame #22: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in sglang::scheduler_TP0)
frame #23: Py_RunMain + 0x390 (0x68b380 in sglang::scheduler_TP0)
frame #24: Py_BytesMain + 0x28 (0x68ae88 in sglang::scheduler_TP0)
frame #25: <unknown function> + 0x284c4 (0xe6e094f084c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x98 (0xe6e094f08598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #27: _start + 0x30 (0x5f6770 in sglang::scheduler_TP0)

Fatal Python error: Aborted

Thread 0x0000e6c9abfff180 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6db09e4f180 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6c9b7fff180 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6c96ffff180 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Current thread 0x0000e6e095205720 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3375 in run_scheduler_process
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, markupsafe._speedups, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, sentencepiece._sentencepiece, regex._regex, yaml._yaml, cuda_utils, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.encparams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._generator, numpy.random._mt19937, numpy.random._philox, 
numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, _cyutility, scipy._cyutility, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._batched_linalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpacklib, scipy.sparse.linalg._propack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation_cy, scipy.spatial.transform._rigid_transform_cy, scipy.optimize._direct, setproctitle._setproctitle, msgspec._core, __triton_launcher, uvloop.loop (total: 155)
/workspace/.monitor/monitor-launch-minimax.sh.rendered.body.sh: line 38:    53 Killed                  SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 --served-model-name minimax --context-length 65536 --model-loader-extra-config '{"enable_multithread_load": true}' --load-format safetensors --tp-size 2 --host 0.0.0.0 --port 30000 --enable-metrics --attention-backend triton --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --mem-fraction-static 0.90 --max-running-requests 2 --kv-cache-dtype auto --quantization modelopt_fp4 --disable-piecewise-cuda-graph --cuda-graph-max-bs 2 --trust-remote-code --dist-init-addr 192.168.100.11:50000 --nnodes 2 --node-rank 0
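
The error text in the log notes that CUDA kernel errors may be reported asynchronously and suggests passing `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of doing that from a Python launcher follows; `debug_env` and `run_with_debug_env` are hypothetical helper names, and nothing in this thread establishes whether the variable interacts well with CUDA graph capture in this SGLang build, so treat it as a debugging aid only and expect a much slower server.

```python
import os
import subprocess


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the PyTorch error message in the log above.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_debug_env(argv):
    """Run a launch command (list of strings) under the debug environment."""
    return subprocess.run(argv, env=debug_env(), check=False)
```

In a two-node setup like the one above, the variable would presumably need to be set on both the master and the worker so that the faulting kernel is reported on whichever rank hits it.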

saricles

Owner Apr 22

Heads up: we've been running exclusively vLLM on GB10 and have literally zero SGLang experience, so take this with that caveat.

The first place I'd look is kernel compatibility between SGLang's Triton attention backend and the NVFP4-quantized attention in this quant. Our GB10 variants intentionally keep self_attn in NVFP4 (not BF16 like the standard NVFP4 reference config), which means the attention kernel path needs to understand the NVFP4 tensor layout. FlashInfer does; Triton may not on your SGLang build.

Worth trying --attention-backend flashinfer, --kv-cache-dtype fp8_e4m3 (not auto), and if possible the voipmonitor/sglang:cu130 image.
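
As an illustration of the first two suggestions, here is a small sketch that swaps flag values in an existing argument list instead of retyping the whole command. `set_flag` is a hypothetical helper and assumes each flag it touches takes exactly one value; the `original` list is only an excerpt of the poster's command.

```python
def set_flag(argv, flag, value):
    """Return a copy of argv with `flag` set to `value`.

    Replaces the existing value if the flag is present, otherwise appends it.
    Assumes the flag takes exactly one value.
    """
    out = list(argv)
    if flag in out:
        out[out.index(flag) + 1] = value
    else:
        out += [flag, value]
    return out


# Excerpt of the original command's flags (not the full list).
original = [
    "--tp-size", "2",
    "--attention-backend", "triton",
    "--kv-cache-dtype", "auto",
]

# Apply the two flag changes suggested above.
suggested = set_flag(original, "--attention-backend", "flashinfer")
suggested = set_flag(suggested, "--kv-cache-dtype", "fp8_e4m3")
print(" ".join(suggested))
```

Changing one of the two flags per run, rather than both at once, would make it easier to tell which change mattered.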

If you want an NVFP4 variant with BF16 attention as a clean A/B, lukealonso/MiniMax-M2.7-NVFP4 would be a drop-in to test whether attention-quant is the variable.

saricles

Owner Apr 22

Pulled up MiniMax's official SGLang deploy guide: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/docs/sglang_deploy_guide.md

Their recommended command is much more minimal than yours:

python -m sglang.launch_server \
  --model-path <path> \
  --tp-size 4 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --trust-remote-code \
  --mem-fraction-static 0.85

What jumps out: they don't set --attention-backend, --kv-cache-dtype, --cuda-graph-max-bs, or --disable-piecewise-cuda-graph at all; your command has all four as overrides. The most likely suspect for the illegal-memory error is --disable-piecewise-cuda-graph, since MiniMax's vLLM doc (same model, different framework) explicitly calls out cudagraph_mode=PIECEWISE as the fix for that exact cudaErrorIllegalAddress. You're disabling what they recommend enabling.

For the NVFP4 quant specifically, --quantization modelopt_fp4 may still be needed (their baseline assumes the official FP8 release, not our NVFP4 repack) — leave that one in.

Suggested test: strip your command down to MiniMax's recommended base flags + --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 + --quantization modelopt_fp4. If it runs, add back one flag at a time to find the actual culprit.
