Ran the model on SGLang and got the following error

#2
by mootuckoo - opened

docker image: scitrera/dgx-spark-sglang:0.5.9-dev2-acab24a7-t5
docker command (master):

SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --served-model-name minimax \
    --context-length 65536 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --load-format safetensors \
    --tp-size 2 \
    --host 0.0.0.0 \
    --port 30000 \
    --enable-metrics  \
    --attention-backend triton \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --mem-fraction-static 0.90 \
    --max-running-requests 2 \
    --kv-cache-dtype auto \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph  \
    --cuda-graph-max-bs 2 \
    --trust-remote-code \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank 0

docker command (worker):

SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --served-model-name minimax \
    --context-length 65536 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --load-format safetensors \
    --tp-size 2 \
    --host 0.0.0.0 \
    --port 30000 \
    --enable-metrics  \
    --attention-backend triton \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --mem-fraction-static 0.90 \
    --max-running-requests 2 \
    --kv-cache-dtype auto \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph  \
    --cuda-graph-max-bs 2 \
    --trust-remote-code \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank 1

Log:

NCCL INFO Channel 05/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark1:154:154 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 0
spark1:154:154 [0] NCCL INFO Connected all trees
spark1:154:154 [0] NCCL INFO Connected binomial trees
spark1:154:154 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
spark1:154:154 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
spark1:154:154 [0] NCCL INFO Symmetric memory is not supported. cuMemEnable 0, ginSupport 1, globalNicFused 0
spark1:154:154 [0] NCCL INFO CC Off, workFifoBytes 1048576
spark1:154:154 [0] NCCL INFO ncclCommInitRankConfig comm 0x6b360870 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f01000 commId 0x7c1d9031e3ce4268 - Init COMPLETE
spark1:154:154 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 0.34 (kernels 0.04, alloc 0.00, bootstrap 0.13, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.17, rest 0.00)
[2026-04-21 12:03:15 TP0] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-04-21 12:03:16] INFO:     127.0.0.1:52312 - "POST /generate HTTP/1.1" 200 OK
[2026-04-21 12:03:16] The server is fired up and ready to roll!
[2026-04-21 12:03:22] INFO:     100.109.56.33:36534 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:33] INFO:     100.109.56.33:40062 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:43] INFO:     100.109.56.33:34192 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:03:53] INFO:     100.109.56.33:33024 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:03] INFO:     100.109.56.33:54642 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:13] INFO:     100.109.56.33:58456 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:23] INFO:     100.109.56.33:54762 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:33] INFO:     100.109.56.33:33686 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:43] INFO:     100.109.56.33:34000 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:04:53] INFO:     100.109.56.33:48168 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:04] INFO:     100.109.56.33:40320 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:14] INFO:     100.109.56.33:53680 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:17 TP0] Prefill batch, #new-seq: 1, #new-token: 40, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.05, cuda graph: False
[2026-04-21 12:05:18 TP0] Decode batch, #running-req: 1, #token: 73, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.06, #queue-req: 0
[2026-04-21 12:05:19 TP0] Decode batch, #running-req: 1, #token: 113, token usage: 0.00, cuda graph: True, gen throughput (token/s): 34.78, #queue-req: 0
[2026-04-21 12:05:20 TP0] Decode batch, #running-req: 1, #token: 153, token usage: 0.00, cuda graph: True, gen throughput (token/s): 34.71, #queue-req: 0
[2026-04-21 12:05:21] INFO:     100.109.56.33:53680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:05:24] INFO:     100.109.56.33:53680 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:34] INFO:     100.109.56.33:46724 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:44] INFO:     100.109.56.33:47678 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:05:54] INFO:     100.109.56.33:37898 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:04] INFO:     100.109.56.33:51120 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:11 TP0] Prefill batch, #new-seq: 1, #new-token: 22, #cached-token: 38, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.75, cuda graph: False
[2026-04-21 12:06:11 TP0] Decode batch, #running-req: 1, #token: 75, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.79, #queue-req: 0
[2026-04-21 12:06:12] INFO:     100.109.56.33:52870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:06:14] INFO:     100.109.56.33:52870 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:25] INFO:     100.109.56.33:46572 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:35 TP0] Prefill batch, #new-seq: 1, #new-token: 30, #cached-token: 58, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.90, cuda graph: False
[2026-04-21 12:06:35] INFO:     100.109.56.33:40296 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:36 TP0] Decode batch, #running-req: 1, #token: 110, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.62, #queue-req: 0
[2026-04-21 12:06:36] INFO:     100.109.56.33:46586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-21 12:06:46] INFO:     100.109.56.33:57402 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:06:56] INFO:     100.109.56.33:48940 - "GET /metrics HTTP/1.1" 200 OK
[2026-04-21 12:07:06 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3370, in run_scheduler_process
    scheduler.run_event_loop()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1241, in run_event_loop
    dispatch_event_loop(self)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3246, in dispatch_event_loop
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1311, in event_loop_overlap
    pop_and_process()
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 1281, in pop_and_process
    self.process_batch_result(tmp_batch, tmp_result)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 2614, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler_output_processor_mixin.py", line 130, in process_batch_result_prefill
    result.copy_done.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 245, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[2026-04-21 12:07:06] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from currentStreamCaptureStatusMayInitCtx at /build/pytorch/c10/cuda/CUDAGraphsC10Utils.h:71 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xe8 (0xe6e0043a2d48 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x290 (0xe6e0044a33b0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x109bee4 (0xe6e0051abee4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x480650 (0xe6e004380650 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #4: c10::TensorImpl::~TensorImpl() + 0x14 (0xe6e004333504 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x5f8558 (0xe6e019798558 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc19bbc (0xe6e019db9bbc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: sglang::scheduler_TP0() [0x524e74]
frame #8: sglang::scheduler_TP0() [0x4fad7c]
frame #9: sglang::scheduler_TP0() [0x524cf0]
frame #10: sglang::scheduler_TP0() [0x4d67d0]
frame #11: sglang::scheduler_TP0() [0x5a67ec]
frame #12: sglang::scheduler_TP0() [0x5a67f4]
frame #13: sglang::scheduler_TP0() [0x5a67f4]
frame #14: sglang::scheduler_TP0() [0x5a67f4]
frame #15: sglang::scheduler_TP0() [0x5a67f4]
frame #16: sglang::scheduler_TP0() [0x5a67f4]
frame #17: sglang::scheduler_TP0() [0x4cf898]
frame #18: sglang::scheduler_TP0() [0x524e74]
frame #19: _PyEval_EvalFrameDefault + 0x420c (0x5688d0 in sglang::scheduler_TP0)
frame #20: PyEval_EvalCode + 0x130 (0x5632b4 in sglang::scheduler_TP0)
frame #21: PyRun_StringFlags + 0xe0 (0x59c330 in sglang::scheduler_TP0)
frame #22: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in sglang::scheduler_TP0)
frame #23: Py_RunMain + 0x390 (0x68b380 in sglang::scheduler_TP0)
frame #24: Py_BytesMain + 0x28 (0x68ae88 in sglang::scheduler_TP0)
frame #25: <unknown function> + 0x284c4 (0xe6e094f084c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x98 (0xe6e094f08598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #27: _start + 0x30 (0x5f6770 in sglang::scheduler_TP0)

Fatal Python error: Aborted

Thread 0x0000e6c9abfff180 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6db09e4f180 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6c9b7fff180 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x0000e6c96ffff180 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Current thread 0x0000e6e095205720 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/scheduler.py", line 3375 in run_scheduler_process
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, markupsafe._speedups, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, sentencepiece._sentencepiece, regex._regex, yaml._yaml, cuda_utils, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.encparams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._generator, numpy.random._mt19937, numpy.random._philox, 
numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, _cyutility, scipy._cyutility, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._batched_linalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpacklib, scipy.sparse.linalg._propack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation_cy, scipy.spatial.transform._rigid_transform_cy, scipy.optimize._direct, setproctitle._setproctitle, msgspec._core, __triton_launcher, uvloop.loop (total: 155)
/workspace/.monitor/monitor-launch-minimax.sh.rendered.body.sh: line 38:    53 Killed                  SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 --served-model-name minimax --context-length 65536 --model-loader-extra-config '{"enable_multithread_load": true}' --load-format safetensors --tp-size 2 --host 0.0.0.0 --port 30000 --enable-metrics --attention-backend triton --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --mem-fraction-static 0.90 --max-running-requests 2 --kv-cache-dtype auto --quantization modelopt_fp4 --disable-piecewise-cuda-graph --cuda-graph-max-bs 2 --trust-remote-code --dist-init-addr 192.168.100.11:50000 --nnodes 2 --node-rank 0

Heads up: we've been running exclusively vLLM on GB10 and have literally zero SGLang experience, so take this with that caveat.

The first place I'd look is kernel compatibility between SGLang's Triton attention backend and the NVFP4-quantized attention in this quant. Our GB10 variants intentionally keep self_attn in NVFP4 (not BF16 like the standard NVFP4 reference config), which means the attention kernel path needs to understand the NVFP4 tensor layout. FlashInfer does; Triton may not on your SGLang build.

Worth trying --attention-backend flashinfer, --kv-cache-dtype fp8_e4m3 (not auto), and if possible the voipmonitor/sglang:cu130 image.
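
As a rough sketch of that A/B, here is the master-node command from your post with only those two flags changed. Everything else is copied as-is; whether this SGLang build actually accepts the flashinfer backend and an fp8_e4m3 KV cache on GB10 is an assumption I haven't verified. The snippet only prints the command (a dry run), and NODE_RANK and AB_CMD are just illustrative variable names.

```shell
# Dry run: builds the launch command and prints it; nothing is started.
# Only --attention-backend and --kv-cache-dtype differ from the original post.
# Assumption: this SGLang build supports both values (untested on GB10).
NODE_RANK="${NODE_RANK:-0}"   # 0 on the master, 1 on the worker

AB_CMD="SGLANG_ENABLE_SPEC_V2=true python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --served-model-name minimax \
    --context-length 65536 \
    --model-loader-extra-config '{\"enable_multithread_load\": true}' \
    --load-format safetensors \
    --tp-size 2 \
    --host 0.0.0.0 \
    --port 30000 \
    --enable-metrics \
    --attention-backend flashinfer \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --mem-fraction-static 0.90 \
    --max-running-requests 2 \
    --kv-cache-dtype fp8_e4m3 \
    --quantization modelopt_fp4 \
    --disable-piecewise-cuda-graph \
    --cuda-graph-max-bs 2 \
    --trust-remote-code \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank $NODE_RANK"

echo "$AB_CMD"
```

Run it with NODE_RANK=1 to get the worker command. If the crash goes away with these two changes, flipping them back one at a time should show which of the two mattered.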

If you want an NVFP4 variant with BF16 attention as a clean A/B, lukealonso/MiniMax-M2.7-NVFP4 would be a drop-in to test whether attention-quant is the variable.

Pulled up MiniMax's official SGLang deploy guide: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/docs/sglang_deploy_guide.md

Their recommended command is much more minimal than yours:

python -m sglang.launch_server \
  --model-path <path> \
  --tp-size 4 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --trust-remote-code \
  --mem-fraction-static 0.85

What jumps out: they don't set --attention-backend, --kv-cache-dtype, --cuda-graph-max-bs, or --disable-piecewise-cuda-graph at all, while your command overrides all four. The most likely suspect for the illegal-memory error is --disable-piecewise-cuda-graph, since MiniMax's vLLM doc (same model, different framework) explicitly calls out cudagraph_mode=PIECEWISE as the fix for that exact cudaErrorIllegalAddress. You're disabling what they recommend enabling.

For the NVFP4 quant specifically, --quantization modelopt_fp4 may still be needed, since their baseline assumes the official FP8 release, not our NVFP4 repack. Leave that one in.

Suggested test: strip your command down to MiniMax's recommended base flags + --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 + --quantization modelopt_fp4. If it runs, add back one flag at a time to find the actual culprit.
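
A minimal sketch of that bisection, under a few assumptions of mine: the guide's --tp-size 4 is swapped for your --tp-size 2, and the host, port, and multi-node flags are carried over from your post because your setup spans two nodes. SGLANG_ENABLE_SPEC_V2=true is left out as one more variable to re-add later. EXTRA is a hypothetical variable holding the single flag under test, and the snippet only prints the command rather than launching anything.

```shell
# Dry run: prints the stripped-down command; nothing is started.
NODE_RANK="${NODE_RANK:-0}"   # 0 on the master, 1 on the worker
EXTRA="${EXTRA:-}"            # hypothetical: the one flag being re-added this run

# Base flags from MiniMax's guide, plus the quantization flag and the
# networking flags from the original post (an adaptation, not from the guide).
BASE_CMD="python3 -m sglang.launch_server \
    --model-path /data/hf/saricles_MiniMax-M2.7-NVFP4-GB10 \
    --tp-size 2 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --dist-init-addr 192.168.100.11:50000 \
    --nnodes 2 \
    --node-rank $NODE_RANK"

# Append the flag under test, if one was given.
if [ -n "$EXTRA" ]; then
    BASE_CMD="$BASE_CMD $EXTRA"
fi

echo "$BASE_CMD"
```

For example, EXTRA='--attention-backend triton' on the first pass, then EXTRA='--disable-piecewise-cuda-graph', and so on, with the same EXTRA on both nodes each time. When a run does crash, setting CUDA_LAUNCH_BLOCKING=1 (the hint printed in your log) may make the traceback point closer to the kernel that actually faulted, at the cost of speed.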
