
[Bug] RuntimeError: RMSNorm failed with error code invalid configuration argument #3304

Description

@YJHMITWEB

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi, I am using the main branch of SGLang with Mixtral-8x22B downloaded from Hugging Face.

CUDA: 12.4
Hardware: 2 nodes, each with 4 H100 96GB GPUs.

I am deploying the server using:

python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 0 --trust-remote-code --disable-cuda-graph
python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 1 --trust-remote-code --disable-cuda-graph

Then I run the MMLU benchmark:

cd sglang/benchmark/mmlu
python3 bench_sglang.py --nsub 10

It fails with the following error:

[2025-02-04 21:18:29 DP3 TP3] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
    self.forward_thread_func_()
  File "python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "sglang/python/sglang/srt/model_executor/model_runner.py", line 787, in forward
    return self.forward_idle(forward_batch)
  File "sglang/python/sglang/srt/model_executor/model_runner.py", line 770, in forward_idle
    return self.model.forward(
  File "sglang/python/sglang/srt/models/mixtral.py", line 314, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "sglang/python/sglang/srt/models/mixtral.py", line 286, in forward
    hidden_states, residual = layer(
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "sglang/python/sglang/srt/models/mixtral.py", line 232, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.10/site-packages/vllm/model_executor/custom_op.py", line 26, in forward
    return self._forward_method(*args, **kwargs)
  File "sglang/python/sglang/srt/layers/layernorm.py", line 59, in forward_cuda
    out = rmsnorm(x, self.weight.data, self.variance_epsilon)
  File "python3.10/site-packages/sgl_kernel/ops/__init__.py", line 156, in rmsnorm
    torch.ops.sgl_kernels.rmsnorm(out, input, weight, eps, _get_cuda_stream(device))
  File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "python3.10/site-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
  File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: RMSNorm failed with error code invalid configuration argument
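
For context, "invalid configuration argument" is CUDA's cudaErrorInvalidConfiguration, which is raised when a kernel is launched with out-of-range grid or block dimensions. The traceback also goes through forward_idle (model_runner.py), so the model is being run on an idle batch; if that batch contains zero tokens, the fused kernel would be launched with a grid dimension of 0. A minimal sketch of the shapes involved (the hidden size 6144 is my assumption based on Mixtral-8x22B's config; the pure-PyTorch path below handles the empty input fine, which is what makes the fused kernel the suspect):

```python
import torch

# Zero-token batch, as an idle DP worker might produce.
# hidden_size = 6144 is assumed from Mixtral-8x22B's config.json.
x = torch.empty(0, 6144)

# Plain-PyTorch RMSNorm math on the empty batch: no kernel launch
# with grid size 0, so this path does not crash.
var = x.pow(2).mean(dim=-1, keepdim=True)
y = x * torch.rsqrt(var + 1e-5)
print(y.shape)  # torch.Size([0, 6144])
```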

Reproduction

Model: Mixtral 8x22B
Script: MMLU benchmark

Please see above.
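
For anyone trying to check the numerics independently of the fused kernel, here is a plain-PyTorch RMSNorm reference (a sketch; the 1e-5 epsilon and fp32 accumulation mirror the usual Mixtral settings, but verify against the model's config.json):

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight, accumulated in fp32."""
    dtype = x.dtype
    x = x.float()
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    out = x * torch.rsqrt(variance + eps)
    return (out * weight.float()).to(dtype)

# Example; batch=1 and hidden=6144 are assumptions matching Mixtral-8x22B.
x = torch.randn(1, 6144, dtype=torch.float16)
w = torch.ones(6144, dtype=torch.float16)
y = rmsnorm_ref(x, w)
```

The output of `rmsnorm_ref` can be compared elementwise against the sgl_kernel result on inputs that do not crash, to rule out a numerics problem versus a launch-configuration problem.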

Environment

Python: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA H100
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: cuda/gcc/11.3.1/12.4.1-r5e7ajh
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
