model(vlm): pixtral #5084

Merged
merged 10 commits into from
May 13, 2025

Conversation

KivenChen
Contributor

@KivenChen KivenChen commented Apr 5, 2025

Motivation

This PR proposes support for Pixtral (#2351) and removes blockers to accelerate future Mistral multimodal integrations, such as Mistral Small 3.1. Feedback and suggestions for improvement are welcome and appreciated.

Modifications

  • PixtralVisionModel – a leading vision model from Mistral AI and the multimodal backbone of Mistral Small 3.1, accelerated by VisionAttention and MultimodalInputs.
  • LlavaForConditionalGeneration – a flexible text‑vision backbone architecture compatible with pixtral‑12b and other multimodal models. It automatically determines the text and vision architectures from the provided config.
  • PixtralProcessor, as well as LlavaMultimodalProcessor, which auto-loads a processor for any LlavaForConditionalGeneration with a vision config.
  • Removed code unrelated to this use case.
  • Minor refactor of mm_utils to support multiple, arbitrary images in a single request.
  • Updated test_generation_models to bring in the latest models.
  • Examples, documentation, and benchmarks.
  • Added --disable-multimodal to the server args; multimodal is enabled by default.
  • Chunked prefill is now enabled for VLMs by default. Tested multi-image input on Qwen2.5-VL, Llama 4, and Pixtral; no issues identified so far.
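The mm_utils refactor above essentially normalizes stacked multi-image inputs into the per-image 4D tensors that a conv patch embedder accepts. A minimal, hypothetical sketch of that shape handling (the helper name and shapes are illustrative, not the actual sglang code):

```python
def normalize_image_shapes(shape):
    """Illustrative helper: split a stacked multi-image shape
    [B, N, C, H, W] into per-image 4D shapes [1, C, H, W]."""
    if len(shape) == 5:  # a batch carrying N images per request
        b, n, c, h, w = shape
        return [(1, c, h, w)] * (b * n)
    if len(shape) == 4:  # already a single batched image
        return [shape]
    raise ValueError(f"unexpected image tensor rank: {len(shape)}")

# e.g. a two-image request of 400x400 RGB images
per_image = normalize_image_shapes((1, 2, 3, 400, 400))
```

Without this kind of flattening, a 5D stack reaching conv2d directly would be rejected, since conv2d only accepts 3D (unbatched) or 4D (batched) input.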

Functionality Tests

  • text generation
  • "image" modalities
  • "multi-images" modalities (NOTE: generation is functional with normal outputs, but there are issues at the prefill-chunk layer: models may omit a previously cached mm item, producing answers that miscount images or mistake input images for previously added ones).
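One plausible mitigation for the cached-item confusion noted above is to key multimodal items by a content hash rather than by position, so the prefix cache cannot conflate two different images. A hypothetical sketch, not the actual sglang implementation:

```python
import hashlib

def mm_item_key(image_bytes: bytes) -> str:
    """Content-addressed key for a multimodal item, so two different
    images never collide in the prefix cache (illustrative only)."""
    return hashlib.sha256(image_bytes).hexdigest()[:16]

# distinct image bytes get distinct keys; identical bytes share one,
# which is what lets identical prefixes still hit the cache
k1 = mm_item_key(b"image-a")
k2 = mm_item_key(b"image-b")
```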

Preliminary Benchmark

peak decode throughput: 140-150 tps for all modalities
cuda device: A100x1
vram: 40gb
tested modalities: text / image / multi-image
server args: "REQUEST_TIMEOUT=15 python3 -m sglang.launch_server --model-path mistral-community/pixtral-12b --mem-fraction-static 0.7 --dtype bfloat16"
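For reference, a multi-image request against a server launched with the args above could look like the following OpenAI-compatible chat payload. The image URLs are placeholders and the payload shape assumes sglang's usual /v1/chat/completions endpoint; it is not part of the original PR:

```python
# Illustrative multi-image chat payload; image URLs are placeholders.
payload = {
    "model": "mistral-community/pixtral-12b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "How many images do you see?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.png"}},
        ],
    }],
    "max_tokens": 64,
}
```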

Checklist

@KivenChen KivenChen changed the title Add Pixtral HF Support (mistral‑community/pixtral‑12b) Add Pixtral HF Support Apr 5, 2025
@KivenChen KivenChen marked this pull request as ready for review April 5, 2025 20:10
@KivenChen KivenChen mentioned this pull request Apr 6, 2025
12 tasks
@KivenChen
Contributor Author

Multi-image modality is supported in the model implementation, but has yet to be integrated into the multimodal data pipelines (#4754). My private use-case branch has a minor refactor for this, but it doesn't touch the scheduler layer, so potential problems there are unknown.

Shall I remove the functionality tests for multi-image support for now?

@zhaochenyang20
Collaborator

I will ask Yi, Mick, and Yuhao to review this.

@yhyang201
Contributor

  1. I tried running the modified http_pixtral_generation_test.py, but encountered the following error, which might need to be addressed:
[2025-04-07 08:16:52 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2013, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 598, in event_loop_normal
    result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1385, in run_batch
    logits_output, next_token_ids = self.tp_worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 175, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1000, in forward
    return self.forward_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 961, in forward_extend
    return self.model.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/models/llava.py", line 749, in forward
    hidden_states = general_mm_embed_routine(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/mm_utils.py", line 352, in general_mm_embed_routine
    inputs_embeds = embed_mm_inputs(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/mm_utils.py", line 261, in embed_mm_inputs
    embedding, mask = get_embedding_and_mask(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/mm_utils.py", line 136, in get_embedding_and_mask
    embeddings = data_embedding_func(embedding_items)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llava.py", line 736, in get_image_feature
    return self.encode_images(pixel_values)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llava.py", line 133, in encode_images
    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/pixtral.py", line 497, in forward
    patch_embeds_list = [
  File "/sgl-workspace/sglang/python/sglang/srt/models/pixtral.py", line 498, in <listcomp>
    self.patch_conv(img.unsqueeze(0).to(self.dtype).to(self.device))
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/.python/sglang/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 2, 3, 400, 400]
  2. Also, I noticed that there are some debugging comments and print statements in python/sglang/srt/models/llava.py. Should these be removed?

@KivenChen
Contributor Author

KivenChen commented Apr 7, 2025

Hi @yhyang201, I just pushed the final commits that resolve the two problems. The PR description has also been updated to reflect additional modifications, especially multimodal support.

@yhyang201
Contributor

  1. When loading the model, the following error occurred:
  File "/root/sglang/kgl/python/sglang/srt/model_loader/loader.py", line 146, in _initialize_model
    return model_class(
  File "/root/sglang/kgl/python/sglang/srt/models/llava.py", line 718, in __init__
    vision_model_cls = self._get_sgl_model_cls(config.vision_config, AutoModel)
  File "/root/sglang/kgl/python/sglang/srt/models/llava.py", line 644, in _get_sgl_model_cls
    raise ValueError(
ValueError: AutoModel found a corresponding model `PixtralVisionModel` for config class `PixtralVisionConfig`, but failed to load it from SGLang ModelRegistry.

Could you help fix this issue?

  2. Could you also run the MMMU benchmark test?
    Link: https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu

@KivenChen
Contributor Author

KivenChen commented Apr 7, 2025

  1. When loading the model, the following error occurred:
  File "/root/sglang/kgl/python/sglang/srt/model_loader/loader.py", line 146, in _initialize_model
    return model_class(
  File "/root/sglang/kgl/python/sglang/srt/models/llava.py", line 718, in __init__
    vision_model_cls = self._get_sgl_model_cls(config.vision_config, AutoModel)
  File "/root/sglang/kgl/python/sglang/srt/models/llava.py", line 644, in _get_sgl_model_cls
    raise ValueError(
ValueError: AutoModel found a corresponding model `PixtralVisionModel` for config class `PixtralVisionConfig`, but failed to load it from SGLang ModelRegistry.

Could you help fix this issue?

  2. Could you also run the MMMU benchmark test?
    Link: https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu

Will share the benchmark after completion.

As for the AutoModel issue, it should now be fixed. I believe it was caused by incorrect or circular imports. This PR involves dynamic model loading, so I'll keep testing.
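The ModelRegistry failure above boils down to resolving a vision config class to a registered model class at load time; if the model module is never imported (e.g. because of a circular import), the registry entry is missing and the lookup fails. A simplified, hypothetical sketch of that resolution, with names mirroring the error message rather than the real registry code:

```python
MODEL_REGISTRY = {}

def register(cls):
    """Decorator that adds a model class to the registry at import time."""
    MODEL_REGISTRY[cls.__name__] = cls
    return cls

@register
class PixtralVisionModel:
    pass

def resolve_from_config(config_cls_name: str):
    """Map e.g. 'PixtralVisionConfig' to a registered 'PixtralVisionModel'.
    Raises if the model module was never imported, so nothing registered it."""
    model_name = config_cls_name.replace("Config", "Model")
    try:
        return MODEL_REGISTRY[model_name]
    except KeyError:
        raise ValueError(
            f"found model `{model_name}` for config `{config_cls_name}`, "
            "but failed to load it from the ModelRegistry"
        )
```

Under this scheme, a circular import that prevents the `@register` decorator from running reproduces exactly the "failed to load it from SGLang ModelRegistry" symptom.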

@KivenChen
Contributor Author

KivenChen commented Apr 8, 2025

Generation Model Test & MMMU results. @zhaochenyang20 @yhyang201

MMMU

pixtral-12b scored around 44 with a direct prompt, but Mistral AI claims its MMMU score reaches 50.9 with a CoT prompt.

Note: I extended the original mmmu/bench_sglang to optionally call /generate for the benchmark, because pixtral is a base model. Moreover, a known issue (#3304) was encountered when Mistral chat templates were applied. This change has been pushed to this branch.
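Calling /generate instead of the chat endpoint for a base model amounts to sending a raw prompt plus sampling params. A hedged sketch of such a request body, with field names following sglang's native generate API as I understand it; the prompt and image path are placeholders:

```python
# Illustrative /generate request body for a base (non-chat) model;
# prompt text and image path are placeholders, not from the PR.
generate_payload = {
    "text": "<image>\nDescribe the picture briefly.",
    "image_data": "/path/to/image.png",
    "sampling_params": {"max_new_tokens": 64, "temperature": 0.0},
}
```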

Benchmark Details
python3 -m sglang.launch_server --model-path /workspace/.cache/hub/mistral-community/pixtral-12b --mem-fraction-static 0.75 --tp-size 2 --dtype bfloat16 --log-level debug --log-requests --log-requests-level 2 --cuda-graph-max-bs 32 

# cuda device: A100 x 2. `tp=2` to support large images.
skipping 15 samples with large images, 1.67% of dataset
samples have been prepared
Processing samples: 100%|█████████████████████████████████████████████| 885/885 [04:11<00:00,  3.52it/s]
Benchmark time: 251.45550560951233
answers saved to: ./val_sglang.json
Evaluating...
{'Accounting': {'acc': 0.3, 'num': 30},
 'Agriculture': {'acc': 0.312, 'num': 16},
 'Architecture_and_Engineering': {'acc': 0.267, 'num': 30},
 'Art': {'acc': 0.533, 'num': 30},
 'Art_Theory': {'acc': 0.633, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.433, 'num': 30},
 'Biology': {'acc': 0.467, 'num': 30},
 'Chemistry': {'acc': 0.3, 'num': 30},
 'Clinical_Medicine': {'acc': 0.433, 'num': 30},
 'Computer_Science': {'acc': 0.4, 'num': 30},
 'Design': {'acc': 0.733, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.367, 'num': 30},
 'Economics': {'acc': 0.3, 'num': 30},
 'Electronics': {'acc': 0.333, 'num': 30},
 'Energy_and_Power': {'acc': 0.467, 'num': 30},
 'Finance': {'acc': 0.333, 'num': 30},
 'Geography': {'acc': 0.433, 'num': 30},
 'History': {'acc': 0.667, 'num': 30},
 'Literature': {'acc': 0.759, 'num': 29},
 'Manage': {'acc': 0.433, 'num': 30},
 'Marketing': {'acc': 0.433, 'num': 30},
 'Materials': {'acc': 0.333, 'num': 30},
 'Math': {'acc': 0.467, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.3, 'num': 30},
 'Music': {'acc': 0.4, 'num': 30},
 'Overall': {'acc': 0.438, 'num': 885},
 'Overall-Art and Design': {'acc': 0.575, 'num': 120},
 'Overall-Business': {'acc': 0.36, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.427, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.605, 'num': 119},
 'Overall-Science': {'acc': 0.407, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.347, 'num': 196},
 'Pharmacy': {'acc': 0.4, 'num': 30},
 'Physics': {'acc': 0.367, 'num': 30},
 'Psychology': {'acc': 0.367, 'num': 30},
 'Public_Health': {'acc': 0.5, 'num': 30},
 'Sociology': {'acc': 0.633, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.438

Generation Model Test

It turned out flakier than expected. Some default parameter sets didn't run on an A100 x 2 setup. This PR didn't touch the ops layer, so I'm not sure how I can help.

Test Detail
MODELS=("Qwen/Qwen2-1.5B" "Qwen/Qwen2.5-14B-Instruct" "HuggingFaceTB/SmolLM-135M-Instruct" "allenai/OLMo-1B-0724-hf" "THUDM/glm-4-9b-chat" "openai-community/gpt2" "microsoft/Phi-3-small-8k-instruct" "allenai/OLMo-2-1124-7B-Instruct" "ibm-granite/granite-3.0-2b-instruct" "mistral-community/pixtral-12b")
TOTAL=0
PASSED=0
FAILED=0

========================================================
Testing model: Qwen/Qwen2-1.5B
========================================================
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
srt_runner: model_path: /workspace/.cache/hub/Qwen/Qwen2-1.5B
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.13s/it]

hf_outputs.output_strs=[' red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is', ' ______.\nA. London\nB. Edinburgh\nC. Manchester\nD. Glasgow\n答案:\nA\n\n下列关于我国古代文学常识的表述,不', ' it. I have a lot of things to do today. I have to go to the bank to get my money. I have to go to the post office', ' the development of intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI is', 'The report provides a comparative analysis of five proposed congressional advisory commissions that would investigate various aspects of the COVID-19 pandemic. The five proposed commissions are found in']
srt_outputs.output_strs=[' red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is red. Banana is Yellow. Apple is', ' ______.\nA. London\nB. Edinburgh\nC. Manchester\nD. Glasgow\n答案:\nA\n\n下列关于我国古代文学常识的表述,不', ' it. I have a lot of things to do today. I have to go to the bank to get my money. I have to go to the post office', ' the development of intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI is', 'The report provides a comparative analysis of five proposed congressional advisory commissions that would investigate various aspects of the COVID-19 pandemic. The five proposed commissions are found in']
rouge_l_scores=[1.0, 1.0, 1.0, 1.0, 1.0]
prefill logprobs max_diff tensor(0.0469)
decode logprobs max_diff tensor(0.0469)
prefill logprobs max_diff tensor(0.0118)
decode logprobs max_diff tensor(0.0312)
prefill logprobs max_diff tensor(0.0130)
decode logprobs max_diff tensor(0.0190)
prefill logprobs max_diff tensor(0.0187)
decode logprobs max_diff tensor(0.0458)
prefill logprobs max_diff tensor(0.0814)
decode logprobs max_diff tensor(0.0312)
.
----------------------------------------------------------------------
Ran 1 test in 36.430s

OK
✅ Test PASSED for Qwen/Qwen2-1.5B
PASSED=1
TOTAL=1

========================================================
Testing model: Qwen/Qwen2.5-14B-Instruct
========================================================
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 8/8 [00:24<00:00,  3.08s/it]
srt_runner: model_path: /workspace/.cache/hub/Qwen/Qwen2.5-14B-Instruct
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:14<00:00,  1.79s/it]

hf_outputs.output_strs=[' London, which is also the largest city in the country. The city has a population of around 9 million people and is located in the southeast of England.', ' to go out for a walk. I have 2 hours of free time. If it takes me 15 minutes to get dressed and ready, and I', ' building smart machines capable of performing tasks that typically require human intelligence. AI systems can be trained to recognize patterns, make decisions, and learn from experience, making them']
srt_outputs.output_strs=[' London, which is also the largest city in the country. The city is located in the southeast of England, on the River Thames. London is a global city', " to go out and play. I have a lot of toys to play with, but I can't decide which one to choose. I have a ball, a", ' building smart machines capable of performing tasks that typically require human intelligence. AI is a broad field that encompasses many subfields, including machine learning, natural language processing,']
rouge_l_scores=[0.6910299003322259, 0.5248868778280543, 0.7277227722772277]
E
======================================================================
ERROR: test_others (test_generation_models.TestGenerationModels.test_others)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/kiv/sglang/python/sglang/srt/utils.py", line 1774, in retry
    return fn()
......
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 1 test in 77.292s

FAILED (errors=1)
❌ Test FAILED for Qwen/Qwen2.5-14B-Instruct
FAILED=1
TOTAL=2

========================================================
Testing model: HuggingFaceTB/SmolLM-135M-Instruct
========================================================
hf_outputs.output_strs=[' London, which is the largest city in the UK and the financial capital of the country.\n\n**Location:** London is situated in the southeastern part of the', ' to play outside. I\'m so excited to be a part of this adventure!\n\n**Theme:** "Wildlife Encounters"\n\n**Target', ' developing intelligent machines that can learn, reason, and interact with humans. It involves the design, development, testing, and evaluation of intelligent systems that can perform tasks']
srt_outputs.output_strs=[' London, which is the largest city in the UK and the financial capital of the country.\n\n**Location:** London is situated in the southeastern part of the', ' to play outside. I\'m so excited to be a part of this adventure!\n\n**Theme:** "Wildlife Encounters"\n\n**Target', ' developing intelligent machines that can learn, reason, and interact with humans. It involves the design, development, testing, and evaluation of intelligent systems that can perform tasks']
rouge_l_scores=[1.0, 1.0, 1.0]
prefill logprobs max_diff tensor(0.0224)
decode logprobs max_diff tensor(0.0620)
E
======================================================================
ERROR: test_others (test_generation_models.TestGenerationModels.test_others)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/kiv/sglang/python/sglang/srt/utils.py", line 1774, in retry
    return fn()
           ^^^^
  File "/workspace/kiv/sglang/python/sglang/test/test_utils.py", line 1020, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: decode logprobs are not all close with model_path=HuggingFaceTB/SmolLM-135M-Instruct prompts=['The capital of the United Kingdom is', 'Today is a sunny day and I like', 'AI is a field of computer science focused on'] decode_tolerance=0.05.hf_logprobs=tensor([[-3.0960e-01, -2.9737e+00, -4.0674e+00, -4.0987e+00, -4.3487e+00],
        [-4.3457e-01, -1.1377e+00, -4.5908e+00, -4.8564e+00, -5.9814e+00],
        ......
        [-3.0525e-01, -1.4615e+00, -3.8052e+00, -5.9302e+00, -6.1959e+00]]), srt_logprobs=tensor([[-3.0884e-01, -2.9729e+00, -4.0667e+00, -4.0979e+00, -4.3557e+00],
        [-4.3459e-01, -1.1377e+00, -4.5908e+00, -4.8565e+00, -5.9971e+00],
        ......
        [-1.2647e-04, -1.0297e+01, -1.0875e+01, -1.1250e+01, -1.1320e+01],
        [-3.0524e-01, -1.4615e+00, -3.8052e+00, -5.9302e+00, -6.1959e+00]])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/kiv/sglang/python/sglang/test/test_utils.py", line 1019, in _callTestMethod
    retry(
  File "/workspace/kiv/sglang/python/sglang/srt/utils.py", line 1777, in retry
    raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 1 test in 26.392s

FAILED (errors=1)
❌ Test FAILED for HuggingFaceTB/SmolLM-135M-Instruct
FAILED=2
TOTAL=3

========================================================
Testing model: allenai/OLMo-1B-0724-hf
========================================================
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.34s/it]
srt_runner: model_path: /workspace/.cache/hub/allenai/OLMo-1B-0724-hf

hf_outputs.output_strs=[' red. The sky is blue. Grass is green. The sun is yellow. The moon is white. Water is clear. Snow is white.', ' London, which is the capital city of England and the United Kingdom. London is a global city and one of the most important', ' playing with my friends. We can play tag, hide and seek, or maybe even build a fort. I\'m so excited!', ' building machines that can perform tasks that would normally require human intelligence. This includes things like learning, reasoning, problem-solving, perception, and language understanding.', ' provides a comprehensive analysis of the current state of the market, including trends, challenges, and opportunities. The report examines the key drivers of market growth and offers insights into the competitive landscape.']
srt_outputs.output_strs=[' red.\n\nThe sky is blue. Grass is green. The sun is yellow. The moon is white. Water is clear. Snow is white.', ' London, which is the capital of England and the United Kingdom. London is a major city with a rich history and diverse', " playing. I love to play games with my friends. My favorite games are hide and seek, tag, and Simon says. I'm", ' building machines that can think, learn, and make decisions like humans. This field encompasses various disciplines, including computer science, mathematics, linguistics, psychology,', ' provides a comprehensive analysis of the current state of the market for XYZ products. The report includes detailed information on market size,']

srt_logprobs = tensor([[-2.2465e-01, -3.5528e+00, -4.0450e+00],
        [-8.3630e-01, -1.3050e+00, -1.8050e+00],
        [-1.5826e+00, -1.8482e+00, -2.1529e+00],
......
        [-1.9357e-01, -2.2873e+00, -3.7717e+00],
        [-2.4740e-01, -2.0286e+00, -3.4193e+00]])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/kiv/sglang/python/sglang/test/test_utils.py", line 1019, in _callTestMethod
    retry(
......
Exception: retry() exceed maximum number of retries.

----------------------------------------------------------------------
Ran 1 test in 28.605s

FAILED (errors=1)
❌ Test FAILED for allenai/OLMo-1B-0724-hf
FAILED=3
TOTAL=4

========================================================
Testing model: THUDM/glm-4-9b-chat
========================================================
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 4/4 [00:18<00:00,  4.58s/it]
srt_runner: model_path: /workspace/.cache/hub/THUDM/glm-4-9b-chat

[2025-04-08 10:10:17 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/kiv/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
......
ValueError: Model architectures ['GlmForCausalLM'] are not supported for now. Supported architectures: dict_keys(['BaichuanForCausalLM', 'ChatGLMModel', 'CLIPModel', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'DbrxForCausalLM', 'DeepseekForCausalLM', 'MultiModalityCausalLM', 'DeepseekV3ForCausalLMNextN', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'DeepseekVL2ForCausalLM', 'ExaoneForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma2ForSequenceClassification', 'Gemma3ForCausalLM', 'Gemma3ForConditionalGeneration', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GraniteForCausalLM', 'Grok1ForCausalLM', 'Grok1ModelForCausalLM', 'InternLM2ForCausalLM', 'InternLM2ForRewardModel', 'LlamaForCausalLM', 'Phi3ForCausalLM', 'InternLM3ForCausalLM', 'Llama4ForCausalLM', 'LlamaForClassification', 'LlamaForCausalLMEagle', 'LlamaForCausalLMEagle3', 'LlamaEmbeddingModel', 'MistralModel', 'LlamaForSequenceClassification', 'LlamaForSequenceClassificationWithNormal_Weights', 'LlavaLlamaForCausalLM', 'LlavaQwenForCausalLM', 'LlavaMistralForCausalLM', 'LlavaForConditionalGeneration', 'LlavaVidForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'MiniCPMO', 'MiniCPMV', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MllamaForConditionalGeneration', 'Llama4ForConditionalGeneration', 'OlmoForCausalLM', 'Olmo2ForCausalLM', 'OlmoeForCausalLM', 'Phi3SmallForCausalLM', 'PixtralVisionModel', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2_5_VLForConditionalGeneration', 'Qwen2ForSequenceClassification', 'Qwen2ForCausalLMEagle', 'Qwen2MoeForCausalLM', 'Qwen2ForRewardModel', 'Qwen2VLForConditionalGeneration', 'StableLmForCausalLM', 'TorchNativeLlamaForCausalLM', 'TorchNativePhi3ForCausalLM', 'XverseForCausalLM', 'XverseMoeForCausalLM', 'YiVLForCausalLM'])

[2025-04-08 10:10:17] Received sigquit from a child process. It usually means the child failed.
❌ Test FAILED for THUDM/glm-4-9b-chat
FAILED=4
TOTAL=5

========================================================
Testing model: openai-community/gpt2
========================================================
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [348,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [348,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [348,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Process Process-1:
Traceback (most recent call last):
  File "/root/miniconda3/envs/kgl/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 829, in forward
    attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kgl/lib/python3.11/site-packages/transformers/modeling_attn_mask_utils.py", line 378, in _prepare_4d_causal_attention_mask_for_sdpa
    ignore_causal_mask = AttentionMaskConverter._ignore_causal_mask_sdpa(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/kgl/lib/python3.11/site-packages/transformers/modeling_attn_mask_utils.py", line 288, in _ignore_causal_mask_sdpa
    elif not is_tracing and torch.all(attention_mask == 1):
  File "/root/miniconda3/envs/kgl/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

❌ Test FAILED for microsoft/Phi-3-small-8k-instruct
FAILED=6
TOTAL=7

========================================================
Testing model: allenai/OLMo-2-1124-7B-Instruct
========================================================
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.41s/it]
srt_runner: model_path: /workspace/.cache/hub/allenai/OLMo-2-1124-7B-Instruct
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00,  4.25s/it]

hf_outputs.output_strs=[' London, England. London is the largest city in the UK and one of the most populous cities in Europe. It is also a major global city and financial center', ' to go out for a walk in the park. I usually walk for about 30 minutes, covering a distance of 5 kilometers. Today, I decide to', ' creating intelligent machines that can perform tasks that typically require human intelligence. This includes tasks suchassistant as visual perception, speech recognition, decision-making, and language translation.\n\n']
srt_outputs.output_strs=[' London, England. London is the largest city in the UK and one of the most populous cities in Europe. It is also a major global city and financial center', ' to go out for a walk in the park. I usually walk for about 30 minutes, covering a distance of 5 kilometers. Today, I decide to', ' creating intelligent machines that can perform tasks that typically require human intelligence. This includes tasks suchassistant as visual perception, speech recognition, decision-making, and language translation.\n\n']
rouge_l_scores=[1.0, 1.0, 1.0]
prefill logprobs max_diff tensor(0.0140)
decode logprobs max_diff tensor(0.0290)
prefill logprobs max_diff tensor(0.0153)
decode logprobs max_diff tensor(0.0146)
prefill logprobs max_diff tensor(0.0151)
decode logprobs max_diff tensor(0.0253)
.
----------------------------------------------------------------------
Ran 1 test in 59.573s

OK
✅ Test PASSED for allenai/OLMo-2-1124-7B-Instruct
PASSED=2
TOTAL=8

========================================================
Testing model: ibm-granite/granite-3.0-2b-instruct
========================================================
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.86s/it]
srt_runner: model_path: /workspace/.cache/hub/ibm-granite/granite-3.0-2b-instruct
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.22s/it]

hf_outputs.output_strs=[' London.\n\nThe capital of the United Kingdom is London.\n\nThe capital of the United Kingdom is London.\n\nThe capital of the', ' to go for a walk in the park. I can see the green grass, the colorful flowers, and the beautiful trees. I can hear', ' creating intelligent systems that can understand, learn, and make decisions based on data. Here are some key aspects of AI:\n\n1. **Machine Learning']
srt_outputs.output_strs=[' London.\n\nThe capital of the United Kingdom is London.\n\nThe capital of the United Kingdom is London.\n\nThe capital of the', ' to go for a walk in the park. I can see the green grass, the colorful flowers, and the beautiful trees. I can hear', ' creating intelligent systems that can understand, learn, and make decisions based on data. Here are some key aspects of AI:\n\n1. **Machine Learning']
rouge_l_scores=[1.0, 1.0, 1.0]
prefill logprobs max_diff tensor(0.0207)
decode logprobs max_diff tensor(0.0312)
prefill logprobs max_diff tensor(0.0156)
decode logprobs max_diff tensor(0.0312)
prefill logprobs max_diff tensor(0.0209)
decode logprobs max_diff tensor(0.0303)
.
----------------------------------------------------------------------
Ran 1 test in 38.153s

OK
✅ Test PASSED for ibm-granite/granite-3.0-2b-instruct
PASSED=3
TOTAL=9

The last one, mistral-community/pixtral-12b, is tested separately:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 6/6 [00:16<00:00,  2.74s/it]
srt_runner: model_path: /workspace/.cache/hub/mistral-community/pixtral-12b
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:12<00:00,  2.02s/it]

hf_outputs.output_strs=[' London. It is the largest city in the UK and is located in the southeast of England. London is a global city and has been a major center of finance', ' to go out and enjoy the sun. I like to go to the park and play with my friends. I like to go to the playground and play on the', ' creating intelligent machines that can perform tasks that typically require human intelligence. These tasks include speech recognition, decision-making, visual perception, and language translation, among others.\n\n']
srt_outputs.output_strs=[' London. It is the largest city in the UK and is located in the southeast of England. London is a global city and has been a major center of finance', ' to go out and enjoy the sun. I like to go to the park and play with my friends. I like to go to the playground and play on the', ' creating intelligent machines that can perform tasks that typically require human intelligence. These tasks include speech recognition, decision-making, visual perception, and language translation, among others.\n\n']
rouge_l_scores=[1.0, 1.0, 1.0]
prefill logprobs max_diff tensor(0.0199)
decode logprobs max_diff tensor(0.0291)
prefill logprobs max_diff tensor(0.0127)
decode logprobs max_diff tensor(0.0174)
prefill logprobs max_diff tensor(0.0155)
decode logprobs max_diff tensor(0.0312)
.
----------------------------------------------------------------------
Ran 1 test in 68.452s

OK
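The two metrics in the logs above compare HuggingFace and SGLang outputs. As a rough illustration (a minimal sketch, not the actual test harness — function names here are hypothetical), `rouge_l_scores` of 1.0 means the generated token sequences match exactly under the LCS-based ROUGE-L F-measure, and the `max_diff` lines report the largest absolute gap between corresponding logprobs:

```python
# Hedged sketch of the two comparison metrics printed in the test logs.
# rouge_l: LCS-based F-measure over whitespace tokens (1.0 = identical).
# max_logprob_diff: elementwise max absolute difference between logprob lists.

def lcs_len(a, b):
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(ref, hyp):
    r, h = ref.split(), hyp.split()
    lcs = lcs_len(r, h)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(h), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def max_logprob_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

print(rouge_l("the capital is London", "the capital is London"))  # identical -> 1.0
print(max_logprob_diff([-0.12, -3.40], [-0.14, -3.37]))
```

Small max_diff values like the ~0.02–0.03 seen above are expected from kernel-level numerical differences rather than divergent sampling.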

@KivenChen KivenChen requested a review from zhaochenyang20 April 8, 2025 18:26
@KivenChen KivenChen requested a review from zhaochenyang20 April 9, 2025 00:12
@zhaochenyang20
Collaborator

cc @yhyang201 could you take a look plz?

@KivenChen
Contributor Author

multimodal part just aligned

@KivenChen KivenChen requested a review from ch-wan as a code owner April 30, 2025 08:11
@KivenChen KivenChen force-pushed the kiv__p1xtral_hf branch 6 times, most recently from a786e5b to 49c87e7 on May 4, 2025 00:49
tinymitt and others added 10 commits May 13, 2025 05:55
revert less relevant

separate vlm tests

chore: mm data padding cls rename
fix

extend ci timeout

revert tle extension
refactor mm data hashing

Revert "refactor mm data hashing"

This reverts commit a8b29a81a2003e1b63d8cd32175330bef1d57170.

revert irrelevant
This reverts commit 46ea1868a44d9943e47514d2f73b7e9c6ec250d3.
@JustinTong0323
Collaborator

MMMU bench result:

{'Accounting': {'acc': 0.4, 'num': 30},
 'Agriculture': {'acc': 0.562, 'num': 16},
 'Architecture_and_Engineering': {'acc': 0.333, 'num': 30},
 'Art': {'acc': 0.667, 'num': 30},
 'Art_Theory': {'acc': 0.7, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.467, 'num': 30},
 'Biology': {'acc': 0.5, 'num': 30},
 'Chemistry': {'acc': 0.3, 'num': 30},
 'Clinical_Medicine': {'acc': 0.5, 'num': 30},
 'Computer_Science': {'acc': 0.4, 'num': 30},
 'Design': {'acc': 0.6, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.467, 'num': 30},
 'Economics': {'acc': 0.4, 'num': 30},
 'Electronics': {'acc': 0.267, 'num': 30},
 'Energy_and_Power': {'acc': 0.433, 'num': 30},
 'Finance': {'acc': 0.233, 'num': 30},
 'Geography': {'acc': 0.4, 'num': 30},
 'History': {'acc': 0.8, 'num': 30},
 'Literature': {'acc': 0.862, 'num': 29},
 'Manage': {'acc': 0.4, 'num': 30},
 'Marketing': {'acc': 0.4, 'num': 30},
 'Materials': {'acc': 0.233, 'num': 30},
 'Math': {'acc': 0.4, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.3, 'num': 30},
 'Music': {'acc': 0.433, 'num': 30},
 'Overall': {'acc': 0.463, 'num': 885},
 'Overall-Art and Design': {'acc': 0.6, 'num': 120},
 'Overall-Business': {'acc': 0.367, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.473, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.706, 'num': 119},
 'Overall-Science': {'acc': 0.4, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.347, 'num': 196},
 'Pharmacy': {'acc': 0.467, 'num': 30},
 'Physics': {'acc': 0.4, 'num': 30},
 'Psychology': {'acc': 0.6, 'num': 30},
 'Public_Health': {'acc': 0.467, 'num': 30},
 'Sociology': {'acc': 0.567, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.463
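The "Overall" figures in the MMMU result above are consistent with a sample-weighted average over the per-subject accuracies (the dict shape below is assumed from the printed output; the subset shown is illustrative, not the full benchmark):

```python
# Hedged sketch: reproduce an "Overall" accuracy as the num-weighted mean
# of per-subject accuracies, using a small subset of the results above.
results = {
    "Accounting": {"acc": 0.4, "num": 30},
    "Literature": {"acc": 0.862, "num": 29},
    "History": {"acc": 0.8, "num": 30},
}
total = sum(v["num"] for v in results.values())
overall = sum(v["acc"] * v["num"] for v in results.values()) / total
print(total, round(overall, 3))
```

Weighting by `num` matters because subjects contribute unequal sample counts (e.g. Literature has 29 items while most subjects have 30).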

@JustinTong0323 JustinTong0323 added the ready-to-merge The PR is ready to merge after the CI is green. label May 13, 2025
@zhyncs zhyncs merged commit 5380cd7 into sgl-project:main May 13, 2025
29 checks passed
@zhaochenyang20
Collaborator

great work!

@KivenChen KivenChen deleted the kiv__p1xtral_hf branch May 15, 2025 00:51
lifuhuang pushed a commit to lifuhuang/sglang that referenced this pull request May 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 23, 2025
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <[email protected]>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <[email protected]>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <[email protected]>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <[email protected]>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <[email protected]>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)

* simplify fused_moe config logging (sgl-project#5801)

* [CI] tune the test order to warmup the server (sgl-project#5860)

* Cutlass MLA decode - fix dtype error (sgl-project#5868)

* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)

* [Feature] support auto chat template (sgl-project#4949)

* Feat: support cuda graph for LoRA (sgl-project#4115)

Co-authored-by: Beichen Ma <[email protected]>

* Add qwen3 30b fused moe config (sgl-project#5859)

* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875)

Co-authored-by: pengcuo <[email protected]>

* Add A800 fused moe config for qwen3 30b (sgl-project#5880)

* [Misc] add service discovery for sgl router

* [fix]: PyO3 macOS linking and consolidate on tracing for logging

* chore: update Dockerfile (sgl-project#5894)

* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)

* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)

* chore: update CODEOWNERS (sgl-project#5895)

* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)

* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)

* Auto set draft model path for MTP (sgl-project#5793)

* [fix] relax mem_fraction_static for h200 (sgl-project#5893)

Co-authored-by: alcanerian <[email protected]>

* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)

* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)

* Add AMD MI300x Nightly Testing. (sgl-project#5861)

* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)

* Fix check_env script (sgl-project#5901)

* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)

* Bump Flashinfer to 0.2.5 (sgl-project#5870)

Co-authored-by: Yuhao Chen <[email protected]>

* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)

* Add A800 fused moe config for qwen3 235b (sgl-project#5900)

* Add sm_120 for blackwell (sgl-project#5903)

* [Feature] add support kimi vl model (sgl-project#5383)

Co-authored-by: wenju.li <[email protected]>

* support vlm benchmark profile (sgl-project#5905)

* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)

* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)

* [qwen3] support qwen3 ep moe (sgl-project#5917)

Co-authored-by: sleepcoo <[email protected]>

* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)

* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912)

Co-authored-by: zhyncs <[email protected]>

* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)

* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)

* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)

* [PP] Add pipeline parallelism (sgl-project#5724)

* Fix lora batch processing when input lora_path contains None (sgl-project#5930)

* add Thor & Spark (sgl-project#5915)

* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)

* fix: update model runner (sgl-project#5934)

* chore: bump v0.4.6.post2 (sgl-project#5939)

* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)

* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834)

Co-authored-by: luoyuan.luo <[email protected]>

* Remove extra contiguous (sgl-project#5953)

* Update ci test and doc for MTP api change (sgl-project#5952)

* docs: Fix Qwen model typo (sgl-project#5944)

Signed-off-by: JiangJiaWei1103 <[email protected]>

* Optimize a pad operation to accelerate 25us (sgl-project#5945)

* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)

* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)

* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)

* feat: Refactor DeepSeekV3 function call (sgl-project#5908)

* Remove token in token out in Native API (sgl-project#5967)

* Support InternVL3 (sgl-project#5350)

Co-authored-by: Mick <[email protected]>
Co-authored-by: Chayenne <[email protected]>

* Support MMMU benchmark for  InternVL (sgl-project#5968)

* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)

* Fix set kv cache multi-stream (sgl-project#5975)

* Overlap qk norm with two streams (sgl-project#5977)

* fix: only upgrade nccl for cu128 (sgl-project#5986)

* Fix Phi3 serving which was broke by earlier change (sgl-project#5991)

Co-authored-by: Lifu Huang <[email protected]>

* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)

* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)

* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012)

Signed-off-by: Lifu Huang <[email protected]>

* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)

* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)

* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)

* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)

* Update dev container config to support live code sync and improve docker setup guide   (sgl-project#6018)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] Optimize disaggregation ib device help info (sgl-project#5781)

* [Test] Add flashmla attention backend test (sgl-project#5587)

* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)

* feat: Add a unified merge_state API (sgl-project#5428)

* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)

* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)

* Fix prefill OOM error in the case of large page size (sgl-project#5081)

* Fix problem of large page size with chunked prefill (sgl-project#6046)

* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)

* docs: add new blog (sgl-project#6048)

* Fix not "import os" (sgl-project#6057)

* Better PD initialization (sgl-project#5751)

* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)

* [Fix] Fix and rename flashmla CI test (sgl-project#6045)

* chore: upgrade cutlass 3.9.2 (sgl-project#6004)

Co-authored-by: yizhang2077 <[email protected]>

* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)

* Add DeepEP to CI PR Test (sgl-project#5655)

Co-authored-by: Jinyan Chen <[email protected]>

* fix custom_allreduce namespace (sgl-project#6039)

* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010)

Co-authored-by: Qiaolin-Yu <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* [Feature] Support for Ascend NPU backend (sgl-project#3853)

Signed-off-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>

* Fix the timeout for 8 gpu tests (sgl-project#6084)

* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)

* Super tiny fix doc (sgl-project#5233)

* [Doc]Fix description for dp_size argument (sgl-project#6063)

* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)

* [refactor] slightly tidy fp8 module (sgl-project#5993)

* Clean up fa3 test from 8 gpus (sgl-project#6105)

* Deferring 8 GPU test (sgl-project#6102)

* Update doc for MLA attention backends (sgl-project#6034)

* Clean logs for DeepSeek-V3 launching (sgl-project#6079)

* [CI]Add performance CI for VLM (sgl-project#6038)

Signed-off-by: Xinyuan Tong <[email protected]>

* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)

* optimize pad operations in fa3 to accelarate 100+us (sgl-project#6077)

* Overlap shared expert and routed expert computations (sgl-project#5121)

* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)

* Tiny refactor weight loading logic (sgl-project#5232)

* [PD] Add control to slow down a server (sgl-project#5572)

* Change AMD test threshold (sgl-project#6091)

* DeepEP normal support deepgemm-contiguous (sgl-project#5626)

Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>

* [fix] fix pyproject.toml dependencies (sgl-project#6119)

* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764)

Co-authored-by: othame <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>

* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)

* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)

* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123)

Co-authored-by: zhyncs <[email protected]>

* upgrade xgrammar to 0.1.19 (sgl-project#6129)

* Remove unecessary is_fa3_supported check (sgl-project#6112)

* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)

* docs: update README (sgl-project#6132)

* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)

* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)

* opt flashinfer mla cat (sgl-project#5822)

Co-authored-by: xuyongfei.xyf <[email protected]>

* Update amd nightly concurrency. (sgl-project#6141)

* feat: add thinking_budget (sgl-project#6089)

* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)

* fix bug that gpu0 occupies more memory when hicache is turned on (sgl-project#5778)

Co-authored-by: Zhiqiang Xie <[email protected]>

* chore: bump v0.4.6.post3 (sgl-project#6165)

* KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016)

Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>

* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)

* Fix and Clean up chat-template requirement for VLM (sgl-project#6114)

Signed-off-by: Xinyuan Tong <[email protected]>

* [Docs]Delete duplicate content (sgl-project#6146)

Co-authored-by: ximing.wxm <[email protected]>

* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)

* Added async_encode method to Engine (sgl-project#4701)

* Fix data parallel perf regression (sgl-project#6183)

* Fix request abortion (sgl-project#6184)

* Add typo checker in pre-commit (sgl-project#6179)

Co-authored-by: Brayden Zhong <[email protected]>

* Remove duplicate IO Struct test (sgl-project#6180)

Signed-off-by: Emmanuel Ferdman <[email protected]>

* [PD] Add simple unit test for disaggregation feature (sgl-project#5654)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)

* feat: support loogle eval (sgl-project#6190)

* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)

* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)

* chore: upgrade deepgemm (sgl-project#6073)

* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)

* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196)

Co-authored-by: alcanderian <[email protected]>

* Handle empty input string for embedding models (sgl-project#5621)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)

* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)

* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)

* [CI] Reorganize the 8 gpu tests (sgl-project#6192)

* Add dev-deepep docker image (sgl-project#6198)

* Replace time.time() to time.perf_counter() for benchmarking. (sgl-project#6178)

Signed-off-by: Lifu Huang <[email protected]>

* Update README.md (sgl-project#6202)

* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)

* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)

* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)

* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558)

Co-authored-by: liusy58 <[email protected]>

* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)

* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201)

Co-authored-by: SangBin Cho <[email protected]>

* [CI] Fix PD mooncake dependency error (sgl-project#6212)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Re-enable pd disaggregation test (sgl-project#6231)

Signed-off-by: Shangming Cai <[email protected]>

* fix some typos (sgl-project#6209)

Co-authored-by: Brayden Zhong <[email protected]>

* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)

* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)

* Revert "fix some typos" (sgl-project#6244)

* chore: add hf_xet dep (sgl-project#6243)

* Update AMD nightly deps. (sgl-project#6241)

* [PD] Add support for different TP sizes per DP rank (sgl-project#5922)

Signed-off-by: Shangming Cai <[email protected]>

* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225)

Co-authored-by: SangBin Cho <[email protected]>

* fix typo (sgl-project#6248)

* Support tuning moe for llama 4 model (sgl-project#6042)

* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)

* [Llama4] Add docs note about enable multimodal (sgl-project#6235)

* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)

* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657)

Co-authored-by: liusy58 <[email protected]>
Co-authored-by: 颉沆 <[email protected]>

* model(vlm): pixtral (sgl-project#5084)

* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)

* Enable MI325X AMD CI. (sgl-project#6259)

* chore: bump v0.4.6.post4 (sgl-project#6245)

* formatting fix for the rebased commit for 4.6.0_post4

Signed-off-by: Mohit Sinha <[email protected]>

* fix issues in model runner and python packages

fix for following issues:
> vLLM dependency for xgrammar==0.1.17
> 'Scheduler' object has no attribute 'device
> 'pp_proxy_tensors' unexpected arg in HPUGraphRunner
> TODO: Add pipeline parallelism support in HPUGraphRunner

Signed-off-by: Mohit Sinha <[email protected]>

* fix formatting in model runner

Signed-off-by: Mohit Sinha <[email protected]>

* base grammar fix for the is_terminated case

>  'OutlinesGrammar' object has no attribute 'is_terminated'

Signed-off-by: Mohit Sinha <[email protected]>

---------

Signed-off-by: Kebe <[email protected]>
Signed-off-by: congcongke <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: Lifu Huang <[email protected]>
Signed-off-by: Song Zhang <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Emmanuel Ferdman <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Wenxuan Tan <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: vzed <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DavidBao <[email protected]>
Co-authored-by: Frankey_8080 <[email protected]>
Co-authored-by: Stefan He <[email protected]>
Co-authored-by: yan97ao <[email protected]>
Co-authored-by: aoshen524 <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: zhanweidu <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Huapeng Zhou <[email protected]>
Co-authored-by: NoahM <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: JiLi <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: PGFLMG <[email protected]>
Co-authored-by: sighingnow <[email protected]>
Co-authored-by: XTY <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Beichen Ma <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: alcanerian <[email protected]>
Co-authored-by: Yuhao Chen <[email protected]>
Co-authored-by: zhjunqin <[email protected]>
Co-authored-by: liwenju0 <[email protected]>
Co-authored-by: wenju.li <[email protected]>
Co-authored-by: laixin <[email protected]>
Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: 江家瑋 <[email protected]>
Co-authored-by: KCFindstr <[email protected]>
Co-authored-by: xm:D <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: Junrong Lin <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: Hank Han <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: ishandhanani <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Minglei Zhu <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>
Co-authored-by: Zhu Chen <[email protected]>
Co-authored-by: othame <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: Yixin Dong <[email protected]>
Co-authored-by: xu-yfei <[email protected]>
Co-authored-by: xuyongfei.xyf <[email protected]>
Co-authored-by: thyecust <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: Simon (Jiyou) Li <[email protected]>
Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Steven Shimizu <[email protected]>
Co-authored-by: applesaucethebun <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: Emmanuel Ferdman <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: alcanderian <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: liusy58 <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: 颉沆 <[email protected]>
Co-authored-by: Kiv Chen <[email protected]>
woodx9 pushed a commit to woodx9/sglang that referenced this pull request May 31, 2025
Labels
high priority ready-to-merge The PR is ready to merge after the CI is green. visIon-LM