Skip to content

Tiny fix DeepseekScalingRotaryEmbedding always use forward_native #5406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 15, 2025

Conversation

fzyzcjy
Copy link
Collaborator

@fzyzcjy fzyzcjy commented Apr 15, 2025

Motivation

Modifications

Checklist

@fzyzcjy fzyzcjy changed the title Fix DeepseekScalingRotaryEmbedding always use forward_native Tiny fix DeepseekScalingRotaryEmbedding always use forward_native Apr 15, 2025
@zhyncs
Copy link
Member

zhyncs commented Apr 15, 2025

cc @ispobock @Fridge003 @strgrb

@fzyzcjy May you paste the torch profile result

@zhyncs zhyncs added high priority bug Something isn't working labels Apr 15, 2025
@fzyzcjy
Copy link
Collaborator Author

fzyzcjy commented Apr 15, 2025

OK will run that

@fzyzcjy
Copy link
Collaborator Author

fzyzcjy commented Apr 15, 2025

image

@zhyncs zhyncs merged commit 15e91d7 into main Apr 15, 2025
34 of 38 checks passed
@zhyncs zhyncs deleted the fzyzcjy-patch-7 branch April 15, 2025 08:33
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
* Support with_stack and record_shapes in profiler (sgl-project#4740)

Co-authored-by: Lianmin Zheng <[email protected]>

* test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840)

* Fix CI tests (sgl-project#4853)

* Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855)

* Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863)

* [Feature] add multi-rank support for Lora (sgl-project#4492)

Co-authored-by: rudy152 <[email protected]>

* Clean up `import vllm` in quantization/__init__.py (sgl-project#4834)

* Fix wrong variable name when stopping memory profile (sgl-project#4772)

* [Feat] support deepgemm for cmake (sgl-project#4864)

* Make torch compile configurable for biased_grouped_topk (sgl-project#4749)

* update sgl-kernel test ci (sgl-project#4866)

* fix sampling issue (sgl-project#4871)

* bump sgl-kernel 0.0.5.post4 (sgl-project#4768)

* fix sgl-kernel cu118 build (sgl-project#4872)

* [Feature] Support FA3 backend for MLA (sgl-project#4831)

* upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873)

* update torch compile doc (sgl-project#4874)

* bump v0.4.4.post3 (sgl-project#4878)

* Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882)

* Improve stack trace of retry errors (sgl-project#4845)

* Tiny fix doc error (sgl-project#4795)

* [Docs] Update DeepGEMM at README.md (sgl-project#4886)

* Update CODEOWNERS (sgl-project#4889)

* Delete test_deep_gemm.py (sgl-project#4891)

* Add deepseek style fused moe group gate selection kernel (sgl-project#4530)

* quick fix: add default for new kernel (sgl-project#4898)

* remove setup for sgl-kernel (sgl-project#4899)

* [Misc] Clean m.def and add Development Tips (sgl-project#4890)

* fix allreduce test (sgl-project#4909)

* Support page size > 1 + eagle (sgl-project#4908)

* Fix retract for page size > 1 (sgl-project#4914)

* [Feature] use pytest for sgl-kernel (sgl-project#4896)

* fix bmm fp8 (sgl-project#4926)

* Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927)

* Fix 2-gpu CI test and suppress some warnings (sgl-project#4930)

* [feat] add fa3 in sgl-kernel (sgl-project#4902)

Co-authored-by: Sleepcoo <[email protected]>

* Fix sglang frontend's incorrect dependency on torch (sgl-project#4931)

* [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932)

* cleanup sgl-kernel (sgl-project#4933)

* [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925)

* Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883)

Co-authored-by: ch-wan <[email protected]>

* [Fix] Add torch compile for torch.clamp back (sgl-project#4936)

* Fix oom error for large page size (sgl-project#4913)

Co-authored-by: Lianmin Zheng <[email protected]>

* [feat] interface for platforms abstraction (sgl-project#4928)

* [Fix] revert clean m.def for cudagraph (sgl-project#4944)

* refactor: multimodal data (sgl-project#4754)

* bump sgl-kernel v0.0.6 (sgl-project#4950)

* [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953)

* use fa3 in sgl-kernel (sgl-project#4954)

* Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959)

* [Feature] Support DeepEP Low Latency (sgl-project#4767)

Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: laixinn <[email protected]>
Co-authored-by: ch-wan <[email protected]>

* update bench_serving (sgl-project#4958)

* Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977)

* [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915)

Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: GeLee <[email protected]>

* Large page size aligned hierarchical caching (sgl-project#4581)

* bug fix for hicache host eviction (sgl-project#4989)

* sgl scaled_fp8_quant support output padding (sgl-project#4861)

* Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951)

Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: zcnrex <[email protected]>

* Update tokenizer_manager.py (sgl-project#5008)

* [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817)

* update cutlass tag (sgl-project#5011)

* Feature/revise docs ci (sgl-project#5009)

* fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727)

Co-authored-by: yuethe <[email protected]>

* [Build] Support build sgl-kernel with ccache (sgl-project#5020)

* fix deepgemm as well (sgl-project#5030)

* try to fix ci oserror (sgl-project#5024)

* Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005)

* Small refactor DeepEPMode to clean up code a bit (sgl-project#4992)

* [Fix] fix fa3 build at cu118 (sgl-project#5036)

* Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048)

* bump sgl-kernel v0.0.7 (sgl-project#5046)

* update eagle-3 docs (sgl-project#4796)

Co-authored-by: Yifan Zhang <[email protected]>

* Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* Update the retry count (sgl-project#5051)

* upgrade sgl-kernel v0.0.7 (sgl-project#5049)

* [2/3] fix dsv3 awq issue  (sgl-project#4625)

Co-authored-by: 晟海 <[email protected]>
Co-authored-by: laixinn <[email protected]>

* Feature/revise docs ci (sgl-project#5056)

* Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057)

* [fix] remove `cuda_device_count_stateless` (sgl-project#5060)

* Small refactor DeepEPDispatcher into subclasses (sgl-project#4994)

* Support async DeepEP by splitting into two stages (sgl-project#4995)

* Cleanup unused resources after DeepEP operation (sgl-project#4996)

* Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918)

* [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072)

* fix dummy-load deepseekv2 (sgl-project#4535)

* support sgl-kernel on blackwell (sgl-project#5074)

* FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050)

Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: Chunan Zeng <[email protected]>

* [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052)

* upgrade transformers 4.51.0 (sgl-project#5088)

* sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079)

* bump sgl-kernel 0.0.8 (sgl-project#5089)

* python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080)

* bump v0.4.4.post4 (sgl-project#5091)

* Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097)

Co-authored-by: shuaills <[email protected]>

* Add Llama4 support (sgl-project#5092)

Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: ispobock <[email protected]>

* Fix refactor error - fp8.py (sgl-project#5106)

Co-authored-by: Lianmin Zheng <[email protected]>

* bump v0.4.5 (sgl-project#5117)

* [ci] fix llama4 ci error (sgl-project#5126)

* Refactor and Optimize FA3 Code (sgl-project#5090)

Co-authored-by: Qingquan Song <[email protected]>

* Add Llama4 user guide (sgl-project#5133)

Co-authored-by: Cheng Wan <[email protected]>

* [Misc] Use pytest.mark.skipif in sgl-kernel test (sgl-project#5137)

* feat: disable grammar restrictions within reasoning sections (sgl-project#4984)

Co-authored-by: tianhaoyu <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>

* [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method (sgl-project#5145)

* [AMD] Fix missing per_token_group_quant_fp8 for ROCm (sgl-project#5140)

* fix multimodal hash feature (sgl-project#5083)

* Fix run time error in ROCm platform (sgl-project#5147)

Co-authored-by: wunhuang <[email protected]>
Co-authored-by: root <[email protected]>

* [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (sgl-project#5103)

* Add unit test on page_size > 1 and mla and  integration test for Flash Attention 3 (sgl-project#4760)

* Use public model for FA3 speculative decode testing (sgl-project#5152)

* Add dummy grok test to amd CI. (sgl-project#5115)

* fix empty_cache error in pt_weights_iterator (sgl-project#5151)

Co-authored-by: dangkai.dk <[email protected]>

* Fix torch compile errors (sgl-project#5158)

* Fix loading KV quantization scale; Enable modelopt kv cache (sgl-project#4686)

Co-authored-by: qingquansong <[email protected]>

* [PD] Fix unclosed prefill connection warning of mini_lb (sgl-project#5155)

Signed-off-by: Shangming Cai <[email protected]>

* Add optimized native kernels in sgl-kernel (sgl-project#5150)

Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: blzheng <[email protected]>

* [PD] Simplify mini LB (sgl-project#4911)

Co-authored-by: Liangsheng Yin <[email protected]>

* Small improvement of native api docs (sgl-project#5139)

Co-authored-by: zhaochenyang20 <[email protected]>

* [feat&refactor] Enhance multimodal input support with refactor io_struct (sgl-project#4938)

Signed-off-by: Xinyuan Tong <[email protected]>

* Support 2x8xH100 for Llama 4 (sgl-project#5159)

* FP4 weight loading and inference (2/2) (sgl-project#3972)

* Fix multimodal hashing error (sgl-project#5174)

* Tiny disable model that does not work (sgl-project#5175)

* [Bugfix] Fix index out of bounds in local attention with large sequences (sgl-project#5173)

* [Fix] DeepEP Compatibility with Low Latency (sgl-project#5068)

Co-authored-by: ch-wan <[email protected]>

* docs: remove the use of Downward API for LWS_WORKER_INDEX (sgl-project#5110)

Signed-off-by: Kay Yan <[email protected]>

* feat: add DeepGEMM build warning (sgl-project#5176)

Co-authored-by: grimoire <[email protected]>

* fix: use DeepEPDispatcher on CUDA (sgl-project#5180)

* [DeepEP] fix: import buffer error (sgl-project#5179)

* Let `bench_one_batch` support `enable_dp_attention` (sgl-project#4058)

* [Misc] clean up vllm in sgl-kernel test (sgl-project#5189)

* Fix ci test "test_eval_fp8_accuracy" failed (sgl-project#5185)

Co-authored-by: wunhuang <[email protected]>

* Optimize topk operation in llama4 (sgl-project#5128)

* Support Llama4 fp8 inference (sgl-project#5194)

Co-authored-by: laixinn <[email protected]>
Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: zhyncs <[email protected]>

* [ci] fix ci test fused_moe op (sgl-project#5102)

* model: support mllama4 (sgl-project#5144)

* update grok test (sgl-project#5171)

* sgl-kernel use cutlass latest version for fp8 blockwise gemm (sgl-project#5207)

* Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5196)

* fix: log warning when disable cuda graph (sgl-project#5209)

* [metrics] Add in queue metrics (sgl-project#4444)

* Fix DeepSeek error when using DeepEP mode (sgl-project#5190)

* reduce moe_align_block_size_kernel small batch mode overhead (sgl-project#5086)

* [PD] Support KV transfer with mooncake (sgl-project#4880)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: shangmingc <[email protected]>

* [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (sgl-project#5204)

* Update deps for mllama4 (sgl-project#5215)

* Fix deepseek-v3 with torch.compile in PyTorch 2.6. (sgl-project#5213)

* ROCm sgl-kernel: compatible to later torch (sgl-project#5167)

* [Misc] Clean sgl-kernel test  (sgl-project#5216)

* Update Makefile / build script to avoid installing incompatible torch dependency (sgl-project#5245)

* Fix torch.compile cacheing (sgl-project#5259)

Co-authored-by: zhyncs <[email protected]>

* ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (sgl-project#5228)

* Optimize attention in llama4 (sgl-project#5127)

* Optimize GPU memory usage in FlashAttentionBackend's strided indexing (sgl-project#5262)

Co-authored-by: ch-wan <[email protected]>

* Support `--enable-llama4-multimodal` (sgl-project#5254)

* [fix] fix mrope positions not picked up (sgl-project#5265)

* doc: nested loop code for offline engine (sgl-project#5244)

* fix: examples for token_in_token_out_vlm  (sgl-project#5193)

* Fix a 404 link in send_request.ipynb (sgl-project#5280)

Signed-off-by: windsonsea <[email protected]>

* fix: enable fp4 compilation on cu128 (sgl-project#5286)

* feat: add cu128 identifier for sgl-kernel (sgl-project#5287)

* chore: relax the torch version restriction for sgl-kernel compilation (sgl-project#5288)

* chore: bump sgl-kernel v0.0.8.post1 (sgl-project#5289)

* [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (sgl-project#5292)

* [Docs] Supported Model Docs - Major restructuring (sgl-project#5290)

Co-authored-by: zhaochenyang20 <[email protected]>

* fix: update update_wheel_index for cu128 (sgl-project#5300)

* [Docs] Remove the older supported docs section (sgl-project#5301)

* remove moe_align_block_size torch.zeros in small batch/expert mode (sgl-project#5298)

* feat: add blackwell Dockerfile (sgl-project#5302)

* feat: add blackwell workflow (sgl-project#5303)

* fix: use fa3 unit test on hopper only (sgl-project#5304)

* misc: update blackwell Dockerfile (sgl-project#5306)

* fix: remove cublas_grouped_gemm (sgl-project#5307)

* fix: update flash attn (sgl-project#5308)

* fix: use deepgemm only on hopper (sgl-project#5310)

* [VLM] Adopt fast image processor by default (sgl-project#5065)

* Adjust ci test threshold (sgl-project#5271)

* Blackwell Cutlass MLA kernel (sgl-project#5142)

* misc: cleanup 3rdparty (sgl-project#5311)

* update variable naming and comments for rocm (sgl-project#5299)

* Fix w8a8_int8 model shared experts fusion load weights error (sgl-project#5120)

* Add flash_attn_varlen_func to sgl-kernel (sgl-project#5315)

* Fix fa3 window size setup (sgl-project#5316)

* chore: bump sgl-kernel v0.0.8.post2 (sgl-project#5317)

* feat: use fa3 mla by default on hopper (sgl-project#5210)

Co-authored-by: yundai424 <[email protected]>
Co-authored-by: hebiao064 <[email protected]>

* Fix: docs/backend/structured_outputs.ipynb (sgl-project#4884)

* Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (sgl-project#5321)

* refine fused_moe tuning docs (sgl-project#5294)

* Support server based rollout in Verlengine (sgl-project#4848)

Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Jinn <[email protected]>

* [Feat] Add sparse attn to sgl-kernel (sgl-project#5327)

* fix: solve cu118 issue for cutlass mla (sgl-project#5331)

* chore: bump sgl-kernel v0.0.8.post3 (sgl-project#5332)

* ci: update release node (sgl-project#5333)

* fix: determine if flashinfer is installed (sgl-project#5336)

* feat: adapt merge_state (sgl-project#5337)

* misc: update sagemaker Dockerfile (sgl-project#5341)

* Fix: Ensure tensors for dist.broadcast match NCCL backend device (sgl-project#5322)

* docs: update adoption and sponsorship list with Oracle (sgl-project#5343)

* chore: upgrade sgl-kernel 0.0.8.post3 (sgl-project#5342)

* Fix typo: infight -> inflight (sgl-project#5357)

* [PD] Add transfer backend abstraction (sgl-project#5328)

* fix MLATokenToKVPoolHost get_size_per_token bug (sgl-project#5161)

Co-authored-by: AniZpZ <[email protected]>

* fix sgl-project#5322 (sgl-project#5359)

* feat: update experiment_runner (sgl-project#5360)

* [DeepEP] Reduce routed scaling overhead (sgl-project#5277)

Co-authored-by: Cheng Wan <[email protected]>

* Free metadata_buffer_index after transfer finished (sgl-project#5364)

* Free metadata_buffer_index after transfer finished (sgl-project#5364)

* Fix DeepSeek DP Attention + torch compile (sgl-project#5367)

Co-authored-by: ispobock <[email protected]>

* Support for Qwen2.5-VL Model in bitsandbytes Format (sgl-project#5003)

* Fix PD disaggregation bugs (sgl-project#5326)

* [PD Bug] fix  MLA get_contiguous_buf_infos error (sgl-project#5384)

* [perf] experimental enhance fp8 per-tensor quant (sgl-project#5370)

* Apply deepseek cuda rope (sgl-project#5385)

Co-authored-by: Yineng Zhang <[email protected]>

* apply fused moe gate in ds v3/r1 (sgl-project#5371)

Co-authored-by: Yineng Zhang <[email protected]>

* fix: update test config (sgl-project#5392)

* [Fix] Turn off DeepGEMM by default (sgl-project#5263)

* minor clean up of sgl-kernel/CMakeLists.txt (sgl-project#5393)

* Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5368)

* Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5291)

Co-authored-by: ximing.wxm <[email protected]>

* [fix/misc] remove duplicate row in deepseek v2 model (sgl-project#5279)

* chore: upgrade DeepGEMM (sgl-project#5395)

* fix: update pr-test-sgl-kernel (sgl-project#5399)

* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)

* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)

* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)

* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)

* Fix bench_serving with random-ids (sgl-project#5214)

* [misc] fix ci flaky case (sgl-project#5352)

* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)

* Support dynamic connection and TP 16 (sgl-project#5351)

Co-authored-by: luoyuan.luo <[email protected]>

* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)

* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: ybyang <[email protected]>

* Distinguish bootstrap key only in decode server (sgl-project#5422)

* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)

* [minor] cleanup cmakelists.txt (sgl-project#5420)

* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)

* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)

* fix: solve release issue (sgl-project#5434)

* BLackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)

* feat: update model_specific_adjustment (sgl-project#5344)

Co-authored-by: hebiao064 <[email protected]>

* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)

* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)

* add attention backend supporting matrix in the doc (sgl-project#5211)

Co-authored-by: Stefan He <[email protected]>

* Support BNB quantization for llama/mllama (sgl-project#5038)

Co-authored-by: Yuhao Yang <[email protected]>

* [Docs] Update start/install.md (sgl-project#5398)

* [Minor] Move torch.compile patch to a better place (sgl-project#5397)

* [Bug fix] need record start time in pd mode (sgl-project#5425)

* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)

* chore: bump v0.4.5.post1 (sgl-project#5445)

* Revert "[SW-226289] rebase sglang to tag v0.4.5 (sgl-project#12)"

This reverts commit 0eac714.

---------

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Juwan Yoo <[email protected]>
Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: chaobo jia <[email protected]>
Co-authored-by: rudy152 <[email protected]>
Co-authored-by: Fr4nk1in <[email protected]>
Co-authored-by: yinfan98 <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: Sleepcoo <[email protected]>
Co-authored-by: SEPLOS <[email protected]>
Co-authored-by: ch-wan <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: laixinn <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: GeLee <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: zcnrex <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: renxin <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: yuethe <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Yifan Zhang <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
Co-authored-by: 晟海 <[email protected]>
Co-authored-by: Tommy Yang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: tianhaoyu <[email protected]>
Co-authored-by: DarkSharpness <[email protected]>
Co-authored-by: Yun Dai <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: wunhuang <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DangKai <[email protected]>
Co-authored-by: dangkai.dk <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: Ma Mingfei <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: zhaochenyang20 <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: grimoire <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: Zhaoyang Hao <[email protected]>
Co-authored-by: Teng Ma <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: Zhaoyi Li <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: tianlian yi <[email protected]>
Co-authored-by: Jin Pan <[email protected]>
Co-authored-by: Jinn <[email protected]>
Co-authored-by: yulei <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: mRSun15 <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuhao Yang <[email protected]>
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 16, 2025
* fix: update pr-test-sgl-kernel (sgl-project#5399)

* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)

* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)

* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)

* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)

* Fix bench_serving with random-ids (sgl-project#5214)

* [misc] fix ci flaky case (sgl-project#5352)

* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)

* Support dynamic connection and TP 16 (sgl-project#5351)

Co-authored-by: luoyuan.luo <[email protected]>

* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)

* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: ybyang <[email protected]>

* Distinguish bootstrap key only in decode server (sgl-project#5422)

* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)

* [minor] cleanup cmakelists.txt (sgl-project#5420)

* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)

* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)

* fix: solve release issue (sgl-project#5434)

* BLackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)

* feat: update model_specific_adjustment (sgl-project#5344)

Co-authored-by: hebiao064 <[email protected]>

* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)

* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)

* add attention backend supporting matrix in the doc (sgl-project#5211)

Co-authored-by: Stefan He <[email protected]>

* Support BNB quantization for llama/mllama (sgl-project#5038)

Co-authored-by: Yuhao Yang <[email protected]>

* [Docs] Update start/install.md (sgl-project#5398)

* [Minor] Move torch.compile patch to a better place (sgl-project#5397)

* [Bug fix] need record start time in pd mode (sgl-project#5425)

* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)

* chore: bump v0.4.5.post1 (sgl-project#5445)

* Fix several minor issues in PD disaggregation (sgl-project#5444)

* [doc] Update benchmark_and_profiling.md (sgl-project#5449)

* Update cutlass dependency. (sgl-project#5447)

* add multi-lora feature in README.md (sgl-project#5463)

* Clean up imports (sgl-project#5467)

* [verl] Modify the update_weights func to align with verl's resharding (sgl-project#5345)

Co-authored-by: Chayenne <[email protected]>

* [Model Support] unsloth/Phi-4-mini bnb model (sgl-project#4982)

Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* Update attention_backend.md: plural form (sgl-project#5489)

* Add test for flash_attn_varlen_func kernel (sgl-project#5484)

* Deprecate disable-mla (sgl-project#5481)

* Deprecate enable-flashinfer-mla and enable-flashmla (sgl-project#5480)

* Feat/support encoder model (like bert) (sgl-project#4887)

* Enable local attention during decode (sgl-project#5479)

* Refactor DeepSeek decoder layer branches (sgl-project#5205)

* Fix a link in sgl-kernel/README.md (sgl-project#5493)

* [Bug fix] use correct func path in deepseek (sgl-project#5496)

Signed-off-by: Xuchun Shang <[email protected]>

* Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the unstability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (sgl-project#5503)

* [Feat] Update sgl-kernel flashinfer to latest main version (sgl-project#5500)

Co-authored-by: zhyncs <[email protected]>

* Fix: Incorrect parameters passed to forward_batch_generation (sgl-project#5506) (sgl-project#5511)

* Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (sgl-project#5426)

Co-authored-by: ocss884 <[email protected]>

* [docs] Fix several consistency issues in sampling_params.md (sgl-project#5373)

Signed-off-by: windsonsea <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>

* Configuration qwen2_moe.py - qkv_bias now in transformers (sgl-project#5512)

* Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (sgl-project#4836)

* Sgl kernel fused_moe_gate support n_shared_experts (sgl-project#5440)

* chore: bump sgl-kernel 0.0.9.post2 (sgl-project#5518)

* use sglang_per_token_group_quant_fp8 from sgl-kernel instead of trion kernel (sgl-project#5473)

Co-authored-by: Zhang Kaihong <[email protected]>

* fix kimi vl running bug after rebase main (sgl-project#5461)

* fix bug of VLLM_AVAILABLE not defined (sgl-project#5497)

* Avoid computing lse in Ragged Prefill when there's no prefix. (sgl-project#5476)

Co-authored-by: Baizhou Zhang <[email protected]>

* [Model] Adding Qwen3 and Qwen3MoE (sgl-project#4693)

* fix util import (sgl-project#5542)

* Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (sgl-project#5544)

* chore: upgrade sgl-kernel 0.0.9.post2 (sgl-project#5540)

* Fix DeepGEMM masked cannot be run on groups not being multiple or 4 (sgl-project#5340)

* Make profiler output file names consistent (sgl-project#5548)

* [PD] Tiny fix timeout error when generate (sgl-project#5545)

* [PD] Fix no cache connect for recevier (sgl-project#5534)

* feat: use flashinfer jit package (sgl-project#5547)

* [PD] Remove the requirement of config file for mooncake backend  (sgl-project#5460)

* restruct compressed_tensors_w8a8_fp8 (sgl-project#5475)

* simplify the control logic for using shared experts fusion (sgl-project#5504)

* Remove one kernel in per_tensor_quant_mla_fp8 (sgl-project#5549)

* Fix sampler nan check when calling top_k_top_p_sampling_from_probs (sgl-project#5546)

* [PD] Support page size > 1 (sgl-project#5561)

* fix hicache write back (sgl-project#5543)

* Minor update for ROCm variable style (sgl-project#5562)

* Fix bench_one_batch producing unnatural results for expert parallel (sgl-project#5149)

* [perf] introduce deep gemm group_gemm_masked as bmm (sgl-project#5432)

* [PD] Fix DeepSeek cannot be run on latest master (sgl-project#5568)

* Fix BumpAllocator error when no input_ids (sgl-project#5564)

* enable DeepSeek V3 shared_experts_fusion in sm90 (sgl-project#5571)

* [Fix] fix outlines and xgrammar (sgl-project#4947)

* [Doc]Add instruction for profiling with bench_one_batch (sgl-project#5581)

* Release v0.4.5.post2 (sgl-project#5582)

* Fix bench_serving fail when zero warmup requests (sgl-project#5574)

* Fix DeepEP cannot run on latest master (sgl-project#5567)

* Fix torch memory saver not enabled in DP scenario (sgl-project#5560)

* Super tiny fix typo (sgl-project#5559)

* Add document for LoRA serving (sgl-project#5521)

* Tiny improve error message (sgl-project#5526)

* [PD] Fix server crash when using batch requests (sgl-project#5531)

* [Feat] upgrade pytorch2.6 (sgl-project#5417)

* Fix enable chunked prefill for Llama4 (sgl-project#5575)

* fix: use fa3 for gemma2 (sgl-project#5586)

* Fix ChatCompletionMessageGenericParam to allow for None content (sgl-project#5452)

* [PD] Fix large page size + chunk prefill (sgl-project#5588)

* Add test config yamls for Deepseek v3 (sgl-project#5433)

* [Feature] Prefill assistant response - add continue_final_message parameter (sgl-project#4226)

Co-authored-by: Chayenne <[email protected]>

* add function call parser for DeepSeek V3 (sgl-project#5224)

* smaller and non gated models for docs (sgl-project#5378)

* Feat: Implement JSON Mode (response_format.type="json_object") (sgl-project#4733)

Co-authored-by: Kyle Pena <[email protected]>

* check marlin format before attempting conversion (sgl-project#4675)

* compressed_tensors: port w8a16 fp8 from vllm (sgl-project#4852)

* Fix one more issue reported by torchfix (sgl-project#4859)

* Add sanity check for max_running_requests (sgl-project#5016)

* Correct grafana heatmap. (sgl-project#5019)

* Perform Batch Tokenization. (sgl-project#5141)

* Speedup shared expert weight construction by avoid cloning (sgl-project#5188)

* Tiny add Engine.flush_cache API (sgl-project#5241)

* [misc] remove is_cuda_available (sgl-project#5319)

* Fix flush cache (sgl-project#5590)

* Add Speculative Decoding Eagle3 topk > 1 (sgl-project#5318)

Co-authored-by: Stefan He <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>

* upstream hicache fixes (sgl-project#5570)

* Tiny add warning when cannot recognize bool env var (sgl-project#5348)

* Modify metrics service endpoint (sgl-project#3443)

* Update protocol.py to fix sgl-project#4589 (sgl-project#4590)

* [Feat.] Enable grafana to show metrics (sgl-project#4718)

Co-authored-by: zhaochenyang20 <[email protected]>

* [Fix] Enhance DP Attention for IPv6 Compatibility (sgl-project#4937)

* Support o1 model on Azure (sgl-project#4980)

Co-authored-by: Shan Yu <[email protected]>

* Tiny remove duplicated code (sgl-project#5021)

* Tiny update error hint (sgl-project#5037)

* Support PD bootstrap fields on /v1/chat/completions endpoint (sgl-project#5488)

* [PD] Fix generate endpoint of min_lb for PD (sgl-project#5598)

Signed-off-by: Shangming Cai <[email protected]>

* [PD] Fix edge case and simplify large page size + chunked prefill (sgl-project#5589)

* [PD] Add NIXL transfer backend  (sgl-project#5477)

* [PD] Support decode overlap schedule (sgl-project#5608)

* [PD] Support prefill overlap + Ensure no race condition (sgl-project#5609)

* Enhance GPU memory settings (sgl-project#5604)

* [feature] enable pre compile jit deep_gemm (sgl-project#5580)

* Clean up mem settings (sgl-project#5610)

* Support aiter RMSNorm in AMD (sgl-project#5510)

Co-authored-by: JieXin Liang <[email protected]>

* chore: bump v0.4.5.post3 (sgl-project#5611)

* Remove extra copy in deepseek forward absorb (sgl-project#5578)

Co-authored-by: saienduri <[email protected]>

* [Doc] Fix a 404 link to llama-405b (sgl-project#5615)

Signed-off-by: windsonsea <[email protected]>

* [fix] force use deepgemm in compile_deep_gemm (sgl-project#5618)

* [fix] fix compile_deep_gemm missing kv_b_proj (sgl-project#5620)

* fix: gemma 3 not use softcap (sgl-project#5622)

* Fix FA3 DeepSeek prefill performance regression (sgl-project#5624)

Co-authored-by: ispobock <[email protected]>

* [NFC] Remove duplicate `compressed-tensors` (sgl-project#5640)

* Fix shared experts fusion error without quantization (sgl-project#5632)

* [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (sgl-project#5641)

Co-authored-by: yuethe <[email protected]>

* fix flashmla bug (sgl-project#5272)

* [fix] reduce dp capture bs (sgl-project#5634)

Co-authored-by: alcanerian <[email protected]>

* Remove q concat in FA3 backend for DeepSeek decode (sgl-project#5638)

* Revert "Support aiter RMSNorm in AMD" (sgl-project#5646)

* fix: update bench_speculative (sgl-project#5649)

* Turn on DeepGemm By Default and Update Doc (sgl-project#5628)

* Fuse q_a_proj and kv_a_proj (sgl-project#5619)

* Remove unnecessary `torch.full` in DeepSeek (sgl-project#5601)

* [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (sgl-project#5281)

* fix sgl-kernel unit tests (sgl-project#5666)

* fix awq_dequantize import (sgl-project#5669)

* Integrating PD disaggregation with DP attention and DeepEP (sgl-project#5435)

Co-authored-by: Byron Hsu <[email protected]>

* fix gemma3 unit test (sgl-project#5670)

* fix torchvision::nms not exist (sgl-project#5671)

* [PD] Add support for dp attention with mooncake (sgl-project#5530)

Signed-off-by: Shangming Cai <[email protected]>

* tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (sgl-project#5677)

* [Doc] Fix two 404 links caused by sglang typo (sgl-project#5667)

Signed-off-by: windsonsea <[email protected]>

* fix: update truss bench_serving (sgl-project#5683)

* fix: only compile ApplyTokenBitmaskInplace cu124+ (sgl-project#5686)

* chore: bump sgl-kernel 0.1.0 (sgl-project#5688)

* vlm: enable radix cache for qwen-vl models (sgl-project#5349)

Co-authored-by: Xinyuan Tong <[email protected]>

* [BugFix] Fix combination of MTP and `--n-share-experts-fusion`with R1 (sgl-project#5707)

* Fix weight loading bug for Deepseek v3+nextn (sgl-project#5684)

* Add example to use sgl engine with fastapi (sgl-project#5648)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* [Doc] Fix a link to Weilin Zhao (sgl-project#5706)

Signed-off-by: windsonsea <[email protected]>

* Add MMMU benchmark results (sgl-project#4491)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) (sgl-project#5078)

Co-authored-by: vincent-4 <[email protected]>

* [PD] Better logs (sgl-project#5715)

* [PD] Add kvargs table and thread pool for kvcache sender of mooncake (sgl-project#5738)

Signed-off-by: Shangming Cai <[email protected]>

* [PD]: Support Muti Prefill in one node (sgl-project#5704)

Co-authored-by: shuaills <[email protected]>

* Fix: deepseek forward absorb (sgl-project#5723)

Co-authored-by: ispobock <[email protected]>

* Pin torch audio to 2.6.0 (sgl-project#5750)

* Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" (sgl-project#5754)

* Disable flaky eagle tests (sgl-project#5753)

* update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (sgl-project#5740)

* [Docs] Update runtime/engine/readme.md (sgl-project#5737)

Signed-off-by: windsonsea <[email protected]>

* Reorder loop in shared expert weight loading (sgl-project#5719)

* fix: fix one more bug from merging mm_inputs (sgl-project#5718)

Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>

* [Fix]: support deepseek-vl2-tiny model (sgl-project#5552)

Co-authored-by: bppps <[email protected]>

* Bugfix for minicpmo vision test (sgl-project#5760)

* [Minor] fix documentations (sgl-project#5756)

* Add an assertion to enhance the robustness of the operator (sgl-project#5736)

* fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (sgl-project#5733)

* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <[email protected]>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <[email protected]>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <[email protected]>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <[email protected]>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <[email protected]>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* fix for hpu backend in model runner and server args

Signed-off-by: Mohit Sinha <[email protected]>

* rebase formatting issue

Signed-off-by: Mohit Sinha <[email protected]>

* [SW-228218]: Fix device mismatch in frequency penalty.

Ensure tensors in BatchedFrequencyPenalizer are on the same device by
moving output_ids and frequency_penalties to the device of
cumulated_frequency_penalties. This resolves a RuntimeError
caused by tensors on cpu and hpu:0 during logits subtraction.

---------

Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Xuchun Shang <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Kebe <[email protected]>
Signed-off-by: congcongke <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Zhaoyang Hao <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: mRSun15 <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuhao Yang <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: BearBiscuit <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: eigen <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Didier Durand <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: PGFLMG <[email protected]>
Co-authored-by: u4lr451 <[email protected]>
Co-authored-by: ocss884 <[email protected]>
Co-authored-by: Michael Feil <[email protected]>
Co-authored-by: strgrb <[email protected]>
Co-authored-by: Zhang Kaihong <[email protected]>
Co-authored-by: liwenju0 <[email protected]>
Co-authored-by: Wenxuan Tan <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: Zhaoyi Li <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: AmadeusW <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: Yi Zhou <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: kyle-pena-kuzco <[email protected]>
Co-authored-by: Kyle Pena <[email protected]>
Co-authored-by: Enrique Shockwave <[email protected]>
Co-authored-by: Juwan Yoo <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: mac0ne <[email protected]>
Co-authored-by: Sundara Raman Ramachandran <[email protected]>
Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: moontidef <[email protected]>
Co-authored-by: Huapeng Zhou <[email protected]>
Co-authored-by: Lucius <[email protected]>
Co-authored-by: Chuyue Sun <[email protected]>
Co-authored-by: Shan Yu <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: michael-amd <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: ispobock <[email protected]>
Co-authored-by: Connector Switch <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: yuethe <[email protected]>
Co-authored-by: alcanerian <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: vincent-4 <[email protected]>
Co-authored-by: IAN <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: ZXN <[email protected]>
Co-authored-by: bppps <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Kyungmin Lee <[email protected]>
Co-authored-by: vzed <[email protected]>
Co-authored-by: DavidBao <[email protected]>
Co-authored-by: Frankey_8080 <[email protected]>
Co-authored-by: yan97ao <[email protected]>
Co-authored-by: aoshen524 <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: zhanweidu <[email protected]>
Co-authored-by: NoahM <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: JiLi <[email protected]>
Co-authored-by: sighingnow <[email protected]>
Co-authored-by: XTY <[email protected]>
Co-authored-by: vikram singh shekhawat <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants