Add Cutlass MLA attention backend #5390

Merged: 6 commits from mla-backend-upstream into sgl-project:main on Apr 28, 2025

Conversation

@trevor-m (Collaborator) commented Apr 14, 2025

Motivation

Enables use of the Blackwell CUTLASS MLA decode kernel with DeepSeek models.

Modifications

Adds a "cutlass_mla" option for the attention backend.
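
For illustration, here is a minimal sketch of how a string option like "cutlass_mla" can dispatch to a backend class. The registry and class below are hypothetical, not this PR's actual wiring; the constraint values in the comments (page_size 128, fp16/bf16 queries) come from this thread.

from typing import Callable, Dict

# Hypothetical registry mapping --attention-backend strings to backend classes.
ATTENTION_BACKENDS: Dict[str, Callable[[], object]] = {}

def register_backend(name: str):
    def decorator(cls):
        ATTENTION_BACKENDS[name] = cls
        return cls
    return decorator

@register_backend("cutlass_mla")
class CutlassMLABackend:
    # Decode-only MLA backend built on the Blackwell CUTLASS kernel.
    # Constraints reported in this thread: page_size must be 128 and the
    # query tensor must be fp16 or bf16.
    PAGE_SIZE = 128

def create_attention_backend(name: str) -> object:
    try:
        return ATTENTION_BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown attention backend: {name!r}")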

DeepSeek-R1 Benchmarks

python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 --tp 8 --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --enable-dp-attention --attention-backend cutlass_mla --dtype float16 --dp 8 --page-size 128
python3 -m sglang.bench_serving --backend sglang --model deepseek-ai/DeepSeek-R1 --num-prompts 3000 --dataset-name random --random-input-len 1000 --random-output-len 1000 --random-range-ratio 1

Using --attention-backend cutlass_mla

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     3000
Benchmark duration (s):                  574.31
Total input tokens:                      3000000
Total generated tokens:                  3000000
Total generated tokens (retokenized):    2990164
Request throughput (req/s):              5.22
Input token throughput (tok/s):          5223.67
Output token throughput (tok/s):         5223.67
Total token throughput (tok/s):          10447.34
Concurrency:                             2991.95
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   572768.28
Median E2E Latency (ms):                 572918.73
---------------Time to First Token----------------
Mean TTFT (ms):                          134177.24
Median TTFT (ms):                        133475.88
P99 TTFT (ms):                           261181.99
---------------Inter-Token Latency----------------
Mean ITL (ms):                           439.04
Median ITL (ms):                         309.17
P95 ITL (ms):                            392.38
P99 ITL (ms):                            607.01
Max ITL (ms):                            249723.06
==================================================

Baseline --attention-backend triton

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     3000
Benchmark duration (s):                  729.28
Total input tokens:                      3000000
Total generated tokens:                  3000000
Total generated tokens (retokenized):    2984072
Request throughput (req/s):              4.11
Input token throughput (tok/s):          4113.67
Output token throughput (tok/s):         4113.67
Total token throughput (tok/s):          8227.35
Concurrency:                             2988.77
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   726545.79
Median E2E Latency (ms):                 727793.85
---------------Time to First Token----------------
Mean TTFT (ms):                          157159.71
Median TTFT (ms):                        155781.05
P99 TTFT (ms):                           309518.97
---------------Inter-Token Latency----------------
Mean ITL (ms):                           569.97
Median ITL (ms):                         411.70
P95 ITL (ms):                            482.26
P99 ITL (ms):                            714.08
Max ITL (ms):                            302479.69
==================================================
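
For reference, relative to the triton baseline these results correspond to roughly 27% higher total token throughput (10447.34 vs. 8227.35 tok/s), a ~21% shorter benchmark duration (574.31 s vs. 729.28 s), and ~23% lower mean ITL (439.04 ms vs. 569.97 ms).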

Usage

python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 --tp 8 --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --enable-dp-attention --attention-backend cutlass_mla --dtype float16 --dp 8 --page-size 128

curl -s http://localhost:30000/v1/chat/completions   -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Which number is bigger 9.9 or 9.11?"}]}'

Output:

{"id":"c9268950ad3d4defbef1ad3bc2de60d7","object":"chat.completion","created":1744672138,"model":"deepseek-ai/DeepSeek-R1","choices":[{"index":0,"message":{"role":"assistant","content":"First, I need to compare the two numbers: 9.9 and 9.11.\n\nTo make the comparison easier, I'll align their decimal places by writing them with the same number of decimal places. This means writing 9.9 as 9.90 and 9.11 remains as 9.11.\n\nNow, I'll compare the whole number parts first. Both numbers have the same whole number part, which is 9.\n\nNext, I'll compare the tenths place. In 9.90, the tenths digit is 9, and in 9.11, the tenths digit is 1. Since 9 is greater than 1, 9.90 is greater than 9.11.\n\nTherefore, 9.9 is greater than 9.11.\n</think>\n\nTo compare the numbers \\(9.9\\) and \\(9.11\\):\n\n1. **Align the Decimal Places:**\n   - \\(9.90\\)  \n   - \\(9.11\\)\n\n2. **Compare Digit by Digit:**\n   - **Whole Number Part:** Both have the same whole number part (\\(9\\)).\n   - **Tenths Place:** \\(9\\) (from \\(9.90\\)) vs. \\(1\\) (from \\(9.11\\)).\n\nSince \\(9 > 1\\), \\(9.90 > 9.11\\).\n\n\\[\n\\boxed{9.9 \\text{ is bigger than } 9.11}\n\\]","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":19,"total_tokens":326,"completion_tokens":307,"prompt_tokens_details":null}}

@zhyncs self-assigned this on Apr 14, 2025
@trevor-m force-pushed the mla-backend-upstream branch from 85afe63 to 99183b8 on Apr 14, 2025
@trevor-m changed the title from "Add Cutlass MLA attention backend" to "Draft: Add Cutlass MLA attention backend" on Apr 15, 2025
@hebiao064 (Collaborator) commented

Would you mind sharing some benchmarks on latency and accuracy compared with FA3 and FlashInfer?

Thanks

@trevor-m force-pushed the mla-backend-upstream branch from 99183b8 to 54400f8 on Apr 23, 2025
@trevor-m changed the title from "Draft: Add Cutlass MLA attention backend" to "Add Cutlass MLA attention backend" on Apr 23, 2025
@trevor-m (Collaborator, Author) commented

Would you mind sharing some benchmarks on latency and accuracy compared with FA3 and FlashInfer?

Thanks

Hi @hebiao064, cutlass_mla is a Blackwell kernel, while FA3 and FlashInfer are for Hopper, so they cannot be compared directly. I've added some benchmarks comparing against triton to the PR description.
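
(As a rough illustration of the hardware gating, a check along the lines below could decide whether the kernel is usable; the compute-capability policy here is an assumption for illustration, not code from this PR.)

import torch

def supports_cutlass_mla() -> bool:
    # CUTLASS MLA decode targets Blackwell (SM 10.x), while FA3 and the
    # FlashInfer MLA kernels target Hopper (SM 9.x). Gating on compute
    # capability >= 10 is an illustrative assumption.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10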

@trevor-m (Collaborator, Author) commented

@zhyncs I fixed the problem with CUDA graphs and added some benchmark results in the description, so this should be ready now.

@zhyncs (Member) commented Apr 23, 2025

Hi @trevor-m I'll help review and test today.

@merrymercy added the ready-to-merge label ("The PR is ready to merge after the CI is green.") on Apr 27, 2025
@zhyncs merged commit 84810da into sgl-project:main on Apr 28, 2025
0 of 10 checks passed
WineChord pushed a commit to WineChord/sglang that referenced this pull request on Apr 28, 2025
@trevor-m (Collaborator, Author) commented Apr 29, 2025

@zhyncs @merrymercy After rebasing, there is an error. I bisected it and found it comes from #5578:

  File "/trevor/sglang/python/sglang/srt/models/deepseek_v2.py", line 632, in forward
    return self.forward_absorb(
           ^^^^^^^^^^^^^^^^^^^^
  File "/trevor/sglang/python/sglang/srt/models/deepseek_v2.py", line 741, in forward_absorb
    attn_output = self.attn_mqa(q, k, k_nope, forward_batch)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/trevor/sglang/python/sglang/srt/layers/radix_attention.py", line 97, in forward
    return forward_batch.attn_backend.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/trevor/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 68, in forward
    return self.forward_decode(
           ^^^^^^^^^^^^^^^^^^^^
  File "/trevor/sglang/python/sglang/srt/layers/attention/cutlass_mla_backend.py", line 270, in forward_decode
    o = cutlass_mla_decode(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/attention.py", line 85, in cutlass_mla_decode
    assert q_nope_and_q_pe.dtype in (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: q_nope_and_q_pe.dtype needs to be fp16 or bf16 but got torch.float32.
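
(For context, the assertion fires because the query reached the kernel in fp32 after the rebase. A defensive downcast like the sketch below would avoid the crash, though the proper fix is to keep the tensor in half precision upstream; this is illustrative, not the patch that landed.)

import torch

def ensure_mla_query_dtype(q_nope_and_q_pe: torch.Tensor) -> torch.Tensor:
    # cutlass_mla_decode asserts its query is fp16 or bf16 (see the traceback
    # above); downcasting here sidesteps the AssertionError at the cost of
    # hiding the upstream dtype regression.
    if q_nope_and_q_pe.dtype == torch.float32:
        return q_nope_and_q_pe.to(torch.bfloat16)
    return q_nope_and_q_pe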

@yessenzhar commented

We are running the sglang:blackwell container from Docker Hub, with the latest FlashInfer installed from source inside the container. We started the sglang server with the following command:
python3 -m sglang.launch_server --model-path /data/deepseek-ai--DeepSeek-R1/ --tp 8 --trust-remote-code --port 8081 --max-running-requests=256 --enable-metrics --enable-cache-report --host 0.0.0.0 --page-size 128 --attention-backend cutlass_mla --enable-dp-attention --dtype float16 --dp 8

The server crashes with the following logs:

Cutlass MLA only supports a page_size of 128, change page_size to 128.
DP attention is enabled. The chunked prefill size is adjusted to 1024 to avoid MoE kernel issues. 
[2025-05-13 23:21:30] server_args=ServerArgs(model_path='/data/weights/vllm-deepseek-ai--DeepSeek-R1/001', tokenizer_path='/data/weights/vllm-deepseek-ai--DeepSeek-R1/001', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=True, dtype='float16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='/data/weights/vllm-deepseek-ai--DeepSeek-R1/001', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=8081, mem_fraction_static=0.8994758915570003, max_running_requests=256, max_total_tokens=None, chunked_prefill_size=1024, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=0.3, cpu_offload_gb=0, page_size=128, tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=1059134139, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=True, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=True, reasoning_parser=None, dp_size=8, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='cutlass_mla', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=True, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
[2025-05-13 23:21:30] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:42 DP0 TP0] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:42 DP0 TP0] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:42 DP0 TP0] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:42 DP0 TP0] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:42 DP0 TP0] Init torch distributed begin.
[2025-05-13 23:21:43 DP7 TP7] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:43 DP7 TP7] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:43 DP7 TP7] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:43 DP7 TP7] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:43 DP7 TP7] Init torch distributed begin.
[2025-05-13 23:21:44 DP6 TP6] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP3 TP3] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP2 TP2] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP1 TP1] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP4 TP4] Casting torch.bfloat16 to torch.float16.
[W513 23:21:44.953759357 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[2025-05-13 23:21:44 DP5 TP5] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP3 TP3] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP3 TP3] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP3 TP3] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP3 TP3] Init torch distributed begin.
[2025-05-13 23:21:44 DP6 TP6] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP6 TP6] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP6 TP6] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP6 TP6] Init torch distributed begin.
[2025-05-13 23:21:44 DP2 TP2] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP2 TP2] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP2 TP2] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP2 TP2] Init torch distributed begin.
[2025-05-13 23:21:44 DP1 TP1] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP1 TP1] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP1 TP1] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP1 TP1] Init torch distributed begin.
[2025-05-13 23:21:44 DP4 TP4] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP4 TP4] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP4 TP4] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP4 TP4] Init torch distributed begin.
[2025-05-13 23:21:44 DP5 TP5] Casting torch.bfloat16 to torch.float16.
[2025-05-13 23:21:44 DP5 TP5] MLA optimization is turned on. Use cutlass_mla backend.
[2025-05-13 23:21:44 DP5 TP5] Disable chunked prefix cache when page size > 1.
[2025-05-13 23:21:44 DP5 TP5] Init torch distributed begin.
[W513 23:21:46.787549932 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.798009709 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.814273174 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.828043637 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.831776039 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.863906461 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W513 23:21:46.871817242 ProcessGroupNCCL.cpp:1009] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-05-13 23:21:46 DP1 TP1] sglang is using nccl==2.26.2
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-05-13 23:21:46 DP2 TP2] sglang is using nccl==2.26.2
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-05-13 23:21:46 DP7 TP7] sglang is using nccl==2.26.2
[2025-05-13 23:21:46 DP6 TP6] sglang is using nccl==2.26.2
[2025-05-13 23:21:46 DP3 TP3] sglang is using nccl==2.26.2
[2025-05-13 23:21:46 DP0 TP0] sglang is using nccl==2.26.2
[2025-05-13 23:21:46 DP4 TP4] sglang is using nccl==2.26.2
[2025-05-13 23:21:46 DP5 TP5] sglang is using nccl==2.26.2
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-05-13 23:28:26 DP6 TP6] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP6 TP6] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:26 DP4 TP4] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP2 TP2] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP5 TP5] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP0 TP0] Init torch distributed ends. mem usage=2.38 GB
[2025-05-13 23:28:26 DP4 TP4] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:26 DP3 TP3] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP7 TP7] Init torch distributed ends. mem usage=1.88 GB
[2025-05-13 23:28:26 DP5 TP5] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:26 DP0 TP0] Load weight begin. avail mem=175.37 GB
[2025-05-13 23:28:26 DP2 TP2] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:26 DP3 TP3] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:26 DP7 TP7] Load weight begin. avail mem=175.87 GB
[2025-05-13 23:28:26 DP1 TP1] Init torch distributed ends. mem usage=2.50 GB
[2025-05-13 23:28:26 DP1 TP1] Load weight begin. avail mem=175.25 GB
[2025-05-13 23:28:27 DP7 TP7] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP7 TP7] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP6 TP6] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP6 TP6] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP5 TP5] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP5 TP5] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP3 TP3] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP3 TP3] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP4 TP4] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP4 TP4] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP1 TP1] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP1 TP1] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP2 TP2] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP2 TP2] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
[2025-05-13 23:28:27 DP0 TP0] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-05-13 23:28:27 DP0 TP0] Deepseek V3/R1 with fp8 can use shared experts fusion optimization when SM version >=90. Shared experts fusion optimization is enabled.
Loading safetensors checkpoint shards:   0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   9% Completed | 15/163 [00:00<00:01, 143.73it/s]
Loading safetensors checkpoint shards:  18% Completed | 30/163 [00:00<00:00, 138.65it/s]
Loading safetensors checkpoint shards:  27% Completed | 44/163 [00:00<00:00, 131.46it/s]
Loading safetensors checkpoint shards:  36% Completed | 58/163 [00:00<00:01, 79.34it/s]
Loading safetensors checkpoint shards:  45% Completed | 74/163 [00:00<00:00, 97.72it/s]
Loading safetensors checkpoint shards:  54% Completed | 88/163 [00:00<00:00, 108.11it/s]
Loading safetensors checkpoint shards:  63% Completed | 102/163 [00:00<00:00, 116.06it/s]
Loading safetensors checkpoint shards:  71% Completed | 116/163 [00:01<00:00, 121.49it/s]
Loading safetensors checkpoint shards:  80% Completed | 130/163 [00:01<00:00, 124.95it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 20188.35it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 17190.99it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 20420.52it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 35292.27it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 18512.26it/s]
Loading safetensors checkpoint shards:  88% Completed | 144/163 [00:01<00:00, 80.41it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 23932.08it/s]
Loading safetensors checkpoint shards:  98% Completed | 160/163 [00:01<00:00, 95.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:01<00:00, 103.64it/s]

Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 37374.35it/s]
Cloning 8 replicas of the shared expert into MoE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 23481.62it/s]
[2025-05-13 23:29:26 DP3 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:26 DP4 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:28 DP5 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:32 DP7 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=83.33 GB, mem usage=92.54 GB.
[2025-05-13 23:29:32 DP6 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:33 DP0 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.83 GB, mem usage=92.54 GB.
[2025-05-13 23:29:37 DP2 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:42 DP1 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.float16, avail mem=82.70 GB, mem usage=92.54 GB.
[2025-05-13 23:29:42 DP2 TP2] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP2 TP2] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP1 TP1] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP1 TP1] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP3 TP3] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP3 TP3] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP4 TP4] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP4 TP4] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP5 TP5] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP6 TP6] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP5 TP5] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP6 TP6] Memory pool end. avail mem=17.49 GB
[2025-05-13 23:29:42 DP7 TP7] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP7 TP7] Memory pool end. avail mem=18.12 GB
[2025-05-13 23:29:42 DP0 TP0] KV Cache is allocated. #tokens: 992896, KV size: 64.99 GB
[2025-05-13 23:29:42 DP0 TP0] Memory pool end. avail mem=17.62 GB
2025-05-13 23:29:42,625 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,625 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,626 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,626 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,626 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,626 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,626 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-05-13 23:29:42,638 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-13 23:29:42 DP5 TP5] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:42 DP6 TP6] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:42 DP7 TP7] Capture cuda graph begin. This can take up to several minutes. avail mem=17.62 GB
[2025-05-13 23:29:42 DP4 TP4] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:42 DP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.12 GB
[2025-05-13 23:29:42 DP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:42 DP3 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:42 DP2 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=17.00 GB
[2025-05-13 23:29:43 DP0 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (avail_mem=17.03 GB):   0%|                                                                                                                                                                                                                      | 0/35 [00:00<?, ?it/s][2025-05-13 23:29:43 DP5 TP5] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP4 TP4] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP3 TP3] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP7 TP7] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP6 TP6] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP0 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP1 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:43 DP2 TP2] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=2112,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP5 TP5] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP3 TP3] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP6 TP6] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP4 TP4] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP7 TP7] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP0 TP0] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP1 TP1] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:44 DP2 TP2] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=24576,K=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
[2025-05-13 23:29:45 DP6 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP7 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP4 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP3 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP5 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP0 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP2 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP1 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP7 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP5 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP6 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP4 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP3 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP0 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP2 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP1 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4608,K=7168,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP7 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP6 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP5 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP4 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP3 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP2 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP0 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:45 DP1 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=7168,K=2304,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:29:46 DP7 TP7] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP5 TP5] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP6 TP6] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP4 TP4] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP2 TP2] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP3 TP3] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP0 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-05-13 23:29:46 DP1 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (avail_mem=13.96 GB):  97%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏     | 34/35 [00:34<00:01,  1.22s/it][2025-05-13 23:30:19 DP2 TP2] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP5 TP5] Registering 4305 cuda graph addresses
Capturing batches (avail_mem=13.96 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:36<00:00,  1.03s/it]
[2025-05-13 23:30:19 DP3 TP3] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP6 TP6] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP1 TP1] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP4 TP4] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP0 TP0] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP7 TP7] Registering 4305 cuda graph addresses
[2025-05-13 23:30:19 DP2 TP2] Capture cuda graph end. Time elapsed: 36.80 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP3 TP3] Capture cuda graph end. Time elapsed: 36.81 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP4 TP4] Capture cuda graph end. Time elapsed: 36.81 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP7 TP7] Capture cuda graph end. Time elapsed: 36.82 s. mem usage=3.19 GB. avail mem=14.43 GB.
[2025-05-13 23:30:19 DP6 TP6] Capture cuda graph end. Time elapsed: 36.82 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP5 TP5] Capture cuda graph end. Time elapsed: 36.82 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP1 TP1] Capture cuda graph end. Time elapsed: 36.82 s. mem usage=3.19 GB. avail mem=13.81 GB.
[2025-05-13 23:30:19 DP0 TP0] Capture cuda graph end. Time elapsed: 36.82 s. mem usage=3.19 GB. avail mem=13.93 GB.
[2025-05-13 23:30:19 DP4 TP4] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP2 TP2] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP0 TP0] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP5 TP5] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP3 TP3] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP6 TP6] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP7 TP7] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:19 DP1 TP1] max_total_num_tokens=992896, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=32, context_len=163840
[2025-05-13 23:30:20] INFO:     Started server process [301]
[2025-05-13 23:30:20] INFO:     Waiting for application startup.
[2025-05-13 23:30:20] INFO:     Application startup complete.
[2025-05-13 23:30:20] INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
[2025-05-13 23:30:21] INFO:     127.0.0.1:37890 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-13 23:30:21 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP2 TP2] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP6 TP6] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP1 TP1] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP4 TP4] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP5 TP5] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP3 TP3] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:30:21 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-13 23:30:24,977 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:24,999 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,043 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,096 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,101 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,343 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,350 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:25,411 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-13 23:30:38,682 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP4 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,703 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP0 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,748 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP1 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,769 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP3 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,801 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP5 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,858 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP2 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,909 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP6 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
2025-05-13 23:30:38,966 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_192_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-13 23:30:38 DP7 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-05-13 23:30:43] INFO:     127.0.0.1:37894 - "POST /generate HTTP/1.1" 200 OK
[2025-05-13 23:30:43] The server is fired up and ready to roll!

[2025-05-13 23:31:51 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:31:51 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-13 23:31:53,698 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
Token indices sequence length is longer than the specified maximum sequence length for this model (21987 > 16384). Running this sequence through the model will result in indexing errors
2025-05-13 23:32:13,065 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
[2025-05-13 23:32:13 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 1
[2025-05-13 23:32:13 DP4 TP4] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 2
[2025-05-13 23:32:13 DP1 TP1] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 2
[2025-05-13 23:32:13 DP2 TP2] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 1
[2025-05-13 23:32:13 DP3 TP3] Prefill batch. #new-seq: 3, #new-token: 91, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:32:13 DP5 TP5] Prefill batch. #new-seq: 3, #new-token: 91, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:32:13 DP6 TP6] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 1
[2025-05-13 23:32:13 DP7 TP7] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:32:16 DP4 TP4] Prefill batch. #new-seq: 3, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-13 23:32:16 DP0 TP0] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 1
[2025-05-13 23:32:16 DP6 TP6] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 2
[2025-05-13 23:32:16 DP1 TP1] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 1
[2025-05-13 23:32:16 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-05-13 23:32:16 DP2 TP2] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0
2025-05-13 23:32:17,537 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,540 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,541 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,543 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,558 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,611 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,659 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:17,712 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
[2025-05-13 23:32:19 DP6 TP6] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 2
[2025-05-13 23:32:19 DP2 TP2] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-05-13 23:32:19 DP1 TP1] Prefill batch. #new-seq: 2, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 1
[2025-05-13 23:32:19 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 1024, #cached-token: 0, token usage: 0.00, #running-req: 2, #queue-req: 1
[2025-05-13 23:32:19 DP3 TP3] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, token usage: 0.00, #running-req: 3, #queue-req: 0
[2025-05-13 23:32:19 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 2, #queue-req: 0
2025-05-13 23:32:19,312 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-05-13 23:32:19,332 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [18,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [18,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [18,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... ~350 further identical `srcIndex < srcSelectDimSize` assertions across many blocks/threads; log truncated ...]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [31,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1504: indexSelectSmallIndex: block: [43,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[2025-05-13 23:32:20 DP4 TP4] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
    self.forward_thread_func_()
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 148, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 206, in forward_batch_generation
    logits_output = self.model_runner.forward(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1095, in forward
    return self.forward_decode(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1036, in forward_decode
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1522, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1755, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1766, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1442, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1755, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1766, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1224, in forward
    return self.forward_ffn_with_full_input(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1284, in forward_ffn_with_full_input
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1755, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1766, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 304, in forward
    return self.forward_normal(hidden_states)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 313, in forward_normal
    self.experts(hidden_states=hidden_states, router_logits=router_logits)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1755, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1766, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 631, in forward
    final_hidden_states = self.quant_method.apply(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 966, in apply
    return fused_experts(
           ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1258, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/opt/conda/lib/python3.11/site-packages/torch/_ops.py", line 1208, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1097, in inplace_fused_experts
    fused_experts_impl(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1431, in fused_experts_impl
    sorted_token_ids, expert_ids, num_tokens_post_padded = moe_align_block_size(
                                                           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 696, in moe_align_block_size
    sorted_ids.fill_(topk_ids.numel())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 117, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 659, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 701, in set_stream
    _set_stream_by_id(
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 683, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


[rank4]:[E513 23:32:20.992998317 ProcessGroupNCCL.cpp:1981] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x76d89897e628 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x55 (0x76d89891a24f in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3e2 (0x76d90ca645d2 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x76d80c45be16 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x76d80c46c110 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x76c (0x76d80c46d7dc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x76d80c46f24d in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x76d91bb9dbf4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x76d91ed281ca in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x76d91e1f98d3 in /usr/lib64/libc.so.6)

Fatal Python error: Aborted

Thread 0x000076bb8effd700 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1778 in watchdog_thread
  File "/opt/conda/lib/python3.11/threading.py", line 975 in run
  File "/opt/conda/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/opt/conda/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x000076bb8ffff700 (most recent call first):
  File "/opt/conda/lib/python3.11/threading.py", line 324 in wait
  File "/opt/conda/lib/python3.11/threading.py", line 622 in wait
  File "/opt/conda/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/opt/conda/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x000076d3faffd700 (most recent call first):
  File "/opt/conda/lib/python3.11/threading.py", line 324 in wait
  File "/opt/conda/lib/python3.11/threading.py", line 622 in wait
  File "/opt/conda/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/opt/conda/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x000076d54ffec700 (most recent call first):
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 57 in _recv_msg
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 182 in _read_thread
  File "/opt/conda/lib/python3.11/threading.py", line 975 in run
  File "/opt/conda/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/opt/conda/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x000076d91f168640 (most recent call first):
  File "/opt/conda/lib/python3.11/threading.py", line 320 in wait
  File "/opt/conda/lib/python3.11/threading.py", line 622 in wait
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 184 in resolve_last_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 50 in process_batch_result_prefill
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1604 in process_batch_result
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 691 in event_loop_overlap
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2233 in run_scheduler_process
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/opt/conda/lib/python3.11/multiprocessing/spawn.py", line 133 in _main
  File "/opt/conda/lib/python3.11/multiprocessing/spawn.py", line 120 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, _cffi_backend, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zmq.backend.cython._zmq, PIL._imaging, psutil._psutil_linux, psutil._psutil_posix, setproctitle._setproctitle, yaml._yaml, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sentencepiece._sentencepiece, zstandard.backend_c, cuda_utils, regex._regex, __triton_launcher (total: 44)
[2025-05-13 23:32:22] Child process unexpectedly failed with an exit code 131. pid=434
^CProcess Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 262, in run_detokenizer_process
    manager.event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/detokenizer_manager.py", line 106, in event_loop
    recv_obj = self.recv_from_scheduler.recv_pyobj()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/zmq/sugar/socket.py", line 989, in recv_pyobj
    msg = self.recv(flags)
          ^^^^^^^^^^^^^^^^
  File "_zmq.py", line 1147, in zmq.backend.cython._zmq.Socket.recv
  File "_zmq.py", line 1182, in zmq.backend.cython._zmq.Socket.recv
  File "_zmq.py", line 1337, in zmq.backend.cython._zmq._recv_copy
  File "_zmq.py", line 169, in zmq.backend.cython._zmq._check_rc
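
For anyone triaging this report: the wall of `srcIndex < srcSelectDimSize` assertions at the top comes from ATen's index_select kernel rejecting out-of-range indices on the device. Because CUDA reports such errors asynchronously, the Python traceback only surfaces later, at an unrelated call (`sorted_ids.fill_(topk_ids.numel())` inside moe_align_block_size). Below is a minimal, hypothetical sketch (not the server code path) that reproduces the same device-side assert and shows how CUDA_LAUNCH_BLOCKING=1 pins the failure to the offending launch:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA initializes to get synchronous error reporting

import torch

src = torch.randn(4, 8, device="cuda")
bad_idx = torch.tensor([0, 2, 7], device="cuda")  # 7 >= src.size(0), so this index is out of range
out = torch.index_select(src, 0, bad_idx)         # fires `srcIndex < srcSelectDimSize` in Indexing.cu
torch.cuda.synchronize()                          # without launch blocking, the error surfaces here or at a later op

With launch blocking enabled, the RuntimeError points at the index_select call itself rather than at whichever later op happens to synchronize first, which is usually the quickest way to identify which id tensor exceeded the table it indexes.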

pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 16, 2025
* fix: update pr-test-sgl-kernel (sgl-project#5399)

* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)

* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)

* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)

* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)

* Fix bench_serving with random-ids (sgl-project#5214)

* [misc] fix ci flaky case (sgl-project#5352)

* [FIX] Fix concatenation error in capture_bs when --disable-cuda-graph-padding is enabled and MTP is not used (sgl-project#5412)

* Support dynamic connection and TP 16 (sgl-project#5351)

Co-authored-by: luoyuan.luo <[email protected]>

* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)

* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)

Signed-off-by: Shangming Cai <[email protected]>
Co-authored-by: ybyang <[email protected]>

* Distinguish bootstrap key only in decode server (sgl-project#5422)

* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)

* [minor] cleanup cmakelists.txt (sgl-project#5420)

* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)

* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)

* fix: solve release issue (sgl-project#5434)

* Blackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)

* feat: update model_specific_adjustment (sgl-project#5344)

Co-authored-by: hebiao064 <[email protected]>

* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)

* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)

* add attention backend supporting matrix in the doc (sgl-project#5211)

Co-authored-by: Stefan He <[email protected]>

* Support BNB quantization for llama/mllama (sgl-project#5038)

Co-authored-by: Yuhao Yang <[email protected]>

* [Docs] Update start/install.md (sgl-project#5398)

* [Minor] Move torch.compile patch to a better place (sgl-project#5397)

* [Bug fix] need record start time in pd mode (sgl-project#5425)

* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)

* chore: bump v0.4.5.post1 (sgl-project#5445)

* Fix several minor issues in PD disaggregation (sgl-project#5444)

* [doc] Update benchmark_and_profiling.md (sgl-project#5449)

* Update cutlass dependency. (sgl-project#5447)

* add multi-lora feature in README.md (sgl-project#5463)

* Clean up imports (sgl-project#5467)

* [verl] Modify the update_weights func to align with verl's resharding (sgl-project#5345)

Co-authored-by: Chayenne <[email protected]>

* [Model Support] unsloth/Phi-4-mini bnb model (sgl-project#4982)

Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* Update attention_backend.md: plural form (sgl-project#5489)

* Add test for flash_attn_varlen_func kernel (sgl-project#5484)

* Deprecate disable-mla (sgl-project#5481)

* Deprecate enable-flashinfer-mla and enable-flashmla (sgl-project#5480)

* Feat/support encoder model (like bert) (sgl-project#4887)

* Enable local attention during decode (sgl-project#5479)

* Refactor DeepSeek decoder layer branches (sgl-project#5205)

* Fix a link in sgl-kernel/README.md (sgl-project#5493)

* [Bug fix] use correct func path in deepseek (sgl-project#5496)

Signed-off-by: Xuchun Shang <[email protected]>

* Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (sgl-project#5503)

* [Feat] Update sgl-kernel flashinfer to latest main version (sgl-project#5500)

Co-authored-by: zhyncs <[email protected]>

* Fix: Incorrect parameters passed to forward_batch_generation (sgl-project#5506) (sgl-project#5511)

* Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (sgl-project#5426)

Co-authored-by: ocss884 <[email protected]>

* [docs] Fix several consistency issues in sampling_params.md (sgl-project#5373)

Signed-off-by: windsonsea <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>

* Configuration qwen2_moe.py - qkv_bias now in transformers (sgl-project#5512)

* Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (sgl-project#4836)

* Sgl kernel fused_moe_gate support n_shared_experts (sgl-project#5440)

* chore: bump sgl-kernel 0.0.9.post2 (sgl-project#5518)

* use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel (sgl-project#5473)

Co-authored-by: Zhang Kaihong <[email protected]>

* fix kimi vl running bug after rebase main (sgl-project#5461)

* fix bug of VLLM_AVAILABLE not defined (sgl-project#5497)

* Avoid computing lse in Ragged Prefill when there's no prefix. (sgl-project#5476)

Co-authored-by: Baizhou Zhang <[email protected]>

* [Model] Adding Qwen3 and Qwen3MoE (sgl-project#4693)

* fix util import (sgl-project#5542)

* Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (sgl-project#5544)

* chore: upgrade sgl-kernel 0.0.9.post2 (sgl-project#5540)

* Fix DeepGEMM masked cannot be run on groups not being multiple or 4 (sgl-project#5340)

* Make profiler output file names consistent (sgl-project#5548)

* [PD] Tiny fix timeout error when generate (sgl-project#5545)

* [PD] Fix no cache connect for receiver (sgl-project#5534)

* feat: use flashinfer jit package (sgl-project#5547)

* [PD] Remove the requirement of config file for mooncake backend  (sgl-project#5460)

* restruct compressed_tensors_w8a8_fp8 (sgl-project#5475)

* simplify the control logic for using shared experts fusion (sgl-project#5504)

* Remove one kernel in per_tensor_quant_mla_fp8 (sgl-project#5549)

* Fix sampler nan check when calling top_k_top_p_sampling_from_probs (sgl-project#5546)

* [PD] Support page size > 1 (sgl-project#5561)

* fix hicache write back (sgl-project#5543)

* Minor update for ROCm variable style (sgl-project#5562)

* Fix bench_one_batch producing unnatural results for expert parallel (sgl-project#5149)

* [perf] introduce deep gemm group_gemm_masked as bmm (sgl-project#5432)

* [PD] Fix DeepSeek cannot be run on latest master (sgl-project#5568)

* Fix BumpAllocator error when no input_ids (sgl-project#5564)

* enable DeepSeek V3 shared_experts_fusion in sm90 (sgl-project#5571)

* [Fix] fix outlines and xgrammar (sgl-project#4947)

* [Doc]Add instruction for profiling with bench_one_batch (sgl-project#5581)

* Release v0.4.5.post2 (sgl-project#5582)

* Fix bench_serving fail when zero warmup requests (sgl-project#5574)

* Fix DeepEP cannot run on latest master (sgl-project#5567)

* Fix torch memory saver not enabled in DP scenario (sgl-project#5560)

* Super tiny fix typo (sgl-project#5559)

* Add document for LoRA serving (sgl-project#5521)

* Tiny improve error message (sgl-project#5526)

* [PD] Fix server crash when using batch requests (sgl-project#5531)

* [Feat] upgrade pytorch2.6 (sgl-project#5417)

* Fix enable chunked prefill for Llama4 (sgl-project#5575)

* fix: use fa3 for gemma2 (sgl-project#5586)

* Fix ChatCompletionMessageGenericParam to allow for None content (sgl-project#5452)

* [PD] Fix large page size + chunk prefill (sgl-project#5588)

* Add test config yamls for Deepseek v3 (sgl-project#5433)

* [Feature] Prefill assistant response - add continue_final_message parameter (sgl-project#4226)

Co-authored-by: Chayenne <[email protected]>

* add function call parser for DeepSeek V3 (sgl-project#5224)

* smaller and non gated models for docs (sgl-project#5378)

* Feat: Implement JSON Mode (response_format.type="json_object") (sgl-project#4733)

Co-authored-by: Kyle Pena <[email protected]>

* check marlin format before attempting conversion (sgl-project#4675)

* compressed_tensors: port w8a16 fp8 from vllm (sgl-project#4852)

* Fix one more issue reported by torchfix (sgl-project#4859)

* Add sanity check for max_running_requests (sgl-project#5016)

* Correct grafana heatmap. (sgl-project#5019)

* Perform Batch Tokenization. (sgl-project#5141)

* Speedup shared expert weight construction by avoid cloning (sgl-project#5188)

* Tiny add Engine.flush_cache API (sgl-project#5241)

* [misc] remove is_cuda_available (sgl-project#5319)

* Fix flush cache (sgl-project#5590)

* Add Speculative Decoding Eagle3 topk > 1 (sgl-project#5318)

Co-authored-by: Stefan He <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>

* upstream hicache fixes (sgl-project#5570)

* Tiny add warning when cannot recognize bool env var (sgl-project#5348)

* Modify metrics service endpoint (sgl-project#3443)

* Update protocol.py to fix sgl-project#4589 (sgl-project#4590)

* [Feat.] Enable grafana to show metrics (sgl-project#4718)

Co-authored-by: zhaochenyang20 <[email protected]>

* [Fix] Enhance DP Attention for IPv6 Compatibility (sgl-project#4937)

* Support o1 model on Azure (sgl-project#4980)

Co-authored-by: Shan Yu <[email protected]>

* Tiny remove duplicated code (sgl-project#5021)

* Tiny update error hint (sgl-project#5037)

* Support PD bootstrap fields on /v1/chat/completions endpoint (sgl-project#5488)

* [PD] Fix generate endpoint of min_lb for PD (sgl-project#5598)

Signed-off-by: Shangming Cai <[email protected]>

* [PD] Fix edge case and simplify large page size + chunked prefill (sgl-project#5589)

* [PD] Add NIXL transfer backend  (sgl-project#5477)

* [PD] Support decode overlap schedule (sgl-project#5608)

* [PD] Support prefill overlap + Ensure no race condition (sgl-project#5609)

* Enhance GPU memory settings (sgl-project#5604)

* [feature] enable pre compile jit deep_gemm (sgl-project#5580)

* Clean up mem settings (sgl-project#5610)

* Support aiter RMSNorm in AMD (sgl-project#5510)

Co-authored-by: JieXin Liang <[email protected]>

* chore: bump v0.4.5.post3 (sgl-project#5611)

* Remove extra copy in deepseek forward absorb (sgl-project#5578)

Co-authored-by: saienduri <[email protected]>

* [Doc] Fix a 404 link to llama-405b (sgl-project#5615)

Signed-off-by: windsonsea <[email protected]>

* [fix] force use deepgemm in compile_deep_gemm (sgl-project#5618)

* [fix] fix compile_deep_gemm missing kv_b_proj (sgl-project#5620)

* fix: gemma 3 not use softcap (sgl-project#5622)

* Fix FA3 DeepSeek prefill performance regression (sgl-project#5624)

Co-authored-by: ispobock <[email protected]>

* [NFC] Remove duplicate `compressed-tensors` (sgl-project#5640)

* Fix shared experts fusion error without quantization (sgl-project#5632)

* [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (sgl-project#5641)

Co-authored-by: yuethe <[email protected]>

* fix flashmla bug (sgl-project#5272)

* [fix] reduce dp capture bs (sgl-project#5634)

Co-authored-by: alcanerian <[email protected]>

* Remove q concat in FA3 backend for DeepSeek decode (sgl-project#5638)

* Revert "Support aiter RMSNorm in AMD" (sgl-project#5646)

* fix: update bench_speculative (sgl-project#5649)

* Turn on DeepGemm By Default and Update Doc (sgl-project#5628)

* Fuse q_a_proj and kv_a_proj (sgl-project#5619)

* Remove unnecessary `torch.full` in DeepSeek (sgl-project#5601)

* [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (sgl-project#5281)

* fix sgl-kernel unit tests (sgl-project#5666)

* fix awq_dequantize import (sgl-project#5669)

* Integrating PD disaggregation with DP attention and DeepEP (sgl-project#5435)

Co-authored-by: Byron Hsu <[email protected]>

* fix gemma3 unit test (sgl-project#5670)

* fix torchvision::nms not exist (sgl-project#5671)

* [PD] Add support for dp attention with mooncake (sgl-project#5530)

Signed-off-by: Shangming Cai <[email protected]>

* tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (sgl-project#5677)

* [Doc] Fix two 404 links caused by sglang typo (sgl-project#5667)

Signed-off-by: windsonsea <[email protected]>

* fix: update truss bench_serving (sgl-project#5683)

* fix: only compile ApplyTokenBitmaskInplace cu124+ (sgl-project#5686)

* chore: bump sgl-kernel 0.1.0 (sgl-project#5688)

* vlm: enable radix cache for qwen-vl models (sgl-project#5349)

Co-authored-by: Xinyuan Tong <[email protected]>

* [BugFix] Fix combination of MTP and `--n-share-experts-fusion`with R1 (sgl-project#5707)

* Fix weight loading bug for Deepseek v3+nextn (sgl-project#5684)

* Add example to use sgl engine with fastapi (sgl-project#5648)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* [Doc] Fix a link to Weilin Zhao (sgl-project#5706)

Signed-off-by: windsonsea <[email protected]>

* Add MMMU benchmark results (sgl-project#4491)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) (sgl-project#5078)

Co-authored-by: vincent-4 <[email protected]>

* [PD] Better logs (sgl-project#5715)

* [PD] Add kvargs table and thread pool for kvcache sender of mooncake (sgl-project#5738)

Signed-off-by: Shangming Cai <[email protected]>

* [PD]: Support Multi Prefill in one node (sgl-project#5704)

Co-authored-by: shuaills <[email protected]>

* Fix: deepseek forward absorb (sgl-project#5723)

Co-authored-by: ispobock <[email protected]>

* Pin torch audio to 2.6.0 (sgl-project#5750)

* Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" (sgl-project#5754)

* Disable flaky eagle tests (sgl-project#5753)

* update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (sgl-project#5740)

* [Docs] Update runtime/engine/readme.md (sgl-project#5737)

Signed-off-by: windsonsea <[email protected]>

* Reorder loop in shared expert weight loading (sgl-project#5719)

* fix: fix one more bug from merging mm_inputs (sgl-project#5718)

Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>

* [Fix]: support deepseek-vl2-tiny model (sgl-project#5552)

Co-authored-by: bppps <[email protected]>

* Bugfix for minicpmo vision test (sgl-project#5760)

* [Minor] fix documentations (sgl-project#5756)

* Add an assertion to enhance the robustness of the operator (sgl-project#5736)

* fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (sgl-project#5733)

* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* fix the non-existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <[email protected]>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <[email protected]>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD] Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <[email protected]>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <[email protected]>

* [Misc] add structured logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <[email protected]>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* fix for hpu backend in model runner and server args

Signed-off-by: Mohit Sinha <[email protected]>

* rebase formatting issue

Signed-off-by: Mohit Sinha <[email protected]>

* [SW-228218]: Fix device mismatch in frequency penalty.

Ensure tensors in BatchedFrequencyPenalizer are on the same device by
moving output_ids and frequency_penalties to the device of
cumulated_frequency_penalties. This resolves a RuntimeError
caused by tensors on cpu and hpu:0 during logits subtraction.
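
A hedged, standalone illustration of that mismatch (the report's device is hpu:0; cuda stands in here, and the real BatchedFrequencyPenalizer internals may differ):

import torch

# Stand-in for the accelerator-resident logits (hpu:0 in the report).
logits = torch.zeros(2, 8, device="cuda")
# Penalty buffer accidentally left on the cpu.
cumulated_frequency_penalties = torch.rand(2, 8)
# Subtracting directly would raise "Expected all tensors to be on the same device";
# the fix moves the penalty tensors to one device first.
cumulated_frequency_penalties = cumulated_frequency_penalties.to(logits.device)
logits -= cumulated_frequency_penalties  # single-device subtraction succeeds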

---------

Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Xuchun Shang <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Kebe <[email protected]>
Signed-off-by: congcongke <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Zhaoyang Hao <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: hebiao064 <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: mRSun15 <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuhao Yang <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Elfie Guo <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: BearBiscuit <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: eigen <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Didier Durand <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Xuchun Shang <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: PGFLMG <[email protected]>
Co-authored-by: u4lr451 <[email protected]>
Co-authored-by: ocss884 <[email protected]>
Co-authored-by: Michael Feil <[email protected]>
Co-authored-by: strgrb <[email protected]>
Co-authored-by: Zhang Kaihong <[email protected]>
Co-authored-by: liwenju0 <[email protected]>
Co-authored-by: Wenxuan Tan <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: Zhaoyi Li <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: tarinkk <[email protected]>
Co-authored-by: AmadeusW <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: Yi Zhou <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: kyle-pena-kuzco <[email protected]>
Co-authored-by: Kyle Pena <[email protected]>
Co-authored-by: Enrique Shockwave <[email protected]>
Co-authored-by: Juwan Yoo <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: mac0ne <[email protected]>
Co-authored-by: Sundara Raman Ramachandran <[email protected]>
Co-authored-by: Qingquan Song <[email protected]>
Co-authored-by: moontidef <[email protected]>
Co-authored-by: Huapeng Zhou <[email protected]>
Co-authored-by: Lucius <[email protected]>
Co-authored-by: Chuyue Sun <[email protected]>
Co-authored-by: Shan Yu <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: michael-amd <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: ispobock <[email protected]>
Co-authored-by: Connector Switch <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: yuethe <[email protected]>
Co-authored-by: alcanerian <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: vincent-4 <[email protected]>
Co-authored-by: IAN <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: ZXN <[email protected]>
Co-authored-by: bppps <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Kyungmin Lee <[email protected]>
Co-authored-by: vzed <[email protected]>
Co-authored-by: DavidBao <[email protected]>
Co-authored-by: Frankey_8080 <[email protected]>
Co-authored-by: yan97ao <[email protected]>
Co-authored-by: aoshen524 <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: zhanweidu <[email protected]>
Co-authored-by: NoahM <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: JiLi <[email protected]>
Co-authored-by: sighingnow <[email protected]>
Co-authored-by: XTY <[email protected]>
Co-authored-by: vikram singh shekhawat <[email protected]>
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 23, 2025
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* fix the non-existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <[email protected]>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <[email protected]>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD] Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <[email protected]>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <[email protected]>

* [Misc] add structured logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <[email protected]>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)
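
  For example, assuming an OpenAI-compatible server on localhost port 30000 (the model name is illustrative), a client can now cap generation with the newer OpenAI field:

  ```python
  from openai import OpenAI

  client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
  resp = client.chat.completions.create(
      model="deepseek-ai/DeepSeek-R1",
      messages=[{"role": "user", "content": "Say hi in five words."}],
      max_completion_tokens=64,  # newer field, honored alongside legacy max_tokens
  )
  print(resp.choices[0].message.content)
  ```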

* simplify fused_moe config logging (sgl-project#5801)

* [CI] tune the test order to warmup the server (sgl-project#5860)

* Cutlass MLA decode - fix dtype error (sgl-project#5868)

* Support CUTLASS 3.9 to improve fp8_blockwise_gemm (sgl-project#5820)

* [Feature] support auto chat template (sgl-project#4949)

* Feat: support cuda graph for LoRA (sgl-project#4115)

Co-authored-by: Beichen Ma <[email protected]>
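
  For background, the core PyTorch capture/replay pattern looks roughly like the sketch below (illustrative shapes and buffers, not the PR's code); the PR applies this idea so LoRA's extra matmuls run inside the captured decode graph instead of forcing an eager fallback.

  ```python
  import torch

  x = torch.randn(8, 16, device="cuda")
  w = torch.randn(16, 16, device="cuda")
  out = torch.empty(8, 16, device="cuda")

  s = torch.cuda.Stream()                      # warm up on a side stream first
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):
          out.copy_(x @ w)
  torch.cuda.current_stream().wait_stream(s)

  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):                    # shapes/addresses must stay static
      out.copy_(x @ w)

  x.copy_(torch.ones_like(x))                  # update inputs in place...
  g.replay()                                   # ...and replay the recorded kernels
  ```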

* Add qwen3 30b fused moe config (sgl-project#5859)

* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875)

Co-authored-by: pengcuo <[email protected]>

* Add A800 fused moe config for qwen3 30b (sgl-project#5880)

* [Misc] add service discovery for sgl router

* [fix]: PyO3 macOS linking and consolidate on tracing for logging

* chore: update Dockerfile (sgl-project#5894)

* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)

* [Doc] Tables instead of bullet points for sampling doc (sgl-project#5841)

* chore: update CODEOWNERS (sgl-project#5895)

* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)

* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)

* Auto set draft model path for MTP (sgl-project#5793)

* [fix] relax mem_fraction_static for h200 (sgl-project#5893)

Co-authored-by: alcanerian <[email protected]>

* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)

* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)

* Add AMD MI300x Nightly Testing. (sgl-project#5861)

* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)

* Fix check_env script (sgl-project#5901)

* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)

* Bump Flashinfer to 0.2.5 (sgl-project#5870)

Co-authored-by: Yuhao Chen <[email protected]>

* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)

* Add A800 fused moe config for qwen3 235b (sgl-project#5900)

* Add sm_120 for blackwell (sgl-project#5903)

* [Feature] add support kimi vl model (sgl-project#5383)

Co-authored-by: wenju.li <[email protected]>

* support vlm benchmark profile (sgl-project#5905)

* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)

* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)

* [qwen3] support qwen3 ep moe (sgl-project#5917)

Co-authored-by: sleepcoo <[email protected]>

* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)

* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912)

Co-authored-by: zhyncs <[email protected]>

* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)

* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)

* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)

* [PP] Add pipeline parallelism (sgl-project#5724)

* Fix lora batch processing when input lora_path contains None (sgl-project#5930)

* add Thor & Spark (sgl-project#5915)

* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)

* fix: update model runner (sgl-project#5934)

* chore: bump v0.4.6.post2 (sgl-project#5939)

* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)

* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834)

Co-authored-by: luoyuan.luo <[email protected]>
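
  A sketch of the vectorization idea (the actual helper's signature may differ): run boundaries in a sorted index array are found with `np.diff` instead of a Python-level loop.

  ```python
  import numpy as np

  def group_concurrent_contiguous(idx: np.ndarray):
      # Split a sorted 1-D index array into runs of consecutive values:
      # a boundary is any position where the step between neighbours is not 1.
      if idx.size == 0:
          return []
      breaks = np.flatnonzero(np.diff(idx) != 1) + 1
      return np.split(idx, breaks)

  print(group_concurrent_contiguous(np.array([3, 4, 5, 9, 10, 42])))
  # [array([3, 4, 5]), array([ 9, 10]), array([42])]
  ```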

* Remove extra contiguous (sgl-project#5953)

* Update ci test and doc for MTP api change (sgl-project#5952)

* docs: Fix Qwen model typo (sgl-project#5944)

Signed-off-by: JiangJiaWei1103 <[email protected]>

* Optimize a pad operation to accelerate by 25us (sgl-project#5945)

* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)

* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)

* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)

* feat: Refactor DeepSeekV3 function call (sgl-project#5908)

* Remove token-in token-out in Native API (sgl-project#5967)

* Support InternVL3 (sgl-project#5350)

Co-authored-by: Mick <[email protected]>
Co-authored-by: Chayenne <[email protected]>

* Support MMMU benchmark for InternVL (sgl-project#5968)

* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)

* Fix set kv cache multi-stream (sgl-project#5975)

* Overlap qk norm with two streams (sgl-project#5977)
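
  Schematically (using `F.normalize` as a stand-in for the real norm kernels), the two independent normalizations can be issued on separate CUDA streams and joined before attention consumes the results:

  ```python
  import torch
  import torch.nn.functional as F

  q = torch.randn(32, 128, device="cuda")
  k = torch.randn(32, 128, device="cuda")

  s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
  cur = torch.cuda.current_stream()
  s1.wait_stream(cur)  # side streams must see q/k fully materialized
  s2.wait_stream(cur)
  with torch.cuda.stream(s1):
      qn = F.normalize(q, dim=-1)  # stand-in for the q norm kernel
  with torch.cuda.stream(s2):
      kn = F.normalize(k, dim=-1)  # stand-in for the k norm kernel
  cur.wait_stream(s1)  # rejoin before the attention kernel reads qn/kn
  cur.wait_stream(s2)
  ```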

* fix: only upgrade nccl for cu128 (sgl-project#5986)

* Fix Phi3 serving, which was broken by an earlier change (sgl-project#5991)

Co-authored-by: Lifu Huang <[email protected]>

* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)

* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)

* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012)

Signed-off-by: Lifu Huang <[email protected]>

* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)

* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)

* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)

* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)

* Update dev container config to support live code sync and improve docker setup guide (sgl-project#6018)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] Optimize disaggregation ib device help info (sgl-project#5781)

* [Test] Add flashmla attention backend test (sgl-project#5587)

* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)

* feat: Add a unified merge_state API (sgl-project#5428)
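
  As a reference sketch of the math such an API implements (the unified operator itself is a fused GPU kernel whose exact signature may differ): two partial attention results computed over disjoint KV chunks are combined via their log-sum-exp terms.

  ```python
  import torch

  def merge_state(o_a, lse_a, o_b, lse_b):
      # o_*: [tokens, heads, head_dim] partial outputs; lse_*: [tokens, heads].
      # Global softmax renormalization:
      #   out = (exp(lse_a)*o_a + exp(lse_b)*o_b) / (exp(lse_a) + exp(lse_b))
      m = torch.maximum(lse_a, lse_b)          # subtract the max for stability
      w_a = torch.exp(lse_a - m)
      w_b = torch.exp(lse_b - m)
      out = (w_a.unsqueeze(-1) * o_a + w_b.unsqueeze(-1) * o_b) \
          / (w_a + w_b).unsqueeze(-1)
      return out, m + torch.log(w_a + w_b)     # merged output and merged lse
  ```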

* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)

* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)
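
  The hardening principle, illustrated (not the PR's code): default internal sockets to loopback rather than `0.0.0.0`, so helper ports are unreachable from other hosts unless the operator opts in with an explicit bind address.

  ```python
  import socket

  srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  # ("0.0.0.0", ...) would expose the port on every interface; loopback keeps
  # it local-only. Port 0 lets the OS pick a free ephemeral port.
  srv.bind(("127.0.0.1", 0))
  srv.listen()
  print("bound to", srv.getsockname())
  ```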

* Fix prefill OOM error in the case of large page size (sgl-project#5081)

* Fix problem of large page size with chunked prefill (sgl-project#6046)

* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)

* docs: add new blog (sgl-project#6048)

* Fix missing "import os" (sgl-project#6057)

* Better PD initialization (sgl-project#5751)

* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)

* [Fix] Fix and rename flashmla CI test (sgl-project#6045)

* chore: upgrade cutlass 3.9.2 (sgl-project#6004)

Co-authored-by: yizhang2077 <[email protected]>

* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)

* Add DeepEP to CI PR Test (sgl-project#5655)

Co-authored-by: Jinyan Chen <[email protected]>

* fix custom_allreduce namespace (sgl-project#6039)

* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010)

Co-authored-by: Qiaolin-Yu <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* [Feature] Support for Ascend NPU backend (sgl-project#3853)

Signed-off-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>

* Fix the timeout for 8 gpu tests (sgl-project#6084)

* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)

* Super tiny fix doc (sgl-project#5233)

* [Doc] Fix description for dp_size argument (sgl-project#6063)

* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)

* [refactor] slightly tidy fp8 module (sgl-project#5993)

* Clean up fa3 test from 8 gpus (sgl-project#6105)

* Deferring 8 GPU test (sgl-project#6102)

* Update doc for MLA attention backends (sgl-project#6034)

* Clean logs for DeepSeek-V3 launching (sgl-project#6079)

* [CI] Add performance CI for VLM (sgl-project#6038)

Signed-off-by: Xinyuan Tong <[email protected]>

* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)

* optimize pad operations in fa3 to accelerate by 100+us (sgl-project#6077)

* Overlap shared expert and routed expert computations (sgl-project#5121)

* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)

* Tiny refactor weight loading logic (sgl-project#5232)

* [PD] Add control to slow down a server (sgl-project#5572)

* Change AMD test threshold (sgl-project#6091)

* DeepEP normal support deepgemm-contiguous (sgl-project#5626)

Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>

* [fix] fix pyproject.toml dependencies (sgl-project#6119)

* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764)

Co-authored-by: othame <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>

* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)

* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)

* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123)

Co-authored-by: zhyncs <[email protected]>

* upgrade xgrammar to 0.1.19 (sgl-project#6129)

* Remove unnecessary is_fa3_supported check (sgl-project#6112)

* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)

* docs: update README (sgl-project#6132)

* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)

* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)

* opt flashinfer mla cat (sgl-project#5822)

Co-authored-by: xuyongfei.xyf <[email protected]>

* Update amd nightly concurrency. (sgl-project#6141)

* feat: add thinking_budget (sgl-project#6089)

* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)

* fix bug where GPU 0 occupies more memory when hicache is turned on (sgl-project#5778)

Co-authored-by: Zhiqiang Xie <[email protected]>

* chore: bump v0.4.6.post3 (sgl-project#6165)

* KV-Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016)

Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>

* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)

* Fix and Clean up chat-template requirement for VLM (sgl-project#6114)

Signed-off-by: Xinyuan Tong <[email protected]>

* [Docs] Delete duplicate content (sgl-project#6146)

Co-authored-by: ximing.wxm <[email protected]>

* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)

* Added async_encode method to Engine (sgl-project#4701)

* Fix data parallel perf regression (sgl-project#6183)

* Fix request abortion (sgl-project#6184)

* Add typo checker in pre-commit (sgl-project#6179)

Co-authored-by: Brayden Zhong <[email protected]>

* Remove duplicate IO Struct test (sgl-project#6180)

Signed-off-by: Emmanuel Ferdman <[email protected]>

* [PD] Add simple unit test for disaggregation feature (sgl-project#5654)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Disabled deepep tests temporarily because they take too much time. (sgl-project#6186)

* feat: support loogle eval (sgl-project#6190)

* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)

* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)

* chore: upgrade deepgemm (sgl-project#6073)

* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)

* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196)

Co-authored-by: alcanderian <[email protected]>

* Handle empty input string for embedding models (sgl-project#5621)

Co-authored-by: Ravi Theja Desetty <[email protected]>

* doc: fix erroneous documentation and example code for Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)

* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)

* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)

* [CI] Reorganize the 8 gpu tests (sgl-project#6192)

* Add dev-deepep docker image (sgl-project#6198)

* Replace time.time() with time.perf_counter() for benchmarking. (sgl-project#6178)

Signed-off-by: Lifu Huang <[email protected]>
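
  The rationale, in a minimal example: `time.perf_counter()` is a monotonic, high-resolution interval clock, whereas `time.time()` follows the adjustable wall clock.

  ```python
  import time

  start = time.perf_counter()      # monotonic, highest-resolution interval clock
  total = sum(range(1_000_000))    # stand-in workload
  elapsed = time.perf_counter() - start
  print(f"step took {elapsed * 1e3:.3f} ms")
  # time.time() tracks the wall clock, which NTP can slew and which may tick
  # coarsely, so short intervals measured with it can be skewed or even negative.
  ```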

* Update README.md (sgl-project#6202)

* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)

* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)

* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)

* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558)

Co-authored-by: liusy58 <[email protected]>
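
  A schematic of the technique (names and sharding layout are illustrative, not the PR's implementation): each TP rank multiplies by its contiguous shard of the LM head's vocabulary rows, and the partial logits are all-gathered along the vocab dimension.

  ```python
  import torch
  import torch.distributed as dist

  def vocab_parallel_logits(hidden, lm_head_shard, group=None):
      # hidden: [tokens, hidden_dim]; lm_head_shard: [vocab / tp, hidden_dim].
      # Each rank computes logits only for its slice of the vocabulary...
      partial = hidden @ lm_head_shard.t()             # [tokens, vocab / tp]
      world = dist.get_world_size(group)
      shards = [torch.empty_like(partial) for _ in range(world)]
      # ...then the slices are gathered so every rank sees the full distribution.
      dist.all_gather(shards, partial, group=group)
      return torch.cat(shards, dim=-1)                 # [tokens, vocab]
  ```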

* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)

* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201)

Co-authored-by: SangBin Cho <[email protected]>

* [CI] Fix PD mooncake dependency error (sgl-project#6212)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Re-enable pd disaggregation test (sgl-project#6231)

Signed-off-by: Shangming Cai <[email protected]>

* fix some typos (sgl-project#6209)

Co-authored-by: Brayden Zhong <[email protected]>

* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)

* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)

* Revert "fix some typos" (sgl-project#6244)

* chore: add hf_xet dep (sgl-project#6243)

* Update AMD nightly deps. (sgl-project#6241)

* [PD] Add support for different TP sizes per DP rank (sgl-project#5922)

Signed-off-by: Shangming Cai <[email protected]>

* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225)

Co-authored-by: SangBin Cho <[email protected]>

* fix typo (sgl-project#6248)

* Support tuning moe for llama 4 model (sgl-project#6042)

* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)

* [Llama4] Add docs note about enable multimodal (sgl-project#6235)

* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)

* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657)

Co-authored-by: liusy58 <[email protected]>
Co-authored-by: 颉沆 <[email protected]>

* model(vlm): pixtral (sgl-project#5084)

* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)

* Enable MI325X AMD CI. (sgl-project#6259)

* chore: bump v0.4.6.post4 (sgl-project#6245)

* formatting fix for the rebased commit for 4.6.0_post4

Signed-off-by: Mohit Sinha <[email protected]>

* fix issues in model runner and python packages

fixes for the following issues:
> vLLM dependency for xgrammar==0.1.17
> 'Scheduler' object has no attribute 'device'
> 'pp_proxy_tensors' unexpected arg in HPUGraphRunner
> TODO: Add pipeline parallelism support in HPUGraphRunner

Signed-off-by: Mohit Sinha <[email protected]>

* fix formatting in model runner

Signed-off-by: Mohit Sinha <[email protected]>

* base grammar fix for the is_terminated case

>  'OutlinesGrammar' object has no attribute 'is_terminated'

Signed-off-by: Mohit Sinha <[email protected]>

---------

Signed-off-by: Kebe <[email protected]>
Signed-off-by: congcongke <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: Lifu Huang <[email protected]>
Signed-off-by: Song Zhang <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Emmanuel Ferdman <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Wenxuan Tan <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: vzed <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DavidBao <[email protected]>
Co-authored-by: Frankey_8080 <[email protected]>
Co-authored-by: Stefan He <[email protected]>
Co-authored-by: yan97ao <[email protected]>
Co-authored-by: aoshen524 <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: zhanweidu <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Huapeng Zhou <[email protected]>
Co-authored-by: NoahM <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: JiLi <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: PGFLMG <[email protected]>
Co-authored-by: sighingnow <[email protected]>
Co-authored-by: XTY <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Beichen Ma <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: alcanerian <[email protected]>
Co-authored-by: Yuhao Chen <[email protected]>
Co-authored-by: zhjunqin <[email protected]>
Co-authored-by: liwenju0 <[email protected]>
Co-authored-by: wenju.li <[email protected]>
Co-authored-by: laixin <[email protected]>
Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: 江家瑋 <[email protected]>
Co-authored-by: KCFindstr <[email protected]>
Co-authored-by: xm:D <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: Junrong Lin <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: Hank Han <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: ishandhanani <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Minglei Zhu <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>
Co-authored-by: Zhu Chen <[email protected]>
Co-authored-by: othame <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: Yixin Dong <[email protected]>
Co-authored-by: xu-yfei <[email protected]>
Co-authored-by: xuyongfei.xyf <[email protected]>
Co-authored-by: thyecust <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: Simon (Jiyou) Li <[email protected]>
Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Steven Shimizu <[email protected]>
Co-authored-by: applesaucethebun <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: Emmanuel Ferdman <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: alcanderian <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: liusy58 <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: 颉沆 <[email protected]>
Co-authored-by: Kiv Chen <[email protected]>