[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834

Open

jasl wants to merge 34 commits into vllm-project:main from jasl:codex/ds4-sm120-min-enable

Conversation

@jasl (Contributor) commented May 6, 2026

Purpose

Enable DeepSeek V4 Flash on SM12x Blackwell consumer hardware
(RTX PRO 6000 Workstation Edition, RTX 5090, DGX Spark GB10).
The core challenge: SM12x lacks the TMEM / tcgen05 instructions
present on datacenter Blackwell (SM10x), so DeepGEMM, FlashMLA,
and Marlin's FP8 paths fail at kernel link time on this hardware.
This PR provides pure-PyTorch fallbacks, Triton kernel
implementations, and SM12x-specific tuning so the model runs
end-to-end with production-quality perf.

Validation results

Hardware: 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition
(TP=2 EP), jasl/vllm at the head of this PR
c92696943.
Rebased on upstream/main 2026-05-17 — absorbs FlashInfer 0.6.11.post2
(native CUDA 13), the SM121 Marlin/CUTLASS family-check
(vllm-project/vllm#35568), and the upstream fixes for the CUDA-graph +
LoRA MoE-gate breakage. NCCL: nvidia-nccl-cu13 ≥ 2.30.4 (DGX Spark GB10
reliability fix, also stabilises sustained multi-stream collectives on
RTX PRO 6000 WS).

Serve: --kv-cache-dtype fp8 --block-size 256 --max-model-len 65536 --tensor-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.98 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.

--no-enable-prefix-caching was set for the baseline so MTP-K
comparison numbers aren't biased by cache hits (a companion
cache-on run is the next follow-up).

Accuracy

lm_eval gsm8k 5-shot, 200 questions, temperature=0,
max_gen_toks=2048, via /v1/completions:

| Variant | strict-match |
|---|---|
| no-MTP | 95.5 % |
| MTP=2 | 95.0 % |

Within the historical 0.948–0.965 band on this model.

Performance (mt-bench, philschmid/mt-bench, 80 prompts)

| c | no-MTP TPOT (ms) | no-MTP tok/s | no-MTP TTFT med (ms) | MTP=2 TPOT (ms) | MTP=2 tok/s | MTP=2 TTFT med (ms) |
|---|---|---|---|---|---|---|
| 1 | 9.9 | 98 | 53 | 5.7 | 165 | 59 |
| 2 | 11.7 | 163 | 72 | 7.5 | 248 | 84 |
| 4 | 14.4 | 266 | 77 | 11.8 | 316 | 100 |
| 8 | 19.4 | 380 | 86 | 13.7 | 530 | 117 |
| 16 | 27.3 | 520 | 120 | 19.3 | 721 | 161 |
| 24 | 32.6 | 607 | 391 | 23.2 | 846 | 194 |

MTP=2 peak: 165 tok/s single-stream, 846 tok/s @ c=24.
MTP=2 acceptance length 2.35–2.38 on real-content prompts, pos-0
acceptance 84–85 %.

Long-context prefill (added under PR #6 and gather-overlap)

PR #6 (_accumulate_indexed_attention_chunk_multihead_kernel,
HEAD_BLOCK=8) drops prefill TTFT 20–23 % vs the parent on long
context. The follow-on commit overlaps the C128A prefill KV gather
with the indexer forward on aux_stream[1], hiding the
dequantize_and_gather_k_cache cost behind the indexer.

vs PR #6 only (no gather-overlap) on random isl=4K/8K osl=512:

| Workload | TTFT median (ms) | TPOT (ms) | Δ TTFT | Δ TPOT |
|---|---|---|---|---|
| isl=4K c=1 | 745 → 749 | 7.0 → 6.7 | 0 % | -5 % |
| isl=4K c=16 | 5108 → 4949 | 34.7 → 34.3 | -3 % | -1 % |
| isl=8K c=1 | 1706 → 1708 | 7.0 → 7.0 | 0 % | 0 % |
| isl=8K c=16 | 10509 → 10219 | 54.5 → 54.2 | -3 % | -1 % |

The gather-overlap commit also lifts mt-bench at c=24: no-MTP
604 → 635 tok/s (+5 %), MTP=2 854 → 877 tok/s (+2.7 %). The author
reports +5 % on top of PR #6 at 128K context; at 65K we see
+2–5 %, which is consistent with the gain scaling with context
length.

Acceptance (toolcall-15 scenario battery)

| Variant | score | failures |
|---|---|---|
| no-MTP | 91 % | 13 / 135 cases |
| MTP=2 | 92 % | 12 / 135 cases |

This is the first SM12x baseline that evaluates thinking-mode
correctly. Two prior harness bugs masked thinking-mode entirely
across every earlier retry:

  1. The harness was sending extra_body.thinking={"type":"enabled"}
    at the top level, which is the Claude API shape. vLLM's
    DSv4 chat-template entry reads chat_template_kwargs.thinking
    instead, so every request silently routed to chat mode. Fixed
    by 323aa1f (confirmed in a #41834 comment by qym-ll); a request
    sketch follows this list.
  2. The transcript / replay path read message.reasoning_content,
    but the vLLM 0.x OpenAI front-end populates message.reasoning
    on this build. Same commit normalised both keys.

Before the fix, every "think-high" / "think-max" request produced
zero reasoning tokens and was indistinguishable from non-thinking.
After the fix, think-high and think-max generate 105/105 cases
with non-empty reasoning content on the same prompt set, and
the toolcall-15 score lifts ~6 pp on both variants vs the prior
baseline. The remaining failures stay concentrated in TC-06
(Multi-Value Extraction, 7/7 across modes) plus scattered TC-11 /
TC-14 / TC-15 — characteristic helpfulness-bias /
deflect-rather-than-refuse model behaviours, not regressions.

Comparison to DeepSeek's official hosted API

Same prompts run against api.deepseek.com/v1/chat/completions
with model=deepseek-v4-flash, same temperature=1.0 top_p=1.0,
and the same thinking-mode shape:

| Source | toolcall-15 score | failures / cases |
|---|---|---|
| DeepSeek hosted API | 96 % | 2 / 45 (1 round) |
| This PR, MTP=2 | 92 % | 12 / 135 (3 rounds) |
| This PR, no-MTP | 91 % | 13 / 135 (3 rounds) |

Per-case failure rate: hosted 4.4 %, this PR 8.9–9.6 %. The hosted
service either ships a checkpoint we haven't pulled from the HF
release (likely — the same deepseek-ai/DeepSeek-V4-Flash weights
gave 100 % deflection on TC-14 locally and 0 % failures on the
hosted endpoint), or injects an internal tool-use system prompt.
Either way the local vs hosted gap on this PR is the smallest it
has been in any baseline we've shipped.

vs 2026-05-12 deployment baseline (1c20f1a6d, same hardware)

| Metric | 2026-05-12 | 2026-05-17 (this PR) | Δ |
|---|---|---|---|
| no-MTP mt-bench c=1 tok/s | 89 | 98 | +10 % |
| MTP=2 mt-bench c=1 tok/s | 137 | 165 | +20 % |
| no-MTP mt-bench c=24 tok/s | 557 | 607 | +9 % |
| MTP=2 mt-bench c=24 tok/s | 706 | 846 | +20 % |

Known caveats

  • MTP=1 NCCL allgather hang under sustained multi-stream load
    (collective _ALLGATHER_BASE watchdog 600 s) — reproduced once
    in earlier baselines at c=4 mid-bench. Not in our patch surface
    (Torch NCCL ProcessGroupWatchdog), likely a spec-decode K=1
    sync edge case. MTP=1 demoted to smoke-tier pending repro on
    NCCL 2.30.4 + upstream issue.
  • MTP=3 demoted to smoke-tier — net slightly slower than
    MTP=2 at every c we measured. Worth re-checking if upstream MTP
    draft kernel becomes cheaper-per-K.
  • Prefix caching disabled in the reference numbers above. The
    locally cherry-picked #42784 fix means prefix cache DOES work
    on DSv4 SWA when enabled; we'll publish a companion run with it
    on. End users should keep prefix caching on for deployment.

Acknowledgments

  • @alexbi29 contributed three improvements landed in this revision:
    • Multi-head prefill accumulate kernel (_accumulate_indexed_attention_chunk_multihead_kernel,
      HEAD_BLOCK=8) — patterned after the existing decode
      _finish_materialized_scores_with_sink_kernel. Drops the
      prefill TTFT 20–23 % on long context on this hardware
      (sm12x: multi-head prefill accumulate kernel + drop fp8 einsum autotune jasl/vllm#6).
    • A fix for the SWA _cache_block_mask over-aggression on
      Eagle/MTP groups, landed as vllm-project/vllm#42784
      (cherry-picked locally pending upstream merge).
    • Identifying that the _deepseek_v4_sm12x_fp8_einsum_kernel
      autotune key included num_tokens, causing per-request 4-config
      re-benchmarks; we pinned the winning config and removed the
      decorator.
  • @aabbccddwasd contributed the C128A prefill KV gather overlap
    with the indexer (_aux_stream[1] overlap of
    dequantize_and_gather_k_cache with indexer.forward). The
    extracted commit aliases the gather workspace at offset 0 of the
    prefill workspace so _forward_prefill can skip its own gather
    phase and consume the pre-gathered KV directly. Adds +2.7–5 % on
    mt-bench c=24 on top of @alexbi29's multi-head kernel at 65K
    context, with the author reporting +5 % at 128K.
  • @aabbccddwasd's PR-comment suggestions also led to:
    • The per-token early-exit on sparse MLA accumulate inner
      candidate loop (-22 % TPOT at isl=8K c=16 before the
      multi-head kernel landed).
    • The C128A topk metadata kernel loop cap at effective_topk
      (-33 % long-prefill TTFT pre multi-head).
    • The multi-head prefill kernel itself (originally described in
      vllm-project/vllm#41834 discussion; implementation by
      @alexbi29 above).

AI assistance disclosure

Claude (Anthropic) was used for code review, refactoring,
regression-script writing, and baseline-bundle generation. All
kernel logic and architectural decisions were validated by human
review and end-to-end benchmarks before each push.

@claude (bot) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify (bot) added labels: deepseek (Related to DeepSeek models), nvidia, v1 — May 6, 2026
@jasl (Contributor, Author) commented May 6, 2026

@zyongye
I've cleaned up the old PR, could you help review this one?

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.

@chatgpt-codex-connector commented

💡 Codex Review

```python
def _sparse_indexer_requires_deep_gemm() -> bool:
    return current_platform.is_cuda() and not (
        current_platform.is_device_capability_family(120)
    )
```

P1: Keep DeepGEMM requirement for SM120 FP4 indexer path

This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (fp8_fp4_*) because the new SM120 fallback only handles q_scale is None (FP8 Q). With use_fp4_cache=True on SM120 and no DeepGEMM installed, construction succeeds and the first prefill/decode call fails at runtime with the DeepGEMM _missing() error instead of being rejected up front.


```python
if self.load_config.load_format == "fastsafetensors":
    weights_iterator = fastsafetensors_weights_iterator(
        hf_weights_files,
        self.load_config.use_tqdm_on_load,
    )
```

P2: Propagate weight_name_filter to fast safetensor loaders

The new pre-load weight_name_filter is only wired into safetensors_weights_iterator; this branch still loads all tensors for fastsafetensors (and similarly other non-default safetensor iterators), so skipped tensors are still materialized. For DeepSeek V4 this defeats the intended early skip of MTP weights and can reintroduce high transient memory use/OOM when these load formats are enabled.


@jasl jasl changed the title [New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash [New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes May 6, 2026
@jasl jasl force-pushed the codex/ds4-sm120-min-enable branch from 042e366 to df2e6f8 Compare May 6, 2026 16:26
jasl and others added 28 commits May 17, 2026 07:26
Fix the SM12x fp8 einsum custom-op registration import, skip unused DeepSeek V4 MTP checkpoint tensors before safetensors materialization, and release MXFP4 setup temporaries after kernel setup.

Signed-off-by: jasl <jasl9187@hotmail.com>
Forward model skip_weight_name_before_load filters into the fastsafetensors iterator and skip filtered keys before materializing tensors. This keeps DeepSeek V4 non-MTP loads from reading MTP-only weights when users select --load-format fastsafetensors.

Keep the regression coverage at behavior level by checking the DefaultModelLoader path and pruning private implementation-field assertions from the adjacent DeepSeek V4 prefix-cache tests.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
Import the production-preview warmups for DeepSeek V4 request preparation, sparse MLA attention, and mHC TileLang kernels while leaving the old warmup test fixture out of the preview branch.

Cherry-picked-from: 0dca30b

Cherry-picked-from: 5959aad

Cherry-picked-from: 7cf6f1d
Signed-off-by: jasl <jasl9187@hotmail.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
Replaces the placeholder configs added in commit 7b0f8b9 ("Add Blackwell
tuning config aliases") with real autotuning results from
benchmark_w8a8_block_fp8.py on the actual hardware.

Coverage:
- M-keys extended from [1, 4, 8, 16, 32] to
  [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] — adds short-prefill (M=64..128)
  and long-prefill (M=256, 512) anchors that decode dispatch was previously
  rounding down to the "M=32" placeholder.
- 6 (N, K) shapes × 4 device variants (RTX PRO 6000 Workstation/Server/
  Max-Q Edition + GB10) = 24 JSON files.
- Hardware-specific: Workstation Edition tuned on physical RTX PRO 6000
  Blackwell Workstation Edition; Server Edition and Max-Q Workstation
  Edition share the SM120 architecture and identical 24G/96G memory
  configs, only TGP differs, so they reuse the Workstation Edition
  tunings. GB10 (SM121) tuned separately on physical hardware.

Search space:
- Base: vllm's get_configs_compute_bound() — 1280 (BLOCK_M, BLOCK_N,
  BLOCK_K, GROUP_SIZE_M, num_warps, num_stages) combinations.
- Per-M filter: BLOCK_SIZE_M >= max(16, M/8) (capped at 64) for M>=64 —
  drops configs guaranteed to be catastrophic at large M, where
  cdiv(M, BLOCK_M) > 8 condemns the kernel to many M-loop iterations on a
  cold cache. See the filter sketch after this list.
- num_iters: 10 for M<=32, 7 for M=64..128, 5 for M>=256.
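A minimal restatement of that per-M filter (the helper name is illustrative, not the benchmark script's actual code):

```python
def _keep_config(cfg: dict, m: int) -> bool:
    # For M >= 64, drop configs whose BLOCK_SIZE_M falls below
    # max(16, M/8), with the threshold capped at 64; smaller M keeps
    # the full search space.
    if m < 64:
        return True
    return cfg["BLOCK_SIZE_M"] >= min(max(16, m // 8), 64)
```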

Why the placeholders mattered:
- Placeholder had BLOCK_M=16 for every M (since all 5 keys were copies of
  the same config). At M=256 the kernel did cdiv(256, 16) = 16 iterations
  along M; at M=512, 32 iterations.
- Observed behavior: long-prefill at M=256 took 7+ minutes per request,
  M=512 didn't return within 40 minutes. Tuned configs pick BLOCK_M=64..
  128 for these M values (2-4 M-iterations), unblocking long prefill.

Tuning wall clock:
- Workstation Edition: 57.7 min on RTX PRO 6000 Blackwell Workstation
  Edition (single GPU).
- GB10: 66.2 min on NVIDIA GB10 (single GPU).
- Shape 1 + Shape 3 (cold compiles for K=4096 and K=1024) dominated; the
  other 4 shapes each took <2 min via Triton JIT cache reuse (M/N/K are
  runtime args, so (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages)
  cache hits across (N, K) once the K-divisibility class is compiled).

Verified on the same hardware: tests/quantization/test_sm12x_tuned_config_lookup.py
still passes (it asserts shape coverage, not contents).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
The single-stream decode profile showed `_fp8_paged_mqa_logits_kernel`
at 12.61% of decode kernel time, 84.87 µs/call — #3 single hotspot
on SM120 TP=2 after T1-A. Investigation: the launch used hardcoded
`BLOCK_M=4` regardless of `num_rows = batch_size * next_n`. For the
common no-MTP single-stream decode case, num_rows=1, which means 75%
of the M-axis work (3 of 4 rows) is masked off and discarded — pure
waste of compute and memory bandwidth.

Fix: pick the smallest power-of-2 tile that still covers num_rows.
- num_rows == 1 (no-MTP decode, batch=1): BLOCK_M=1
- num_rows == 2: BLOCK_M=2
- num_rows in [3, 4] (MTP=2 batch=1, or batch=4 prefill chunks): BLOCK_M=4
- num_rows > 4: BLOCK_M=8
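A minimal sketch of that dispatch (the helper name is illustrative; the real launch site presumably inlines this):

```python
def _pick_block_m(num_rows: int) -> int:
    # Smallest power-of-2 tile that still covers num_rows, capped at 8
    # (num_rows > 4 always takes the 8-row tile).
    for block_m in (1, 2, 4):
        if num_rows <= block_m:
            return block_m
    return 8
```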

Cost: each unique block_m value compiles a separate Triton
specialization, so cudagraph capture exercises four variants instead
of one. Triton JIT cache amortises this — first warmup adds a few
seconds, subsequent loads cache-hit.

Expected impact:
- Single-stream decode (num_rows=1): 84.87 µs/call → ~25-30 µs/call
  (eliminate 3 of 4 wasted rows). At 42 calls/tok that's ~2.3 ms/tok
  TPOT improvement, ~8% throughput uplift on no-MTP single-stream.
- MTP=2 (num_rows=3 typical): BLOCK_M=4 unchanged (1 row masked, same
  as before). No regression.
- Prefill (num_rows >= 4): BLOCK_M=4 or 8 picked — covers full work.

Risk: low. Kernel logic unchanged; only the launch tile size adapts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
The candidate_block path in finish_materialized_sparse_mla_scores_with_sink
took the caller-supplied value_block_size as BLOCK_D directly. DSv4 calls
it with value_block_size=512 (matmul_sparse_mla_attention_with_sink default
for use_dot_finish=True) but the actual head_dim from combined_kv is
qk_nope+qk_rope = 128+64 = 192. With BLOCK_D=512, the kernel masks off
positions 192..511 per program — 62.5% of D-axis work discarded.

Fix: clamp block_d to the smallest power-of-2 >= head_dim from the allowed
set {64, 128, 256, 512}. For DSv4 head_dim=192 this picks BLOCK_D=256
(25% mask waste instead of 62.5%). Caller-supplied value_block_size
smaller than the target (intentional fine-grained D-axis splits) is still
respected.
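A hedged sketch of the clamp under those rules (helper name illustrative):

```python
def _clamp_block_d(head_dim: int, value_block_size: int) -> int:
    # Smallest power-of-2 >= head_dim from the allowed set, falling
    # back to 512 for head_dim > 512. A caller-supplied value smaller
    # than the target (intentional fine-grained D-axis split) wins.
    target = next((b for b in (64, 128, 256) if b >= head_dim), 512)
    return min(value_block_size, target)
```

For DSv4's head_dim=192 this returns 256 against the default value_block_size=512, matching the 62.5 % → 25 % mask-waste reduction above.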

Expected impact on the SM12x decode profile:
_finish_materialized_scores_with_sink_candidate_block_kernel time per call
drops from 17.92 µs to roughly half (less work per program, same grid
size). At 82 calls/token that's ~0.7 ms/tok TPOT savings → ~2-3%
throughput uplift on top of T1-D.

Risk: low. Kernel logic unchanged; only the per-launch BLOCK_D adapts to
the actual head_dim, falling back to 512 for head_dim > 512.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: jasl <jasl9187@hotmail.com>
The prefill warmup constant was 1024 tokens. With `max_num_batched_tokens
= 8192` (the canonical SM12x serve setting), the first real request that
prefills more than 1024 tokens in a single chunk has to JIT-compile the
dense FP8 W8A8 block-scaled GEMM at the larger M, plus the sparse-MLA
prefill kernel against a longer KV slab. T1-A's autotuned config space
makes the cold-compile cost bigger, not smaller, so any user who issues
an 8K-context first request after a fresh serve currently waits on
Triton compilation that the warmup hook is supposed to absorb.

Lift the constant to 8192. The call site already clamps via
`_clamp_warmup_tokens(requested, scheduler_config.max_num_batched_tokens)`
so schedulers running with a smaller batched-token cap naturally warm at
their cap, and configurations that lift the cap above 8192 keep this
floor (the cost of warming beyond 8192 grows fast enough that we want a
deliberate decision rather than implicit scaling).
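The clamp semantics described above reduce to roughly this (a hedged restatement; the constant name is from this commit, the helper body is assumed):

```python
_DEEPSEEK_V4_PREFILL_WARMUP_TOKENS = 8192  # lifted from 1024

def _clamp_warmup_tokens(requested: int, max_num_batched_tokens: int) -> int:
    # Schedulers with a smaller batched-token cap warm at their cap;
    # caps above 8192 still warm at the 8192 floor rather than
    # scaling implicitly.
    return min(requested, max_num_batched_tokens)
```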

Measured on 2x RTX PRO 6000 Workstation Edition (SM120, TP=2 EP,
max_num_batched_tokens=8192) with a cold random ISL=8192 OSL=512
num-prompts=4 c=1 bench against a freshly-restarted serve:

  Before: TTFT mean ≈ 17 s (cold first request dominates the average)
  After:  TTFT mean 3,172 ms, TTFT p99 3,176 ms (mean ≈ p99 — the
          cold-start variance disappears)
  Throughput: 61.16 tok/s vs 54.20 tok/s on the 020e0c8 baseline
              for the same shape (+13 %)
  Startup time: 71 s -> 80 s (+9 s one-time)

Signed-off-by: jasl <jasl9187@hotmail.com>
Three follow-on fixes on top of `5c8975591`:

1. Drop the hardcoded `_DEEPSEEK_V4_MTP_UNIFORM_DECODE_WARMUP_REQUESTS
   = (1, 2)` ceiling and append `scheduler_config.max_num_seqs` so MTP
   uniform-decode warmup also covers the largest in-flight batch the
   server will ever issue. On a Spark MTP=2 cluster with
   max_num_seqs=4 this lifts the random ISL=8,192 OSL=512 c=4 cold
   throughput from 23.67 tok/s to 42.82 tok/s (+81 %) by warming the
   `_fp8_paged_mqa_logits_kernel` adaptive `BLOCK_M=8` path that the
   (1, 2) tier missed.

2. Add a chunked-prefill warmup `_dummy_run` that sets
   `profile_seq_lens = prefill_tokens * 2` so the indexer / sparse-MLA
   builders see "this is the second chunk of a longer sequence", not
   only the freshly-arriving single chunk.

3. Add a multi-request prefill warmup `_dummy_run` (no `create_*`
   flags) so the runner splits the batched-token budget across
   `max_num_seqs` requests and exercises the multi-prefill indexer
   path that single-request prefill warmup skips.

Cost: ~+35 s startup on Spark (init engine: 61.33 s -> 96.81 s) for a
one-time JIT pass over the larger shape coverage.

Limitation: vLLM's `jit_monitor` shows nine kernels still JIT during
the first c=1 cold bench, including `eagle_prepare_next_token_padded_kernel`
and `_w8a8_triton_block_scaled_mm` at alt shapes. These kernels are
already invoked from `_run_deepseek_v4_mtp_spec_decode_warmup_kernels`,
but the synthesized warmup tensors hit a different Triton specialization
key (notably pointer 16-byte alignment) than the sampler / spec-decode
buffers used in real inference. Closing this gap requires routing
warmup through the actual scheduler / sampler pipeline rather than a
dummy_run helper, which is a larger upstream change. The harness
(`scripts/prewarm_serve.sh`, also auto-invoked by
`scripts/dgx_spark_start_mp_serve.sh`) issues real-pipeline prewarm
requests after `/health=200` to absorb the remaining cold-start cost
on the deployment side.

Signed-off-by: jasl <jasl9187@hotmail.com>
The earlier `ds4-sm120-full` PoC branch shipped two FP8 paged-MQA
logits kernels — a generic 2D-tiled one and a per-row variant
(`_fp8_paged_mqa_logits_rowwise_kernel`) tuned for long decode
contexts. During the file split that produced
`vllm/v1/attention/ops/deepseek_v4_ops/sm12x_mqa.py`, only the
2D-tiled kernel was carried over; the rowwise variant and its
dispatcher gate were dropped.

Users running ctx > 100K with MTP=2 on RTX PRO 6000 (Max-Q) report
~20% throughput regression vs the PoC branch on the
"Red-Black Tree, max_tokens=2048, thinking on" 5-run probe
(~85 tok/s here vs ~108 tok/s on `da4f1c711`). Single-stream short
contexts are unaffected because the 2D-tile work scales with
`token_count` cdiv 64 and the rowwise win comes from Q-reuse across
the full 4K-128K window per program — exactly the regime the bug
report hits.

This commit restores the rowwise kernel verbatim from `da4f1c711`
(its routing predicate is aligned with `4c9ee613d`, dropping the
`next_n == 1` constraint so MTP=2 also hits the rowwise path), and
re-introduces the dispatch in `fp8_paged_mqa_logits_triton`:

    if head_dim % 64 == 0 and num_heads % 4 == 0:
        return fp8_paged_mqa_logits_rowwise_triton(...)

DSv4-Flash (head_dim=128, num_heads=64) always satisfies both
predicates so all real serves take the rowwise path; the 2D-tiled
kernel remains as the fallback for misaligned shapes and is still
the canonical reference the rowwise kernel was validated against in
the original PoC tests.

The recently-added T1-D adaptive `BLOCK_M` (commit `959a04df5`) is
preserved in the 2D-tiled path. On DSv4-Flash it becomes dead code
in practice, but kept for portability and to keep the diff isolated
from the long-context regression fix.

Signed-off-by: jasl <jasl9187@hotmail.com>
…aming

Adds a DSv4-specific reasoning parser
(`DeepSeekV4ThinkingReasoningParser`, `DeepSeekV4ReasoningParser`) that
treats the DSML tool-call start marker `<|DSML|tool_calls>` as an
implicit end-of-reasoning when the explicit `</think>` token is absent.

Why
---
DSv4-Flash at long context (~95k-100k input tokens) occasionally fails
to emit `</think>` before opening a tool call. The existing
`DeepSeekR1ReasoningParser` keeps the parser stuck in reasoning mode in
that case: the tool-call start marker (and everything after) is
classified as reasoning, the orchestrator never advances to the tool
parser, and the caller sees a turn with reasoning but no tool call.
opencode's agent loop interprets that as "no tool to dispatch" and
exits — visually indistinguishable from "the model gave up".

Reproduced 18% of the time at 95-100k input tokens with `tool_choice`
auto and 25 tools in scope. Full repro bundle (Python script + SSE
trace + opencode forensics) lives in the harness repo.

What
----
- New module `vllm/reasoning/deepseek_v4_reasoning_parser.py` providing
  `DeepSeekV4ThinkingReasoningParser` (extends
  `DeepSeekR1ReasoningParser` with one defensive split), plus the
  dispatcher pair `DeepSeekV4ReasoningParser` and
  `DeepSeekV4ReasoningWithThinkingParser` matching the V3 shape.
- The dispatcher mirrors `DeepSeekV3ReasoningParser`: thinking-mode
  uses the V4 extension, non-thinking uses `IdentityReasoningParser`.
- Sticky `_implicit_end_seen` flag on the parser instance ensures
  `is_reasoning_end[_streaming]` returns True for every delta after the
  marker first appears, so the orchestrator state machine transitions
  correctly even when the marker spans a token boundary (see the
  sketch after this list).
- `vllm/reasoning/__init__.py` re-points the `deepseek_v4` registration
  from `DeepSeekV3ReasoningParser` to the new
  `DeepSeekV4ReasoningParser`. `deepseek_v3` is unchanged.
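A hedged sketch of the sticky implicit-end idea (class name and method shape are illustrative; the real parser extends `DeepSeekR1ReasoningParser`):

```python
DSML_TOOL_CALL_START = "<|DSML|tool_calls>"

class ImplicitEndSketch:
    """Illustrative only — not the real DeepSeekV4ThinkingReasoningParser."""

    def __init__(self) -> None:
        self._implicit_end_seen = False
        self._buffered = ""

    def is_reasoning_end_streaming(self, delta_text: str) -> bool:
        # Buffer across deltas so a marker spanning a token boundary is
        # still detected; once seen, stay True for every later delta.
        if not self._implicit_end_seen:
            self._buffered += delta_text
            if DSML_TOOL_CALL_START in self._buffered:
                self._implicit_end_seen = True
        return self._implicit_end_seen
```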

What does NOT change
--------------------
- Healthy streams (explicit `</think>`) take the same code path as
  before: the V4 parser defers to `super()` and the defensive split
  only fires when no explicit start/end token has been seen.
- The DSv32 tool parser is untouched.
- V3 reasoning parser and registration are untouched.

Tests
-----
- `tests/reasoning/test_deepseekv4_reasoning_parser.py` covers the
  registration, dispatcher selection, healthy paths (parent
  behaviour), implicit-end-marker in isolated delta, implicit-end split
  within delta, sticky behaviour after first marker, suppression when
  `<think>` is explicitly present, `is_reasoning_end` for explicit and
  implicit cases, and the parent's single-token guard.
- `tests/reasoning/test_deepseekv3_reasoning_parser.py` updated: the
  `deepseek_v4` alias now resolves to `DeepSeekV4ReasoningParser`,
  while `deepseek_v3` still resolves to `DeepSeekV3ReasoningParser`.

The fix is intentionally narrow: it addresses one well-defined failure
mode (tool call without closing reasoning). The "runaway reasoning to
length limit with no tool call" and "premature reasoning stop with no
tool call" subtypes seen at long context are model-behaviour issues,
not parser bugs, and are left for a separate follow-up.

Signed-off-by: jasl <jasl9187@hotmail.com>
…pper

Replaces the previous attempt to put the decorator directly on
``HCHeadOp.forward_cuda``: when the outer model is wrapped by
``@support_torch_compile`` (the no-MTP path on SM12x) dynamo can't
inline-bind the decorated method through ``CustomOp._forward_method``
and the worker dies with::

    torch._dynamo.exc.Unsupported: failed to bind arguments when
    attempting to inline forward_cuda

That blocks every no-MTP serve on SM12x. Move the body into a free
``_hc_head_cuda_impl`` decorated with
``@torch.compile(backend=simple_compile_backend)`` — the layout that
existed pre-upstream-vllm-project#41946 — so the method just delegates and dynamo
no longer needs to inline a decorated method. Recovers the DSv4-Flash
MTP=2 spec-acceptance gain reported in 16ee3bd (acceptance had dropped
67.6 % → 59.8 % without it) without breaking the no-MTP startup path.

``forward_hip`` is unchanged: ROCm doesn't take the same outer
``@support_torch_compile`` route, so the method-level decorator is
fine there.

Signed-off-by: jasl <jasl9187@hotmail.com>
The third ``_dummy_run`` call added in f4b3301 ("Extend DeepSeek V4
warmup coverage to multi-request shapes") synthesizes a multi-request
prefill batch and runs it through ``force_attention=True``. On SM12x
this trips an illegal memory access inside the CUTeDSL
``DequantGatherKCacheKernel``: the dummy shape exceeds the
``offset + gather_len <= M`` invariant of the kv-gather output buffer
(M is sized for the single-prefill warmup case, not for the
multi-request layout).

Reproduced with ``CUDA_LAUNCH_BLOCKING=1``::

    File ".../dequant_gather_k_cutedsl.py", line 29, in
      dequantize_and_gather_k_cache_cutedsl
    DequantGatherKCacheKernel.compile(...)(
        out, k_cache, seq_lens, gather_lens, block_table, offset)
    RuntimeError: CUDA Error: cudaErrorIllegalAddress

Drop just this third warmup call. The other two (single-prefill and
second-chunk-of-chunked-prefill) and the MTP uniform-decode coverage
from f4b3301 stay. The trade is a one-time JIT compile on the first
real multi-prefill user request for the un-pre-warmed indexer path;
the alternative is failing to start the serve at production
``--max-num-seqs`` values (e.g. 128).

A proper fix would reconcile the gather-buffer sizing for the
multi-request prefill warmup with the kernel's bounds; left for a
follow-up.

Signed-off-by: jasl <jasl9187@hotmail.com>
``FlashMLASparseMetadataBuilder.get_cudagraph_support`` and the
parallel override in ``DeepseekSparseSWAMetadataBuilder`` were guarded
on::

    getattr(kv_cache_spec, "model_version", None) == "deepseek_v4"
    and is_triton_sparse_mla_enabled_for_platform()

The first clause never holds at runtime: the spec the runtime passes to
``get_cudagraph_support`` is an outer
``UniformTypeKVCacheSpecs`` wrapper (``vllm/v1/kv_cache_interface.py``)
that only exposes ``block_size``; the per-layer
``MLAAttentionSpec.model_version`` lives one level down under
``.kv_cache_specs``. So the overrides silently fall through to
``cls._cudagraph_support = AttentionCGSupport.UNIFORM_BATCH`` and
cudagraphs are captured normally — confirmed by instrumenting the call
and by an ``--enforce-eager`` probe (FULL cudagraphs give a ~2.4×
decode throughput speedup on 2× RTX PRO 6000 at ISL=2048 OSL=2048
c=16).

Cudagraph capture is also fine for the MTP=2 path on this stack — the
spec-decode acceptance and TPOT match the no-MTP measurement to within
mt-bench noise (66.10 % acceptance, length 2.32).

Since the override was both dead code and would *reduce* performance
if "fixed" to actually fire, drop it. The default ``UNIFORM_BATCH``
support level on both builders already does the right thing.

Signed-off-by: jasl <jasl9187@hotmail.com>
Defensive ``.contiguous()`` on ``decode_metadata.seq_lens[:batch_size]``.
On an already-contiguous slice this is a no-op pointer return; on a
non-contiguous 2D slice (max_decode_len < next_n under V2 model
runner cudagraph capture) it materializes a contiguous copy that
satisfies ``persistent_topk`` and the FP8 MQA paged-logits kernels.

Reported by @aabbccddwasd in PR vllm-project#41834 (comment 4450901180) as a
crash workaround on their 4× RTX PRO 6000 TP=4 setup; cost is zero
on the path we currently exercise (already contiguous).

Signed-off-by: jasl <jasl9187@hotmail.com>
``_deepseek_v4_sm12x_fp8_einsum_kernel`` was launched with hardcoded
``num_warps=4 num_stages=3``; ``_fused_indexer_q_rope_quant_kernel``
was launched with ``num_warps=1`` (with a "TODO: Tune this"
inline). Replace both with ``@triton.autotune`` so the best
warp/stage config is picked per shape:

- fp8_einsum: configs over ``{(4,3), (8,3), (4,2), (8,2)}`` keyed on
  ``(num_tokens, num_groups, out_rank, hidden_size)``.
- fused_indexer_q: configs over ``num_warps={1,2,4}`` keyed on
  ``(INDEX_Q_HALF_ROT_DIM, INDEX_Q_HEAD_DIM)``.

Both kernels are launched per forward, so autotune fires once per
unique key and the cached selection is reused on subsequent calls.
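Roughly the decorator shape being described — a sketch only; the kernel body here is a trivial placeholder, not the real einsum:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({}, num_warps=w, num_stages=s)
        for (w, s) in ((4, 3), (8, 3), (4, 2), (8, 2))
    ],
    key=["num_tokens", "num_groups", "out_rank", "hidden_size"],
)
@triton.jit
def _fp8_einsum_sketch(out_ptr, num_tokens, num_groups, out_rank,
                       hidden_size, BLOCK: tl.constexpr):
    # Placeholder body: zero-fill one BLOCK of output per program.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    tl.store(out_ptr + offs, tl.zeros([BLOCK], dtype=tl.float32),
             mask=offs < num_tokens)
```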

Reported by @aabbccddwasd in PR vllm-project#41834 (comment 4450901180).

Signed-off-by: jasl <jasl9187@hotmail.com>
Add ``@triton.autotune`` with ``(num_warps, num_stages) ∈ {4,8} × {2,3}``
to the three single-head prefill accumulate kernels in
``sparse_mla_kernels.py``::

- ``_accumulate_indexed_attention_chunk_kernel``
- ``_accumulate_fp8ds_global_slots_attention_chunk_kernel``
- ``_accumulate_fp8ds_paged_attention_chunk_kernel``

Each was previously launched with hardcoded ``num_warps=8``; the new
configs explore ``{4,8}`` × ``{2,3}`` keyed on ``num_candidates``
(the dominant per-shape factor). Autotune fires once per
``num_candidates`` value seen at runtime and the chosen config is
cached for subsequent calls.

The two multihead variants (``..._multihead_kernel``) are NOT
autotuned in this commit: they share the same accumulate-read-write
pattern but per @aabbccddwasd's note (PR vllm-project#41834 comment 4450901180)
need a separate ``num_tokens: tl.constexpr`` + ``reset_to_zero``
treatment for autotune correctness, which we'll add in a follow-up
once we've validated the single-head gain on this hardware.

Reported by @aabbccddwasd in PR vllm-project#41834 (comment 4450901180);
claimed ~+39 % prefill on 4× RTX PRO 6000 TP=4 32K ctx, with the PR
base already having a higher baseline so absolute gain is smaller.

Signed-off-by: jasl <jasl9187@hotmail.com>
Tuned via ``scripts/_fp8_block_tune_driver.py`` for the three remaining
DSv4-Flash dense linear shapes the workstation hits at TP=2 but didn't
yet have ship-tuned configs for:

- N=4096, K=2048  (q_b / gate projection)
- N=1024, K=4096  (wq_b projection)
- N=4096, K=512   (wo_b projection)

Suggested by @aabbccddwasd in PR vllm-project#41834 (comment 4450901180). Tuned
on the local 2× RTX PRO 6000 Blackwell Workstation Edition host with
the same wrapper that produced the existing six configs in this
directory; lookup is device-name keyed so no code changes required.

These complement the existing six WS-edition configs (N,K) ∈
{(1536, 4096), (2048, 4096), (4096, 1024), (4096, 4096), (8192, 1024),
(16384, 1024)} so DSv4-Flash now hits a tuned config for every dense
linear shape it issues, instead of falling back to the default
heuristic for the three shapes above.

Signed-off-by: jasl <jasl9187@hotmail.com>
…nges)

Cudagraph-safe retry of suggestion #2 from PR vllm-project#41834 comment 4450901180.
Previous attempt (e34daef, reverted) also exposed
``c128a_*_effective_topk`` on the metadata and truncated the buffer
slice inside ``deepseek_v4_attention``; that truncation baked the
shape into the captured forward launch, breaking replay when
``effective_topk`` shifted between capture and replay.

This version only touches the metadata builder (which already runs
*outside* the captured forward), so per-call ``effective_topk``
variation is fine:

1. Pre-fill ``global_decode_buffer[:num_decode_tokens]`` and
   ``prefill_buffer[:num_prefill_tokens]`` with ``-1`` before launch.
2. Compute ``effective_topk_arg = cdiv(max num_compressed across
   in-flight tokens, BLOCK_SIZE) * BLOCK_SIZE``, capped at
   ``max_compressed_tokens``.
3. Kernel inner loop uses ``effective_topk`` (was
   ``max_compressed_tokens``); store mask uses the same.

The buffer entries the kernel skips
(``[effective_topk, max_compressed_tokens)``) stay at ``-1`` from the
pre-fill, so downstream sparse MLA accumulate kernels (which still
iterate the full ``max_compressed_tokens`` width inside the cudagraph)
see only ``-1`` sentinels in the tail and short-circuit them via
``kv_index >= 0`` / ``candidate < valid_len`` checks. No tensor shape
changes inside the captured forward → cudagraph capture/replay
remains correct.
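Steps 1-2 reduce to roughly this (hedged; the helper name is illustrative):

```python
import triton

def _effective_topk_arg(max_num_compressed: int, block_size: int,
                        max_compressed_tokens: int) -> int:
    # Round the max in-flight compressed length up to a BLOCK_SIZE
    # multiple, capped at the buffer width; entries past this stay at
    # the -1 pre-fill and are short-circuited downstream.
    rounded = triton.cdiv(max_num_compressed, block_size) * block_size
    return min(rounded, max_compressed_tokens)
```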

Savings here are limited to the metadata kernel itself; the accumulate
kernels' iteration count is unchanged (their loop bound is the captured
``num_candidates`` shape value, which we deliberately do not narrow).
Bench at long ``max_model_len`` will confirm whether this is enough to
recover a meaningful chunk of the ~27 % TPOT regression observed at
``max_model_len=131072`` vs ``8192``.

Signed-off-by: jasl <jasl9187@hotmail.com>
…date loop

Redesigned suggestion #3 from PR vllm-project#41834 comment 4450901180. The first
attempt (e34daef, reverted; later 72a5ff2, also reverted) tried to
truncate ``topk_indices.shape[1]`` in Python so the captured launches
iterated a narrower combined slice; that approach broke under cudagraph
replay (shape baked at capture) and *also* mis-bounded — the combine
kernel writes each token's combined buffer as ``[topk_len_t |
swa_len_t | -1 padding]`` with SWA *immediately* following the
per-token topk, so a fixed ``effective_topk`` cap cuts off the SWA
portion (GSM8K dropped 25 pp on the prior attempt).

The kernel already loads the per-token combined length
(``valid_len = tl.load(lens_ptr + token_idx)`` for the four ``lens``-
gated kernels, ``gather_len`` for the two paged kernels). The existing
``is_valid`` guard only short-circuits the *heavy* work past that
length; the outer ``for candidate_idx in range(0, num_candidates)``
still pays one ``tl.load`` + branch per iter on the dead tail.
Capping the loop at
``min(num_candidates, valid_len - candidate_offset)`` (clamped to 0)
removes those wasted iterations while preserving the existing
``is_valid`` semantics: the iterations we now skip are exactly those
the existing guard already discarded.
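In kernel form the cap looks roughly like this — a standalone sketch, not the PR's kernels; the heavy accumulate body is reduced to a trivial stand-in:

```python
import triton
import triton.language as tl

@triton.jit
def _capped_candidate_loop_sketch(lens_ptr, out_ptr, candidate_offset,
                                  num_candidates: tl.constexpr):
    # One program per token: bound the candidate loop by the per-token
    # valid length instead of always iterating num_candidates.
    token_idx = tl.program_id(0)
    valid_len = tl.load(lens_ptr + token_idx)
    local_eff = tl.minimum(num_candidates,
                           tl.maximum(valid_len - candidate_offset, 0))
    acc = 0.0
    for _ in range(0, local_eff):
        acc += 1.0  # stand-in for the heavy accumulate work
    tl.store(out_ptr + token_idx, acc)
```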

Applied to six accumulate kernels in ``sparse_mla_kernels.py``:

- ``_accumulate_gathered_attention_chunk_kernel``
- ``_accumulate_indexed_attention_chunk_kernel``                 [autotuned in #1]
- ``_accumulate_fp8ds_global_slots_attention_chunk_kernel``      [autotuned in #1]
- ``_accumulate_fp8ds_global_slots_attention_chunk_multihead_kernel``  [decode]
- ``_accumulate_fp8ds_paged_attention_chunk_kernel``             [autotuned in #1]
- ``_accumulate_fp8ds_paged_attention_chunk_multihead_kernel``   [decode]

CUDA-graph safety: ``lens_ptr`` / ``gather_lens_ptr`` are stable
addresses; their values are refreshed per call by the metadata builder
(outside the captured forward) and by ``combine_topk_swa_indices``
(inside the forward but writing only into the persistent buffers the
accumulate kernels read from). The kernel inner-loop bound is a
runtime-loaded scalar — Triton compiles a dynamic loop and the captured
launch picks up the current value on each replay.

Savings scale with ``combined_topk_buffer_width - actual valid length``
(i.e. mostly visible at long ``max_model_len`` with shorter actual
contexts). At our test shape (``max_model_len=131072``, ISL=2048) the
saved iterations come mostly from the decode multihead path; expected
to be neutral / no-regression at short ``max_model_len`` where the
bound equals ``num_candidates``.

Signed-off-by: jasl <jasl9187@hotmail.com>
Three pure comment/docstring fixes from the audit, no behavior change:

1. ``_build_c128a_topk_metadata_kernel`` comment was ambiguous about
   ``max_compressed_tokens`` after the parameter was renamed to
   ``effective_topk`` in 304944e. Reword to explicitly point at the
   Python caller (``build_c128a_topk_metadata``) and explain that
   ``max_compressed_tokens`` is the buffer column width and entries
   past ``effective_topk`` stay at ``-1`` via the caller's ``fill_(-1)``
   pre-pass.
2. Add an inline note next to ``positions.max().item()`` flagging it as
   a host sync that is safe here because the builder runs outside the
   captured forward.
3. Expand ``MLAAttentionManager`` class docstring: the predicate
   ``_should_protect_prompt_blocks`` triggers on three independent
   conditions (DSv4 model_version, fp8_ds_mla cache_dtype_str, or
   compress_ratio > 1), not just DSv4. Document the three conditions
   inline so a future tightening pass does not accidentally narrow the
   coverage.

Signed-off-by: jasl <jasl9187@hotmail.com>
…kip_weight_name

Two refactors from the audit, no behavior change:

1. ``vllm/v1/attention/ops/deepseek_v4_ops/fp8_einsum.py`` had its own
   copy of ``_upcast_e8m0_to_fp32`` (4 lines, identical to the canonical
   helper at ``vllm/model_executor/layers/quantization/utils/fp8_utils.py:1017``).
   Other peer call sites (cutlass.py, rocm_aiter_mla_sparse.py, mxfp4.py)
   already import from ``fp8_utils``; do the same here.

2. ``DeepseekV4ForCausalLM.skip_weight_name_before_load`` used
   ``hf_to_vllm_mapper.apply_list([name])`` to map a single name. That
   builds a one-element list and routes through a list-comprehension
   that filters ``None``. Use the canonical 1-to-1 helper
   ``WeightsMapper._map_name`` directly, matching the pattern used in
   ``compressed_tensors.py``, ``adapters.py``, ``bitsandbytes_loader.py``,
   and ``lora/utils.py``. Same semantics, 3 lines instead of 5.

Signed-off-by: jasl <jasl9187@hotmail.com>
…te kernels

After ``a94e7c289 sm12x: per-token early-loop-exit on sparse MLA
accumulate inner candidate loop`` capped each inner loop at
``local_eff = min(num_candidates, max(valid_len - candidate_offset, 0))``
(or the ``gather_len`` equivalent for the paged kernels), the per-iter
check ``(candidate_offset + candidate_idx) < valid_len`` /
``gather_idx < gather_len`` became structurally always true: by
construction every iteration's index sits inside the valid range.

This commit drops the tautological term in 7 sparse MLA accumulate
kernels and leaves the remaining cell-sentinel guard:

- ``accumulate_..._gathered_chunk`` (was: ``(...) < valid_len`` then
  AND with ``slot_id >= 0`` when ``HAS_SLOT_IDS``): now just
  ``is_valid = slot_id >= 0`` (or ``True`` when ``HAS_SLOT_IDS`` is
  false). The branch on ``HAS_SLOT_IDS`` becomes a ``tl.constexpr``
  binary, which Triton compiles into two clean specialisations.
- ``accumulate_..._indexed_chunk``: ``is_valid = kv_index >= 0``.
- ``accumulate_fp8ds_global_slots_sparse_mla_attention_chunk{,_multihead}``:
  ``is_valid = slot_id >= 0``.
- ``accumulate_fp8ds_paged_sparse_mla_attention_chunk{,_multihead,
  _multihead_with_sink}``: there is no per-cell sentinel here, so the
  whole ``is_valid`` variable and ``if is_valid:`` guard go away and
  the loop body becomes unconditional.

Each touched site gains a 2-3 line comment explaining the invariant so
a future reader can see why no per-iter clamp is needed. No behavioral
change: Triton was already eliminating the tautology after the SSA
pass; this commit makes the intent explicit at the source level.

Signed-off-by: jasl <jasl9187@hotmail.com>
PR vllm-project#42258 introduced SlidingWindowManager._cache_block_mask() to skip
caching SWA blocks that can never serve a prefix-cache hit.  When
Eagle/MTP speculative decoding is active the mask is too aggressive —
it skips blocks that eagle's modified lookup actually needs, resulting
in 0% prefix cache hit rate.

Eagle changes the SWA hit logic in two ways:
  1. sliding_window_contiguous_blocks += 1 (needs one extra block)
  2. post_pop_blocks = i (instead of i+1), shifting alignment

Fix: detect SWA managers inside eagle attention groups at coordinator
init time and disable the cache block mask for them.

Signed-off-by: Alex Bilichenko <abilichenko@gmail.com>
(cherry picked from commit b90c495)
Signed-off-by: jasl <jasl9187@hotmail.com>
Two prefill performance fixes for SM12x DeepSeek V4:

1. Add _accumulate_indexed_attention_chunk_multihead_kernel (HEAD_BLOCK=8)
   that loads KV once per candidate and reuses across 8 heads, reducing
   L2 traffic in the prefill accumulate phase. Same pattern as the
   existing decode _finish_materialized_scores_with_sink_kernel.

   Prefill throughput on 2× RTX PRO 6000 WS, TP=2, MTP=2:
   - 1K tokens: +49% (2,746 → 4,100 tok/s)
   - 4.5K tokens: +37% (3,122 → 4,271 tok/s)
   - 18K tokens: +36% (2,474 → 3,360 tok/s)
   - 64K tokens: +28% (1,679 → 2,146 tok/s)

   Tuned config: HEAD_BLOCK=8, num_warps=4, num_stages=2. Benchmarked
   against HEAD_BLOCK=4 and num_warps=8 variants — HEAD_BLOCK=8 with
   num_warps=4 wins at all sizes.

2. Drop @triton.autotune from _deepseek_v4_sm12x_fp8_einsum_kernel and
   pin num_warps=4, num_stages=3. The autotune key included num_tokens
   which varies per request, causing ~200 unique keys with zero cache
   hits — re-benchmarking 4 configs at ~1s each on every request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 9c2e7ca)
Signed-off-by: jasl <jasl9187@hotmail.com>
Run dequantize_and_gather_k_cache for the compressed + SWA caches on
aux_stream[1] while the indexer forward runs on aux_stream[0], so the
gather is hidden behind the indexer instead of serialising before
_forward_prefill.

The workspace allocation at the wrapper places kv_workspace at offset 0
of the same per-ubatch workspace buffer that _forward_prefill would
otherwise allocate kv from; _reserve_prefill_workspace during warmup
already grew the buffer to fit the full prefill spec list, so the
kv-only request cannot trigger a resize that orphans kv_workspace
mid-forward. A gather_done_event joins the aux stream back before
mla_attn runs.

CUDA-graph-safe: both aux streams join (event.wait()) before the
attention boundary, and the gather is gated on num_prefills > 0 and
num_prefills <= PREFILL_CHUNK_SIZE (single-chunk only). Multi-chunk
prefill or non-C128A paths fall through to the existing per-chunk
gather in _forward_prefill.
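The pattern, stripped to a hedged torch-level sketch — the two work functions are stand-ins for dequantize_and_gather_k_cache and the indexer forward, and the stream wiring is illustrative:

```python
import torch

def gather_kv(buf: torch.Tensor) -> None:
    buf.mul_(2.0)          # stand-in for the KV gather

def run_indexer(x: torch.Tensor) -> torch.Tensor:
    return x @ x.T         # stand-in for indexer.forward

aux0, aux1 = torch.cuda.Stream(), torch.cuda.Stream()
gather_done = torch.cuda.Event()

kv_workspace = torch.randn(1024, 64, device="cuda")
x = torch.randn(256, 256, device="cuda")
torch.cuda.synchronize()   # inputs ready before forking streams

with torch.cuda.stream(aux1):
    gather_kv(kv_workspace)        # gather hidden behind the indexer
    gather_done.record()
with torch.cuda.stream(aux0):
    scores = run_indexer(x)        # indexer on the other aux stream

# Join the gather back before the attention boundary consumes the KV.
torch.cuda.current_stream().wait_event(gather_done)
torch.cuda.current_stream().wait_stream(aux0)
```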

Original implementation by aabbccddwasd in their dsv4-sm120-opt-v2
branch (commit 6ff395e). This re-applies only the gather-overlap
half of that commit; the multi-head prefill kernel half is dropped
because the canonical tip already has alex's HEAD_BLOCK=8 version
(671958e / vllm-project#41834 PR #6) which was empirically
tuned for this hardware.

Signed-off-by: jasl <jasl9187@hotmail.com>
@arthur-drozdov commented

Some initial benchmark results on a dual DGX Spark using recipe eugr/spark-vllm-docker#219 — potentially issues with concurrency at large context? Not sure whether the same issue appears on other hardware.

llama-benchy completed 84 CSV rows: depths 0, 4K, 8K, 16K, and 32K across c1/c2/c5/c10, plus 65K across c1/c2/c5. The 65K/c10 leg was stopped after repeated client-side connection resets under capacity pressure. At that point vLLM was still up, request abort/error counters were zero, and metrics showed 3 running / 7 waiting by capacity with KV cache around 96%.

Decode throughput, tg128, total tok/s:

| depth | c1 | c2 | c5 | c10 |
|---|---|---|---|---|
| 0 | 37.32 | 54.16 | 43.64 | 43.09 |
| 4096 | 27.19 | 40.00 | 25.90 | 27.23 |
| 8192 | 35.21 | 30.82 | 26.72 | 20.52 |
| 16384 | 37.54 | 27.55 | 16.63 | 13.52 |
| 32768 | 33.51 | 19.01 | 12.64 | 2.25 |
| 65535 | 29.50 | 1.11 | 0.74 | - |

Prompt throughput, pp2048, total tok/s:

| depth | c1 | c2 | c5 | c10 |
|---|---|---|---|---|
| 0 | 1097.6 | 977.9 | 1044.3 | 975.4 |
| 4096 | 738.9 | 598.3 | 571.3 | 552.6 |
| 8192 | 613.4 | 548.1 | 460.7 | 411.1 |
| 16384 | 471.1 | 467.9 | 307.7 | 280.2 |
| 32768 | 318.4 | 322.6 | 222.6 | 39.4 |
| 65535 | 176.0 | 17.2 | 11.9 | - |

The useful interactive envelope from this run looks good through 32K at lower concurrency and becomes hard capacity-bound at 65K with concurrent traffic. That matches the earlier-observed streaming pauses: the process is alive, but long-context concurrent requests can sit behind KV/scheduler capacity rather than failing immediately.
