Genesis vLLM Patches v7.64 released — please test #19
Replies: 49 comments 9 replies
-
|
Just upgraded our pin to v7.64 today and started cross-rig validation on RTX 3090 single-card (vLLM Anchor drift status on our pin:
Open issue v7.64 doesn't address — Pin question: all of v7.64's empirical validation was on Backport priority feedback for v7.65:
Also strongly +1 on the Cliff 8 hardening ( Will share apples-to-apples 27B+TQ k8v4 numbers after the current bench finishes. ⭐ given. |
-
|
Hey @noonghunna — thanks for the detailed boot report, this is exactly the kind of cross-rig validation I was hoping for. Quick AI-translated notes from Odessa (apologies for any English roughness). On the pin question — when I write a vLLM SHA in the README, that means I've actually rebuilt at that SHA, re-run the validator on it, re-baked the patcher under it, and re-validated the reproducer (35B FP8 tool-call + 27B Lorbus). It's not a "we should be on this someday" — it's "this is what I'm running right now, and it works". Update cadence is value-vs-regression, not a calendar. Concrete examples for context:
So when you see On PN25 (forward_native inductor bypass) — you're absolutely right that PN12 leaks past the compile path. Genesis stack has the same flaw; we don't hit it in PROD only because our 27B Lorbus + cudagraph FULL_AND_PIECEWISE config short-circuits the inductor pipeline on this kernel. Future inductor-default configs would expose it. Just landed PN25 on On PN19 ≠ H100 ergonomics — agreed, will flag in CLIFFS.md that the 200-500 MiB win is H100-specific. We saw similar non-transfer on P104 L2 persistence (regressed -16.2% on 32+ layer KV >> L2 setups). Generic allocator hints don't survive class jumps. On Cliff 8 hardening — On backports — your priorities track mine:
Numbers from your apples-to-apples 27B+TQ k8v4 bench will be very useful — esp. if 3090 lands within 5% of our A5000 reference. Hard data on that gap is one of the things I haven't been able to gather on my own rig. ⭐ much appreciated, and thanks for keeping the cross-rig pipeline honest. |
-
|
Thanks for the pin clarity — clear gate criteria (rebuilt + validator + reproducer + tools-API regression check) is exactly what makes the SHA actionable rather than aspirational. Will track your README for future pin moves. v0.20 result on our 3090 / Qwen3.6-27B + TQ3 + MTP K=3 config — different outcome from your A5000 fleet. Boot was clean (all v7.64 patches apply natively, including PN25 sister-pair), but engine crashed during MTP draft proposal at long prefill: Stack: Root-caused to vllm#39226 — the strict Likely a config-shape mismatch between our setups (yours probably hits TQ decode during profile_run via different draft/cache geometry, ours doesn't — guessing TP=1 + INT4 group_size=128 Marlin + MTP K=3 vs A5000 TP=N + Lorbus). Tracking on a separate On PN25 + P38 — both extremely relevant for our cliff investigation. Detailed reply on club-3090#16 but short version: PN25 is independent convergence on the same fix we just shipped locally; we'll plumb your Will also surface the H100-vs-Ampere PN19 footnote in our CLIFFS.md so users on consumer 3090 don't re-discover the negative. ⭐ — and "Speed without correctness is a regression" should be the tagline for this whole project tbh. |
-
|
27B + TQ k8v4 dual-3090 bench + CONFIGS.md feedback (closing out the asks from your v7.64 ship post).

1. TQ k8v4 dual-3090 bench

Our compose: 2× RTX 3090 24 GB PCIe (no NVLink), TP=2, AutoRound INT4, vLLM
Comparing to your A5000 reference (
Most likely contributors to the gap:
Side-by-side memory profile (might be useful for triage): at 0.90 mem-util we sit at 20.4 GB/24 GB per card, leaving ~3.5 GB headroom on each, plenty of activation room. So the gap isn't OOM pressure; it's pure throughput.

Want me to A/B with the full env-var set? Happy to run if useful; it would isolate whether the 13% is env-vars or pin/hardware.

2. CONFIGS.md walkthrough feedback

Walked through the doc end-to-end while the bench booted. Strong overall: quick decision tree at the top, "5 things to write down" before editing, per-bucket patch lists with 1-line "what does it do" each, and the worked Llama-3 70B example at the end ("generic patches work outside Qwen") all hit the target.

Friction points that surfaced when I tried to mentally re-execute it for our 27B/3090/Docker setup:
Smallest single-change with biggest impact: fix script naming (#1). That's the first thing every new operator runs into and the doc directly disagrees with Both items closed out. Backport priorities + ⭐ already in our previous reply. P38 silent-no-op trace filed separately as genesis-vllm-patches#14. |
-
|
Hey @noonghunna, thanks for the dual-3090 bench data + CONFIGS feedback. Gonna address both halves carefully so the facts stay grounded.

1. The 13% gap on TQ k8v4

Your gut is right that the env-var subset is the dominant contributor. Yes please run the A/B with the full set; this is the cleanest data point we can get for the doc.

One factual correction first: P82 is actually OFF in our 27B PROD launch script, not on. Verifiable in The patches that are PROD-on for the 27B TQ k8v4 path (verified from current launch script): P82 stays OFF; P78 stays OFF. (source-of-truth file, copy from line 36–53.)

On your three contributing factors:
So at full env-var set + same pin: expected ~5-8% residual gap at most, closer to your hardware-only floor.

2. CONFIGS.md feedback: all 8 friction points addressed

Every one was actionable. Pushed fixes to dev in

#1 Script naming mismatch. Fixed. Updated table to reference real files (

#2 Docker compose path invisible. Added new Step 2b (Docker compose mirror) with worked compose snippet (~25 env vars from

#3 TQ k8v4 deps scattered + P4 has no description. P4 description was actually present at line 243 ("P4 - required, removes hybrid TQ rejection"), but you're right it was hard to find. Added two consolidated copy-paste blocks: "Required for boot" (P4 + P67 + P98 + P101 + PN8) and "Recommended PROD additions" (~25 env vars). The required block is now the first thing readers see in the TQ k8v4 section.

#4 API key repo-baked. Added explicit fallback note in Step 5's smoke test: "If you launched without

#5

#6 Spec-decode trio without gating signals. Step 3's spec-decode block now back-links Step 1 §4 for each method capability check (ngram always works / MTP needs

#7 DFlash on hybrid models: PR #40898 caveat. Added 1-line caveat next to the DFlash spec-decode option pointing at vllm#40898 OPEN status + Genesis PN21 partial backport state. ~25% acceptance-length gap acknowledged until upstream merges.

#8 Step 7: submit-back format spec. Step 7 already links

Smallest-impact thanks for the prioritization: script naming was indeed the biggest first-touch friction and got fixed first.

3. Heads-up: your issues #14 + #15 are landed on
|
-
|
A/B bench results — full env-var set on TQ k8v4 dual-3090. ⭐ Following your reply: bumped Genesis pin to dev tip ( Bench (n=5 measured, 3 warm, scripts/bench.sh canonical narrative + code prompts):
Headline: code wall_TPS 116.59 on dual 3090, +30.7% over your A5000 89.23 reference. Narrative 92.12 lands at the bottom edge of your 95-100 target band (likely the bench prompt class — narrative has more variable acceptance length, code is repetitive enough for MTP to amortize hard). Variance analysis on the +50% jump (full vs subset): The patches that were absent from our earlier subset and present now:
If you want a per-patch ablation, I can run a few targeted A/Bs (e.g. Notable: PN26b's "first sparse-V kernel deployed for SM86 (Ampere consumer)" log line is correct — Ampere consumer users now have a path to sparse-V tile-skip that doesn't exist anywhere upstream. That alone is a sizable contribution for the SM86 fleet beyond just our rig. P38B + P15B both apply cleanly on our v0.20-blocked config → boot clean, sustained workload clean, no observable regression vs the pre-fix state. Pin migration plan unblocked. With v7.65 carrying P38B + P15B + PN26b + PN25 + Cliff 8 hardening + P68/P69 threshold default, master can move to v0.20.1rc1.dev16 + Genesis v7.65 in one PR. Holding for v7.65 release tag — happy to test against any RC you cut. CONFIGS.md fixes look great — pulled Update for the bare-metal launch header would be |
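The headline percentage in the bench comment above follows directly from the two wall_TPS figures it quotes. A minimal sketch of that arithmetic; the helper name `relative_gain_pct` is illustrative and not part of any Genesis or club-3090 bench script:

```python
def relative_gain_pct(measured: float, reference: float) -> float:
    """Return the percentage gain of `measured` over `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (measured / reference - 1.0) * 100.0


# Figures quoted in the comment above (code-prompt wall_TPS).
dual_3090_code_tps = 116.59
a5000_reference_tps = 89.23

gain = relative_gain_pct(dual_3090_code_tps, a5000_reference_tps)
print(f"{gain:+.1f}%")  # +30.7%
```

The same helper can be pointed at the narrative figure (92.12) and either edge of the 95-100 target band to see how far below the band it lands.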
-
|
Thank you for the tests and data. They are extremely important to me and help make the project better and of higher quality. I haven't finished with the new PN26b sparse-V kernel yet; in fact, I've been working on it for the last 6 hours while also fixing bugs. I read all the comments and use a bot to track all repository activity, so I instantly see bug reports and suggestions from the community. I try to implement them as long as they don't distract too much from the project's main direction. For the next 2-3 days, I don't plan on pushing anything to the main branch. Everything will go to dev for now. Once both you and I are confident that everything is solid, I will merge the updates into main. This is how our workflow will operate moving forward: dev: for testing and new features main: for stable releases I apologize that I don't always reply. There's just not enough time for everything, so I dedicate most of it to the project and other personal priorities. If my lack of responses comes across as rude at times, please forgive me—that is absolutely not my intention. Have a great time of day, everyone (whether it's morning, afternoon, or late evening like it is for me right now). I try to hear everyone out, though it isn't always possible since we all have different perspectives on certain things. But that doesn't stop us from doing good and creating something valuable for all of us. Wishing everyone peace and a clear sky! |
-
|
Hey @noonghunna and everyone — substantial update. Pushed v7.66 to dev (commits 1304c56..fc89395) and live-validated on 4 model configs on our 2× A5000 rig. Boot-tested all of them end-to-end against the actual vllm install, not just unit tests — sanity check after caught two real bugs that pytest missed. The patches, by statusPN33 — root-cause spec-decode warmup fix (DEFAULT ON) Backport of vllm-project/vllm#37521 (itailang) but EXTENDED beyond its Default ON when spec-decode is active. Disable via Live-verified: PN33 marker present in patched PN25 + P7b v7.66 — direct_register_custom_op refactor Switched Live-verified: PN25 v7.67 — REJECTED on live test Tried Stack showed Dynamo tracing INTO SGLang's working PN32 — audit only, no code change Confirmed Live-validation matrix
All 4 configs: PN33 patch APPLY (verified in live The 27B INT4 + DFlash drafter result (129.3 TPS on 2× A5000) lines up well against your published 78 narr / 128 code TPS on 2× 3090 — same drafter recipe, similar consumer Ampere. Known sharp edges
What I'd love help with
Honest note: stepping back for a few days

Pretty wrung out from the last week. Going to read what people post but reply windows might be slower for 2-3 days. Keep the data coming whenever it shows up; every bench result and config detail matters. Your dual-3090 wall_TPS 116.59 number (+30.7% over A5000 reference) is exactly the kind of validation that justifies the effort. Thanks for the patient bug-hunting. Wishing peace and a clear sky to everyone.

- Sander, Ukraine, Odessa |
-
|
@noonghunna and the @ChatGPT/Codex CLI team — thank you. Big update. Pulled all three of your v7.66 cross-rig findings into Genesis directly as v7.68 (commit ab3f5ce on dev). Boot-validated on our 27B INT4 + TQ k8v4 + MTP K=3 + TP=2 PROD; ready for your 1×3090 + TP=1 retest whenever you have a window. What landed in Genesis directlyPN30 v7.68 — dst-shaped temp (your Your diagnosis was correct end-to-end. v7.65 Ported your fix as PN30 part3 patching Plus part1 (the old compact path) is now fail-closed RuntimeError so if anything ever reaches it we crash explicitly rather than silently corrupt. Reuses your existing PN25 v7.68 — import-time registration (your Your insight about activation.py module-import timing is the key — vLLM imports activation.py during model construction in each spawned worker, BEFORE profile_run enters aot_compile_fullgraph. So registration runs in eager Python, never inside a Dynamo trace. v7.66's Ported your
Same pattern extended preventively to P7b ( PN34 (NEW) — runtime workspace lock relaxation (your Your Default OFF (it's relaxing a strict-debug assertion, so explicit opt-in via Why I missed all three the first time — honest answerI tested patch application (does the text-patch land cleanly), not patch correctness against the bug-triggering workload. Specifically:
The system fix is on me: I'm setting up another rig with one consumer card next week to actually run your reproducers locally. No more arguing for workarounds when the right answer is to test against the actual bug surface.

Server validation (post-backport)

27B INT4 + TQ k8v4 + MTP K=3 + TP=2:
What I'd love help with next

When you have a window:
If anything regresses I'd rather hear about it within a day than have you carry sidecars forever.

On the next-week test rig

Setting up a second box with a single A5000 (24 GB, SM86 Ampere consumer: same memory budget + same compute capability as the 3090 you're testing on) to actually run your reproducers locally instead of asking you to do all the cross-rig validation. Specifically:
Not a substitute for your cross-rig data on the 3090s themselves, but should mean fewer "works on TP=2, breaks on TP=1" round-trips through your bug filings — A5000 single-card hits the same TP=1 spawn config + 24 GB activation budget that triggered all three of the bugs you found. — Sander, Ukraine, Odessa |
-
|
Quick follow-up — ran two static-analysis audits today (Gemini + ChatGPT/Codex CLI) on the genesis-vllm-patches tree to catch latent issues that pytest + live-boot couldn't. Closing the loop on what they surfaced. Real bugs caught + fixedG-001 (Codex, Critical) — G-002 (Codex, High) — G-003 + G-004 (Codex, High×2) — G-006 (Codex, Medium) — G-007 (Codex, Medium) — G-008 (Codex, Medium) — 7 env-var references in PATCHES.md / INSTALL.md didn't match the actual P103 latent NameError (separate Gemini audit) — Plus cleanup passThe same audits flagged G-005 (streaming docs lying about SSE replay), G-009 (PATCHES.md P72 row truncated), G-010 (rig-specific paths in scripts — partially closed with env-var override + README rationale), G-011 ( Numbers
Honest note

The two latent bugs that hurt most (G-001 conservative apply override; P103 silent Cliff 2 skip) are exactly the class that pytest + live-boot don't catch: boot doesn't trigger the rare exception path, and PROD continuous batching never crosses the chunked-prefill threshold. Static analysis (ruff F821, name resolution) found them in 30 seconds. Going to bake static analysis into a pre-commit hook so this is automated going forward.

Stepping away for the rest of today; eyes are tired. Will read what comes in but reply windows likely tomorrow. Whatever cross-rig data you turn around on PN30 v7.68 / PN25 v7.68 / PN34 will be valuable whenever it lands.

- Sander, Ukraine, Odessa |
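One way the pre-commit idea mentioned above could look, as a rough sketch only. It assumes `ruff` is installed and on `PATH`; the function names `build_ruff_cmd` and `run_hook` are hypothetical and do not come from the Genesis repository. It restricts ruff to the undefined-name rule (F821) that the comment credits with finding the latent NameError class:

```python
import subprocess
from typing import List, Sequence


def build_ruff_cmd(paths: Sequence[str]) -> List[str]:
    """Build a ruff invocation that reports only undefined names (F821)."""
    return ["ruff", "check", "--select", "F821", *paths]


def run_hook(paths: Sequence[str]) -> int:
    """Run ruff on the given files; a non-zero result should block the commit."""
    if not paths:
        return 0  # nothing staged, nothing to check
    return subprocess.run(build_ruff_cmd(paths)).returncode
```

A hook script would typically end with `sys.exit(run_hook(sys.argv[1:]))`, so that any F821 finding aborts the commit.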
-
|
@Sandermage — pulled v7.68 dev tip ( Three findings worth flagging before you cut v7.69. TL;DR: PN25 v7.68 ✅ + PN34 ✅. PN30 v7.68 ❌, P103 ❌, PN32 ❌ on TP=1 + 24GB. ✅ PN25 v7.68 — works clean on TP=1
✅ PN34 — works clean (but default OFF caught us)
❌ PN30 v7.68: drift-marker false-positive breaks the patch

Your part3 has apply_all then escalates to FAILED → vLLM aborts. Fix: change part3's drift marker to something part3-specific, e.g.

❌ P103: wrap reports "rebound at 0 caller sites"

Confirmed broken: probe 7 (60K) hit Cliff 2 OOM and the trace went straight through

❌ PN32 alone doesn't close Cliff 2 on TP=1 + 24GB

After enabling PN32 + (broken) P103 with PN30 disabled (workaround for Finding 1) → Cliff 2 fired EARLIER than v7.66, at a 30K prompt instead of the usual 50-60K. PN32 chunks OOM trace at 30K:

Two ways this could land:
The 2×A5000 PROD doesn't hit this because TP=2 splits the GDN forward state across ranks; on TP=1 the full state lands on one card.

What we did

Stayed on master (Genesis v7.66 Happy to share full boot logs, dispatcher matrix, and OOM tracebacks for any of these; let me know which would be most useful for v7.69 triage. |
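The PN30 finding above (a drift marker that is not specific to part3) can be illustrated with a toy model. This is a sketch under assumptions: the thread does not show the patcher's real marker strings or its drift logic, so the tag text, the `has_drifted` helper, and the idea that a marker is a plain substring test are all hypothetical stand-ins.

```python
from typing import Sequence


def has_drifted(source: str, drift_markers: Sequence[str]) -> bool:
    """Report drift if any marker string is already present in the target."""
    return any(marker in source for marker in drift_markers)


# Hypothetical target file after an earlier sub-patch ("part2") has applied
# and left its own tag behind.
after_part2 = (
    "def store(dst, src):\n"
    "    # [Genesis PN30 part2] hypothetical tag left by part2\n"
    "    dst.copy_(src)\n"
)

too_generic = ["[Genesis PN30"]            # also matches part2's tag
part3_specific = ["[Genesis PN30 part3]"]  # only matches part3's own output

print(has_drifted(after_part2, too_generic))     # True
print(has_drifted(after_part2, part3_specific))  # False
```

With the generic marker, part3 would be judged "drifted" purely because a sibling part ran first; the part3-specific marker avoids that false positive, which is the shape of the fix proposed in the comment.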
-
|
@noonghunna — pulled all three findings into v7.69 on dev (commit F1 — PN30 part3 drift-marker false-positive ✅ FIXEDYour diagnosis was exact. part2 (separate Tightened part3's drift markers to F2 — P103 setattr lost on
|
-
|
If this version passes validation and doesn't crash on your end, I will merge it into the main branch and we can lock in the first stable release :). It would be incredibly helpful if people could keep sharing their testing data... |
-
|
@Sandermage — first off, v7.69 turnaround speed is something else. F1 + F2 + F3 all rooted causes diagnosed correctly in your replies, 18 new tests, dispatcher composition matrix for ✅ F1 (PN30 part3 drift-marker) — confirmed workingDS layout active throughout. PN30 v7.68 part1+2+3 all APPLY clean, no ✅ F2 (P103 self-install hook in chunk.py) — confirmed firing on TP=1Trace at runtime hits
|
| T value | Invocations | Notes |
|---|---|---|
| 4128 | 394 | vLLM's chunked-prefill chunk size (capped by max_num_batched_tokens=4128) |
| 64 | 48 | cudagraph warmup or MTP verify path |
| > 4128 | 0 | Never seen |
q.shape[0] = 1 always. cu_shape = torch.Size([2]) always. So _single_seq_cu = True and _true_varlen_multi_seq = False for every invocation.
The P103 chunked path never engages because q.shape[1] <= _MAX_T (4128 ≤ 16384) is always true on real serving — vLLM's outer chunked-prefill is already capping T at 4128, well below MAX_T. PN32 v2's outer-level chunking has the same effect (chunk size 8192, threshold 16384). Neither closure mechanism fires.
We tested forcing it via GENESIS_FLA_FWD_H_MAX_T=2048. Chunked path engaged, per-call allocation halved (50→24 MiB), but cumulative state grew slightly and OOM fired earlier in absolute call count. Confirmed Codex's hypothesis: the issue isn't a single allocation that needs splitting; it's accumulated activation residency that the 50 MiB late-stage allocation can't fit into.
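To make the threshold behaviour above concrete, here is a small sketch of when a chunked path would engage and how a sequence of length T would be split. The fixed-stride split is an assumption for illustration; the thread does not show how P103 actually partitions the sequence, and the helper names are hypothetical.

```python
from typing import List, Tuple


def takes_chunked_path(total_t: int, max_t: int) -> bool:
    """The chunked path only engages when T exceeds the threshold."""
    return total_t > max_t


def chunk_bounds(total_t: int, max_t: int) -> List[Tuple[int, int]]:
    """Split [0, total_t) into consecutive half-open ranges of at most max_t."""
    if total_t < 0 or max_t <= 0:
        raise ValueError("total_t must be >= 0 and max_t must be > 0")
    return [(start, min(start + max_t, total_t))
            for start in range(0, total_t, max_t)]


# T observed on real serving (4128) against the default threshold (16384):
print(takes_chunked_path(4128, 16384))  # False
# Same T with the threshold forced down to 2048, as in the experiment above:
print(takes_chunked_path(4128, 2048))   # True
print(chunk_bounds(4128, 2048))         # [(0, 2048), (2048, 4096), (4096, 4128)]
```

This matches the observation in the table: with T capped at 4128 by the outer chunked-prefill, the default threshold is never crossed, and only lowering `GENESIS_FLA_FWD_H_MAX_T` makes the inner path fire.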
The actual closure: vllm#35975 backport + mem-util tuning + bisect data
ChatGPT/Codex round 2 diagnosed it as headroom rather than gate logic, and called out vllm#35975 (open) as a directly-relevant fix — skips inputs_embeds GPU buffer for text-only models, claims ~64 MiB savings.
We backported it locally as a setup-time text-patch. Combined with mem-util tuning, the matrix:
| Config | Boot resident | 60K MTP-on | Wall | Notes |
|---|---|---|---|---|
| 0.95 (baseline) | 23,164 MiB | ❌ OOM 50/24.5 free | n/a | Cliff 2 fires |
| 0.95 + #35975 | 22,720 MiB | ❌ OOM 50/46.5 free | n/a | #35975 freed 444 MiB at boot, only 22 MiB margin at peak |
| 0.92 + #35975 | 21,980 MiB | ✅ HTTP 200 | 689s | Cliff 2 closed. ~580 MiB end-of-run margin, AL=4.00 |
| 0.93 + #35975 | 22,260 MiB | ✅ HTTP 200 | 623s | Cliff 2 closed. ~494 MiB margin. Best balanced point |
Plus MTP-off + 0.95: 60K passes in 504s with full 5+ GiB KV pool — different shipping variant for users who want max KV pool.
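The boot-resident column of the matrix above can be turned into "MiB freed versus baseline" with a few lines. This is only a sketch of the arithmetic on the figures already in the table; the dictionary keys are labels chosen here, not config names from the repo.

```python
# Boot-resident memory per config, in MiB, copied from the table above.
boot_resident_mib = {
    "0.95 (baseline)": 23164,
    "0.95 + #35975": 22720,
    "0.92 + #35975": 21980,
    "0.93 + #35975": 22260,
}

baseline = boot_resident_mib["0.95 (baseline)"]

# How much each config frees at boot relative to the 0.95 baseline.
freed_vs_baseline = {
    name: baseline - mib for name, mib in boot_resident_mib.items()
}

for name, freed in freed_vs_baseline.items():
    print(f"{name:>16}: {freed:>5} MiB freed at boot")
```

The `0.95 + #35975` row reproduces the 444 MiB figure quoted in the table's notes; the two lower mem-util rows free 1184 MiB and 904 MiB respectively, which is the headroom that lets the 60K probe pass.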
Per Codex's framing post-bisect, three explicit variants to ship:
- Balanced MTP (`long-text.yml`, updated): Genesis v7.69 + your full env bundle + Codex P103 gate fix + #35975 sidecar + 0.93 mem-util + MTP K=3 retained. Cliff 2 closed at 60K with spec-decode acceleration. KV concurrency at 180K: ~1.4x.
- Max-context safety (`long-text-no-mtp.yml`): same minus `--speculative-config`, full 0.95 mem-util. For long-shot RAG/codebase prompts where slow decode is OK in exchange for max KV pool stability.
- Future upstream win: vllm#37429 hybrid Mamba/attention KV cache sizing. If it applies cleanly, could free residency without trading mem-util. Untested, separate branch experiment.
90K probe at 0.93 + max_tokens=1 (prefill-only timing) is in flight as we draft this; result will update the recipe with a confirmed Cliff 2 ceiling figure.
On Codex's P103 gate fix recommendation
We applied the gate change anyway:

    -if cu_seqlens is not None or q.shape[1] <= _MAX_T:
    +_single_seq_cu = (cu_seqlens is not None and q.shape[0] == 1
    +                  and cu_seqlens.shape[0] == 2)
    +_true_varlen_multi_seq = cu_seqlens is not None and not _single_seq_cu
    +if _true_varlen_multi_seq or q.shape[1] <= _MAX_T:

Plus canonicalize cu_seqlens=None inside the chunked path (since [0,T] is dense B=1 semantically). It's the right semantic fix: the previous gate blocked single-seq cu_seqlens unnecessarily, which would matter on configs without vLLM's outer chunked-prefill capping T. Worth shipping in v7.70 even though it's not what closes Cliff 2 on our specific config. Diff in this discussion's attached file (or I can open a PR if you'd prefer).
PN32 v2's analogous gate (multi-seq bypass) has the same property. For users running spec-decode + long single-prompt, the right composition is: chunked-prefill at outer level (4128 cap) + #35975 freeing residency + 0.92 mem-util freeing activation budget. P103's chunked path is a defense-in-depth for cases where T > 16384 reaches FLA directly (synthetic benchmarks, non-chunked-prefill configs).
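The difference between the old and new P103 gate can be checked in isolation. The sketch below mirrors the two conditions from the quoted diff, but takes shape tuples instead of tensors so it runs without torch; the function names and the `(B, T)` shape convention are assumptions made for illustration. A return value of True means "skip the chunked path".

```python
from typing import Optional, Tuple

MAX_T = 16384  # threshold value quoted in the thread


def skips_chunking_old(q_shape: Tuple[int, int],
                       cu_shape: Optional[Tuple[int]]) -> bool:
    """Old gate: any cu_seqlens at all bypasses the chunked path."""
    return cu_shape is not None or q_shape[1] <= MAX_T


def skips_chunking_new(q_shape: Tuple[int, int],
                       cu_shape: Optional[Tuple[int]]) -> bool:
    """New gate: only true multi-sequence varlen bypasses the chunked path."""
    single_seq_cu = (cu_shape is not None and q_shape[0] == 1
                     and cu_shape[0] == 2)
    true_varlen_multi_seq = cu_shape is not None and not single_seq_cu
    return true_varlen_multi_seq or q_shape[1] <= MAX_T


long_single = ((1, 60000), (2,))  # one long sequence, cu_seqlens = [0, T]
long_multi = ((1, 60000), (3,))   # two packed sequences
serving = ((1, 4128), (2,))       # what real serving produced in the trace

print(skips_chunking_old(*long_single), skips_chunking_new(*long_single))  # True False
print(skips_chunking_old(*long_multi), skips_chunking_new(*long_multi))    # True True
print(skips_chunking_old(*serving), skips_chunking_new(*serving))          # True True
```

Only the long single-sequence case changes behaviour, which is consistent with the comment: the fix is semantically right, but it does not alter anything on a config where T is already capped at 4128.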
Cross-stack signal: PFlash from Luce-Org
@troymroberts surfaced this in club-3090#25 — Sandro Puppo's announcement + the lucebox blog. Not asking you to integrate, but flagging because the architectural overlap with PN26b is interesting:
- PFlash is a long-context prefill accelerator (vs DFlash which is decode). Uses small drafter (Qwen3-0.6B) + block-sparse attention to score token importance, compresses 128K → ~6.5K tokens before target prefill runs.
- Headline claim: TTFT 24.8s vs 257s vanilla llama.cpp at 128K (~10.4× speedup)
- Block-sparse drafter attention on SM86 — the kernel surface is exactly what your PN26b targets. If a community vLLM port of PFlash emerges, your existing sparse-V infrastructure could be directly reusable.
- C++/CUDA only today, lives in lucebox-hub. Tracked in our `docs/UPSTREAM.md` as a watch entry.
club-3090 plans to explore integration once lucebox-hub server stabilizes (currently has the daemon-mode + greedy-only quirks documented). Mentioning here in case you were unaware of it.
📱 Twitter handle?
Last housekeeping ask: what's your Twitter / X handle? We're starting to do more public posts (rewrote the pinned welcome to "club-3090 is open to all CUDA hardware" yesterday; likely more once v7.69 lands stable + we have NVLink + 4×3090 + modded-3080 cross-rig data points). We want to credit you properly when posting — your patches are the single biggest reason this stack performs the way it does.
If you'd rather stay off social platforms, that's fine — just say so and we'll credit you as @Sandermage on GitHub instead.
On Blackwell server > consumer
Your reasoning (24/7 reliability, ECC, lower power per useful FLOP) is the right framing for PROD. When the Blackwell server tier comes within reach, the cross-rig story flips — we become your test surface for Ampere consumer, you become the reference for Blackwell datacenter. Useful split.
Wishing peace and clear sky.
-
|
Thanks for the insights. I’m also considering integrating Codex into my validation workflow on a permanent basis; it’s a great time-saver for catching things I might have missed or messed up. Regarding PFlash, I’ve looked into it, but it’s not a top priority for now, even though the tech is interesting. We’ll see how it goes—I don’t want to make promises I can’t keep. Right now, my main focus is stabilizing the entire stack; new features and improvements will come after that. I’m currently reworking the structure and developing an installer so that everything can be set up with a single click or command, bringing the whole project into a more logical and streamlined form. I’ll be away from my desk until Monday. Taking a little break to recharge, otherwise my brain is going to go 'boom' :). All fixes and updates will resume on Monday. I’m not very active on Twitter (X) yet—mostly just reading—but you can find me here: X: https://x.com/AleksandrBarzov Instagram: https://www.instagram.com/sander_odessa/ Facebook: https://www.facebook.com/sander.odessa/ I probably need to change my approach to social media, since the patcher and everything I’m doing is primarily aimed at the English-speaking community. Thanks again for the feedback, and have a great weekend! |
-
|
@Sandermage — the PN95 tier-aware-cache shape is exactly the right primitive set. Keeping Mamba SSM state on GPU ( Two things from our side that might be worth threading into the 1. vllm#41434 landed 2026-05-08 — eliminates several GPU↔CPU syncs in attention impls. Measuring ~15% vanilla-path TPS lift on Qwen3-Next between pre-#41434 ( 2. Cross-model head-to-head this week — Discussion #119: Gemma 4 31B vs Qwen 3.6 27B on dual 3090 across 6 configs (INT8 PTH / bf16 / TQ3 patch-only / TQ3+MTP Genesis-backed), same vLLM pin + rebench-full harness for legs 1-5, Genesis leg pinned at your allowlist No pressure on the rename/refactor timing — the framework rework is clearly the right architectural move. Happy to fold a |
-
|
As soon as I can release the new version, everything will become much simpler... and better. |
-
|
It all started as a quick, makeshift project for myself—just tweaking things here and there, and suddenly it was published. |
-
|
@Sandermage — 50-60% on a project whose scope has grown like this is genuinely impressive. The pivot from "collection of patches" to "deployment platform with config profiles + a review pipeline for outside contributors" is the architectural shift that separates a hobby fork from infrastructure. Most community patch work caps out at the patch surface; you're past that. Rooting for you. When sndr_core is ready for cross-rig testing, we have the Two upstream-vLLM items we've been tracking — no action needed, just so they're on your radar for the next pin-allowlist refresh:
Ship it when it's right. 🍻 |
-
|
Hi everyone, |
-
|
I’ve updated the project and pushed it to GitHub. |
-
|
🎉 Congratulations, Sander — this has been a long road and it's great to see sndr_core actually land. We've followed the whole arc here (the rename, the pivot from a patch collection to a full deploy platform with profiles + a contributor pipeline, the Gemma 4 / DiffusionGemma work, the new Web GUI), and shipping it with a 3–7 day pin cadence is exactly the piece that was missing. Honest status from our side: our Genesis-backed composes are currently retired. During the long gap the nightly image they were pinned to got purged from Docker Hub (the usual mutable-nightly problem), and with the Genesis path on hold awaiting this rework, we archived them rather than ship users a config that 404s on pull. Your new cadence fixes that at the root — pins tracking current, supported nightlies on a regular schedule are exactly what keeps these alive instead of rotting between releases. So we'll work on reviving them in due course: re-anchor against the sndr_core release on a current pin, re-run our full Really happy for you — ship it. 🍻 |
-
|
Updates for the new pins will be continuous. It took a long time because I reworked and rethought a lot of things, and there was always something I wasn't happy with. |
-
|
The patch-site address-change check is the right architecture for a fast cadence — the failure mode that bites with vendored patches isn't a clean apply-failure, it's a patch landing on a code path that quietly moved and mis-hooking, only caught later by corrupted output. Verifying the site before applying and failing loud is what makes frequent pin bumps safe. The capability we valued most was TurboQuant sub-8-bit KV on vLLM — on Ampere, vanilla vLLM only gives us fp8 or int8-PTH KV, and TQ3 was the one way to cut KV-pool pressure below that without leaving vLLM (big on VRAM-tight serving). Two questions before we plan anything: (1) did the TurboQuant KV path survive the rework into sndr_core, and on roughly which vLLM pin? (2) Is that KV backend consumable standalone — vendored as a patch onto our own stock vLLM image — or is it now coupled to the sndr_core launcher/profiles? Our integration is compose/registry-driven on our own base, so those two answers decide whether (and how cheaply) we can bring it back. Keen to look either way. 🍻 |
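The "verify the site, then fail loud" behaviour described above can be sketched as a text patch that refuses to apply unless its anchor is found exactly once. This is an illustrative model, not the sndr_core implementation; `PatchSiteError` and `apply_text_patch` are hypothetical names.

```python
class PatchSiteError(RuntimeError):
    """Raised when the patch site cannot be located unambiguously."""


def apply_text_patch(source: str, anchor: str, replacement: str) -> str:
    """Replace `anchor` with `replacement`, but only if it occurs exactly once.

    Zero matches means the upstream code moved or changed; more than one
    match means the anchor is ambiguous. Both cases raise instead of guessing.
    """
    hits = source.count(anchor)
    if hits != 1:
        raise PatchSiteError(
            f"expected exactly 1 anchor match, found {hits}; refusing to patch"
        )
    return source.replace(anchor, replacement, 1)


patched = apply_text_patch("x = old_call()\n", "old_call()", "new_call()")
print(patched, end="")  # x = new_call()
```

Raising on a missing or duplicated anchor turns a silent mis-hook into an immediate boot-time failure, which is the property that makes frequent pin bumps safe.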
-
|
I don't just copy patch fixes from the vLLM repository; I adapt them by studying the code, often rewriting it to make it more structurally sound and correct. I also analyze exactly what is being changed and how, actively looking to see if a better solution is possible. |
-
|
This is impressive work, and the multi-engine hub direction — vLLM now, SGLang next, one cross-platform front-end — is genuinely ambitious and coherent for the "I just want it to run" audience. Here's the honest shape of how we'd fit it, and the one question that decides it. We already run our own multi-engine orchestration — a curated catalog that boots across vLLM, llama.cpp, SGLang, and community forks (e.g. we treat Anbeeld's beellama.cpp as a first-class engine: pin its image, wrap it in our composes, drive it headless under our launcher + TUI). So the hub / cross-engine management layer is the part we've already built — we wouldn't adopt sndr_core there; it'd overlap our own stack. What's genuinely interesting to us is the layer below the hub — your per-engine patched builds. The deciding question: is each engine sndr_core supports — your patched vLLM today, your patched SGLang later — independently runnable as a standalone, headless, pinnable serving backend, without the hub? If yes, we'd consume them exactly like any other engine in our catalog (a lane for your vLLM, later one for your SGLang), getting your patch work across engines while keeping our own orchestration — and you keep full ownership of the hub for your audience. If the value only comes together through the hub, that's a fair design — just not one we can slot in, since that's our layer. So: standalone-consumable per engine, or hub-centric? That single answer tells us exactly how — and whether — we plug in. Either way, rooting for it. 🍻 |
-
|
Quick, direct answers to your two technical questions (comment 43) and the architecture one (comment 45). Thanks for the precise framing.

1. Did TurboQuant sub-8-bit KV (TQ3 / k8v4) survive the rework, and on which pin?

Yes. It's core, not a casualty. TurboQuant k8v4 KV is exactly what both PROD paths run today: the 27B (Qwen3.6-27B int4 AutoRound, hybrid GDN+Mamba) and the 35B (Qwen3.6-35B-A3B FP8), validated on the current pin

2. Is the TQ KV backend vendorable standalone (a patch onto a stock vLLM image) or coupled to the launcher/profiles?

Standalone-vendorable. The patches are a runtime overlay, not a fork: they live under

3. Standalone-consumable per engine, or hub-centric?

Standalone per engine. That's the honest design answer, and it's the one you want. The hub (the

To revive a Genesis-backed lane, concretely
Happy to hand you the exact env-flag set for the TQ-k8v4 + MTP path so the lane boots first try. One more on your mutable-nightly-purge pain: we now pin by the explicit-SHA nightly tag and document the digest per release, so a revived lane won't 404 on a GC'd |
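On the mutable-nightly problem raised above: a compose or registry pipeline can cheaply reject image references that are not pinned by content digest. A minimal sketch, assuming the usual `name@sha256:<64 hex chars>` reference form; the helper name is hypothetical and the digest in the example is a placeholder, not a real image.

```python
import re

# name@sha256:<64 lowercase hex characters>
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True only for references pinned by a sha256 content digest."""
    return bool(_DIGEST_REF.match(image_ref))


placeholder_digest = "a" * 64  # placeholder, not a real digest

print(is_digest_pinned(f"vllm/vllm-openai@sha256:{placeholder_digest}"))  # True
print(is_digest_pinned("vllm/vllm-openai:nightly"))                      # False
```

A digest pin only guarantees that the reference cannot silently point at different content; it does not stop a registry from garbage-collecting the image, which is why the follow-up comment asks about durable hosting as well.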
-
|
This is exactly the answer we were hoping for — thank you for the precision. Playing it back so we're aligned:
Yes please — I'll take the exact env-flag set for the TQ-k8v4 + MTP path. One ask on durability — and I think it's the natural next step for the project, not just for us. We consume community engines as pre-built, pinned images (e.g. Anbeeld's beellama: pin the official image, wrap it in our compose). You raised the nightly-purge point yourself, and it's worse than the digest-doc fixes: explicit-SHA Genuinely great that the rework kept the patch layer cleanly separable — that's what makes this worth doing. Send the flags and we'll get the lane booting. 🍻 |
-
|
One small naming thing while we're here, since you're mid-rename: what's the canonical name you'd like used going forward for the engine / patch overlay? We'll name our lane + docs to match it, so anyone hopping between our catalog and your project sees one consistent term. Specifically:
We're leaning |
-
Just shipped v7.64. Tried to address everything that came out of cross-rig
work over the last couple weeks, especially the cliffs you have been hitting
on the 3090s.
What is in
Bug fixes that close existing issues:
falls through to the broken upstream path under TQ k8v4 + FULL_AND_PIECEWISE
cudagraph. Tool-call went 0/5 → 7/7 on the 2× A5000 validation. Closes the
GQA-pow-2 compile error class.
silently skipping. PN17 frees 50-100 MiB on long-context FA2 (resolves
Cliff 1 mech A from your diagnosis), PN19 frees 200-500 MiB during model load.
wall TPS) but regress 35B FP8 (−4%). 27B default carries them, 35B default
does not. Documented per-model so nobody auto-enables across configs.
New launch script variants:
6 new docs files:
`docs/GLOSSARY.md` (terms), `docs/HARDWARE.md` (VRAM budget + GPU class),
`docs/FAQ.md`, `docs/CONFIGS.md` (add-your-own-model walkthrough),
`docs/CLIFFS.md` (8 cliffs catalogued), `CONTRIBUTING.md`.

Repo structure cleanup: tried to make navigation obvious so future-me
does not get lost. Doc map in README, per-launch scripts named by KV dtype +
workload.
Asking for
our A5000 numbers (95-100 TPS @ 256-512t). If it does not on 3090, that is
interesting and I want to know.
CONFIGS.md. Was the walkthrough enough to add your own model? What was missing?
PR #40898 (DFlash SWA, +25% acceptance length), PR #39419 (local argmax TP,
+9-30% on TP=2), PR #41306 mitigation (`--moe-backend=triton`). Which of
these would matter most for your workload?
If something looks off — please tell me. Tests on my side show no problem,
but I am open to being wrong if you have a counter-example.
Cheers and thanks for keeping this thing honest.