server: add margin for draft model for `fit` by am17an · Pull Request #23485 · ggml-org/llama.cpp

am17an · 2026-05-21T16:23:55Z

Overview

Use `common_get_device_memory_data` to estimate the fit parameters for the draft model, and add that estimate as extra margin before running `fit` on the main model. Resolves #23472.
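The idea in the overview can be sketched roughly as follows. This is a minimal standalone illustration, not the code from this PR: `add_draft_margin` and the plain vectors are hypothetical stand-ins, and the per-device draft estimates are assumed to have been computed already (the PR obtains them via `common_get_device_memory_data`).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp): grow each device's fit margin
// by the memory the draft model is estimated to need on that device.
//   fit_target[i]  - bytes to leave free on device i when fitting the main model
//   draft_bytes[i] - estimated draft-model bytes on device i (assumed input)
std::vector<size_t> add_draft_margin(std::vector<size_t> fit_target,
                                     const std::vector<size_t> & draft_bytes) {
    // Only touch devices that have both a target and an estimate,
    // mirroring the bounds check in the diff excerpt of this PR.
    const size_t n = std::min(fit_target.size(), draft_bytes.size());
    for (size_t i = 0; i < n; ++i) {
        fit_target[i] += draft_bytes[i];
    }
    return fit_target;
}
```

With the enlarged targets, the subsequent fit of the main model leaves room for the draft model instead of filling the device completely.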

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, my local Qwen helped a bit.

bartowski1182 · 2026-05-21T18:30:11Z

+                                if (i < params_base.fit_params_target.size()) {
+                                    SRV_DBG("[spec] adding %.2f MiB to fit_params_target for device %s\n",
+                                            bytes / (1024.0 * 1024.0), ggml_backend_dev_name(devs[j]));
+                                    params_base.fit_params_target[i] += bytes;


Maybe out of scope, but would it be worth tracking the original fit_params_target and printing a warning if loading MTP goes over the GPU threshold when not loading it wouldn't have?
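A check along the lines suggested here might look like the sketch below. Everything in it is hypothetical: `draft_causes_overflow` and its parameters are illustrative names, not existing llama.cpp functions, and the per-device byte counts are assumed to be available at the point where the margin is added.

```cpp
#include <cstddef>

// Hypothetical check (not part of llama.cpp): returns true when the extra
// draft/MTP margin is what pushes a device over its budget, i.e. the main
// model would have fit with the original target but not with the enlarged one.
//   free_bytes  - free memory on the device
//   base_target - original fit margin, before the draft margin is added
//   draft_bytes - additional margin reserved for the draft/MTP model
//   model_bytes - memory the main model needs on this device
bool draft_causes_overflow(size_t free_bytes, size_t base_target,
                           size_t draft_bytes, size_t model_bytes) {
    const bool fits_without_draft = model_bytes + base_target <= free_bytes;
    const bool fits_with_draft    = model_bytes + base_target + draft_bytes <= free_bytes;
    return fits_without_draft && !fits_with_draft;
}
```

A caller could log a warning whenever this returns true for any device, which requires keeping a copy of the original target before it is modified.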

miloslavnosek · 2026-05-21T18:33:52Z

Hopefully I won't clutter this PR, but does any of this affect the scenario where the draft model is loaded with spec-draft-model = xx.gguf rather than being baked in?

sourenaraya · 2026-05-21T19:34:14Z

Tested this branch on my 5070 Ti + 7900 XT setup (Vulkan); it looks like fit does not quite work as expected (at least, not as I expected).
Loaded Qwen3.6-27B-UD-Q4_K_XL; VRAM usage ended up something like this:
5070 Ti: ~13.4/15.9 GB
7900 XT: ~19.6/19.9 GB (basically maxed out).

Interestingly, VRAM usage increased in steps:

7+9.5
13.4+16.2
13.4+19.6

After loading, any request to the model triggers an OOM on amdgpu. Please note that I set amdgpu.gttsize=32, because llama.cpp has a very annoying tendency to use GTT and tank performance.
I end up tuning tensor-split manually so that nothing spills to GTT.
fit.log

server: add margin for draft model for fit

c473dc0

am17an requested review from a team and JohannesGaessler as code owners May 21, 2026 16:23

am17an mentioned this pull request May 21, 2026

Eval bug: -fit on ignoring the VRAM required by the built-in MTP leads to OOM crash #23472

Open

bartowski1182 reviewed May 21, 2026


github-actions bot added the examples and server labels May 21, 2026

use fit_params_min_ctx as floor for draft ctx

0b2f246
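One plausible reading of the second commit's title is that the context size used when estimating the draft model's memory is clamped from below by `fit_params_min_ctx`. The sketch below illustrates only that reading; `draft_ctx_for_estimate` and its parameters are hypothetical names, and the actual commit may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): pick the context size used for
// the draft-model memory estimate, never going below the configured minimum.
//   requested_ctx - context size requested for the draft model (0 if unset)
//   min_ctx       - assumed to correspond to fit_params_min_ctx
uint32_t draft_ctx_for_estimate(uint32_t requested_ctx, uint32_t min_ctx) {
    return std::max(requested_ctx, min_ctx);
}
```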

am17an wants to merge 2 commits into ggml-org:master from am17an:am17an/fix-fit-mtp-oom

4 participants
