server: add margin for draft model for fit #23485

Open
am17an wants to merge 2 commits into ggml-org:master from am17an:am17an/fix-fit-mtp-oom


@am17an am17an commented May 21, 2026

Overview

Use common_get_device_memory_data to estimate the draft model's per-device memory use, and add it as margin before running fit on the main model. Resolves #23472
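
For readers skimming the thread, the idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `add_draft_margin` is a hypothetical helper, and the per-device draft estimates are assumed to have been obtained already (in the PR they come from common_get_device_memory_data).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of llama.cpp: adds the draft model's
// estimated per-device memory use (in bytes) to the per-device fit margins
// (in bytes). Devices with no matching margin entry are skipped.
std::vector<std::uint64_t> add_draft_margin(std::vector<std::uint64_t> margins,
                                            const std::vector<std::uint64_t> & draft_bytes) {
    for (std::size_t i = 0; i < draft_bytes.size(); ++i) {
        if (i < margins.size()) {
            margins[i] += draft_bytes[i];
        }
    }
    return margins;
}
```

The main model would then be fitted against the enlarged margins, so the space the draft model needs is left free.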

Additional information

Requirements

if (i < params_base.fit_params_target.size()) {
    SRV_DBG("[spec] adding %.2f MiB to fit_params_target for device %s\n",
            bytes / (1024.0 * 1024.0), ggml_backend_dev_name(devs[j]));
    params_base.fit_params_target[i] += bytes;

Maybe out of scope, but would it be worth tracking the original fit_params_target and printing a warning if loading MTP goes over the GPU threshold when not loading it wouldn't have?
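
One possible shape for such a check, as a rough sketch only: `draft_alone_exceeds_budget` and all of its parameters are hypothetical names, and the byte counts are assumed to be known per device.

```cpp
#include <cstdint>

// Hypothetical check, not part of llama.cpp: returns true when the main model
// plus the original margin fits in free device memory, but adding the draft
// (MTP) estimate pushes the total over it. A caller could log a warning then.
bool draft_alone_exceeds_budget(std::uint64_t free_bytes,
                                std::uint64_t main_bytes,
                                std::uint64_t original_margin,
                                std::uint64_t draft_bytes) {
    const std::uint64_t without_draft = main_bytes + original_margin;
    const std::uint64_t with_draft    = without_draft + draft_bytes;
    return without_draft <= free_bytes && with_draft > free_bytes;
}
```

Keeping the original margin separate from the draft estimate is what makes the "would have fit without MTP" case distinguishable.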

@miloslavnosek

Hopefully I won't clutter this PR, but does any of this affect the scenario where the draft model is loaded with spec-draft-model = xx.gguf rather than being baked in?


sourenaraya commented May 21, 2026

Tested this branch on my 5070 Ti + 7900 XT setup (Vulkan); fit does not seem to work quite as expected (at least, as I expected).
Loaded Qwen3.6-27B-UD-Q4_K_XL; VRAM usage ended up roughly like this:
5070 Ti: ~13.4/15.9 GB
7900 XT: ~19.6/19.9 GB (basically maxed out).

Interestingly, VRAM usage increased in steps:

  1. 7+9.5
  2. 13.4+16.2
  3. 13.4+19.6

After loading, any request to the model triggers an OOM on amdgpu. Note that I set amdgpu.gttsize=32, because llama.cpp has a very annoying tendency to use GTT and tank performance.
I ended up tuning tensor-split manually to avoid spilling to GTT.
fit.log


Development

Successfully merging this pull request may close these issues.

Eval bug: -fit on ignoring the VRAM required by the built-in MTP leads to OOM crash

4 participants