server: add margin for draft model for fit #23485
if (i < params_base.fit_params_target.size()) {
    SRV_DBG("[spec] adding %.2f MiB to fit_params_target for device %s\n",
            bytes / (1024.0 * 1024.0), ggml_backend_dev_name(devs[j]));
    params_base.fit_params_target[i] += bytes;
Maybe out of scope, but would it be worth tracking the original fit_params_target and printing a warning when loading MTP goes over the GPU threshold but not loading it wouldn't have?
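A rough sketch of what such a check could look like, not code from this PR. It assumes that the per-device free memory, the main model's estimated usage, and both margin vectors (before and after the draft model's share is added) are available as plain byte counts. The function name `devices_pushed_over` and all of its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp): returns the indices of devices
// where the main model fits with the original margin but no longer fits once
// the draft-model margin has been added on top.
static std::vector<size_t> devices_pushed_over(
        const std::vector<size_t> & free_bytes,          // free memory per device
        const std::vector<size_t> & model_bytes,         // estimated main-model usage per device
        const std::vector<size_t> & margin_orig,         // margin before the draft share was added
        const std::vector<size_t> & margin_with_draft) { // margin after the draft share was added
    // Only compare devices that are present in all four vectors.
    const size_t n = std::min({free_bytes.size(), model_bytes.size(),
                               margin_orig.size(), margin_with_draft.size()});
    std::vector<size_t> result;
    for (size_t i = 0; i < n; ++i) {
        const bool fits_without_draft = model_bytes[i] + margin_orig[i]       <= free_bytes[i];
        const bool fits_with_draft    = model_bytes[i] + margin_with_draft[i] <= free_bytes[i];
        if (fits_without_draft && !fits_with_draft) {
            result.push_back(i); // the caller would print a warning for this device
        }
    }
    return result;
}
```

A caller could keep a copy of the margins before adding the draft estimate, then log one warning per returned index.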
Hopefully I won't clutter this PR, but does any of this affect the scenario where the draft model is loaded with
Tested this branch on my 5070 Ti + 7900 XT setup (Vulkan); it looks like fit does not quite work as expected (at least, as expected by me). Interestingly, VRAM usage increased in steps:
After loading, any request to the model triggers an OOM on amdgpu. Note that I set amdgpu.gttsize=32, because llama.cpp has a very annoying tendency to use GTT and tank performance.
Overview
Use `common_get_device_memory_data` to estimate fit params for the draft model, and add that as margin before calling `fit` on the main model. Resolves #23472
Additional information
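As an illustration of the margin idea only, here is a minimal, self-contained sketch that adds a per-device draft-model estimate to a per-device margin vector before the main model is fitted. The `device_estimate` struct and `add_draft_margin` function are hypothetical stand-ins rather than llama.cpp APIs, and the sketch assumes the estimates are ordered the same way as the margin vector.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-device memory estimate of the draft model.
struct device_estimate {
    std::string name;  // device name, used for logging only
    size_t      bytes; // estimated bytes the draft model needs on this device
};

// Add each draft-model estimate to the matching per-device margin.
// Assumption: draft[i] corresponds to fit_target[i]. Estimates without a
// matching slot in fit_target are skipped.
static void add_draft_margin(std::vector<size_t> & fit_target,
                             const std::vector<device_estimate> & draft) {
    for (size_t i = 0; i < draft.size(); ++i) {
        if (i < fit_target.size()) {
            std::printf("adding %.2f MiB to fit target for device %s\n",
                        draft[i].bytes / (1024.0 * 1024.0), draft[i].name.c_str());
            fit_target[i] += draft[i].bytes;
        }
    }
}
```

With the margins enlarged this way, a fit step that leaves at least the margin free on every device would also leave room for the draft model.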
Requirements