server: merge custom preset aliases into existing HF cached model entries #23952
Open
batyrrasulov wants to merge 1 commit into
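The change named in the title, merging a custom preset section into an existing cached entry when both refer to the same model, can be illustrated with a minimal sketch. The names `model_entry`, `add_preset`, and `resolve`, and the plain `std::map` registry, are hypothetical simplifications for illustration, not the server's actual data structures or this pull request's code.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified model entry (not the server's real type).
struct model_entry {
    std::string target;             // what the entry points to
    std::set<std::string> aliases;  // extra names that requests may use
};

using model_map = std::map<std::string, model_entry>;

// Register a custom preset section.
// If an existing entry already points to the same target, the section name
// is stored as an alias on that entry and no duplicate is created.
// Otherwise a new entry keyed by the section name is added.
// Returns the key of the entry that now serves `section`.
std::string add_preset(model_map & models,
                       const std::string & section,
                       const std::string & target) {
    for (auto & kv : models) {
        if (kv.second.target == target) {
            if (kv.first != section) {
                kv.second.aliases.insert(section);
            }
            return kv.first;
        }
    }
    models[section] = model_entry{target, {}};
    return section;
}

// Resolve a requested name: exact keys first, then aliases.
// Returns an empty string when nothing matches.
std::string resolve(const model_map & models, const std::string & name) {
    if (models.count(name) > 0) {
        return name;
    }
    for (const auto & kv : models) {
        if (kv.second.aliases.count(name) > 0) {
            return kv.first;
        }
    }
    return "";
}
```

In this sketch a request that uses the custom section name still reaches the merged entry, because `resolve` falls back to the alias set when no key matches.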
## Summary
- when a custom section name differs from an already discovered cached model key but points to the same , merge into the existing model entry instead of creating a duplicate
- keep the custom section name as an alias on the merged model so requests can still target it

## Validation
- -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: arm64
-- GGML_SYSTEM_ARCH: ARM
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
-- Including CPU backend
-- Accelerate framework found
-- ARM detected
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SVE
-- -mcpu=native+dotprod+i8mm+nosve+sme
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SVE;-mcpu=native+dotprod+i8mm+nosve+sme
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes:
-- Including BLAS backend
-- Metal framework found
-- Including METAL backend
-- ggml version: 0.13.1
-- ggml commit: 9b2af95
-- OpenSSL found: 3.6.2
-- Generating embedded license file for target: llama-app
-- Configuring done (1.0s)
-- Generating done (0.6s)
-- Build files have been written to: /Users/batyr.rasulov/Documents/Codex/2026-05-31/need-you-to-make-open-source/work/llama.cpp/build
- [ 2%] Building CXX object common/CMakeFiles/llama-common-base.dir/build-info.cpp.o
[ 2%] Built target cpp-httplib
[ 2%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[ 2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[ 2%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[ 2%] Linking CXX static library libllama-common-base.a
[ 4%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[ 4%] Built target llama-common-base
[ 4%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend-meta.cpp.o
[ 4%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[ 4%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[ 4%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[ 6%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
[ 6%] Linking CXX shared library ../../bin/libggml-base.dylib
[ 6%] Built target ggml-base
[ 6%] Linking CXX shared library ../../../bin/libggml-blas.dylib
[ 6%] Linking CXX shared library ../../../bin/libggml-metal.dylib
[ 6%] Linking CXX shared library ../../bin/libggml-cpu.dylib
[ 6%] Built target ggml-blas
[ 11%] Built target ggml-metal
[ 17%] Built target ggml-cpu
[ 17%] Linking CXX shared library ../../bin/libggml.dylib
[ 20%] Built target ggml
[ 20%] Linking CXX shared library ../bin/libllama.dylib
[ 86%] Built target llama
[ 86%] Linking CXX shared library ../bin/libllama-common.dylib
[100%] Built target llama-common
[100%] Linking CXX executable ../bin/test-arg-parser
[100%] Built target test-arg-parser
- [ 1%] Built target llama-common-base
[ 1%] Built target llama-ui-embed
[ 1%] Built target cpp-httplib
[ 5%] Built target ggml-base
[ 5%] Provisioning UI assets
-- UI: running npm install (first time)
[ 5%] Built target ggml-blas
[ 9%] Built target ggml-metal
[ 14%] Built target ggml-cpu
[ 16%] Built target ggml
-- UI: npm install failed (1)
-- stderr: npm error code EBADENGINE
npm error engine Unsupported engine
npm error engine Not compatible with your version of node/npm: @chromatic-com/storybook@5.0.0
npm error notsup Not compatible with your version of node/npm: @chromatic-com/storybook@5.0.0
npm error notsup Required: {"node":">=20.0.0","yarn":">=1.22.18"}
npm error notsup Actual: {"npm":"10.8.2","node":"v18.20.8"}
npm error A complete log of this run can be found in: /Users/batyr.rasulov/.npm/_logs/2026-06-01T02_25_02_261Z-debug-0.log
-- UI: downloading from b9445: https://huggingface.co/buckets/ggml-org/llama-ui/resolve/b9445
[ 70%] Built target llama
-- UI: download bundle.css from b9445 failed: "HTTP response code said error"
-- UI: downloading from latest: https://huggingface.co/buckets/ggml-org/llama-ui/resolve/latest
[ 81%] Built target llama-common
[ 81%] Linking CXX shared library ../../bin/libmtmd.dylib
[ 92%] Built target mtmd
-- UI: downloaded bundle.css
[ 96%] Built target server-context
-- UI: downloaded bundle.js
-- UI: downloaded index.html
-- UI: downloaded loading.html
-- UI: verifying checksums
-- UI: all checksums verified
-- UI: HF download succeeded, stamp updated (latest)
[ 96%] Built target llama-ui-assets
[ 98%] Built target llama-ui
[ 98%] Linking CXX shared library ../../bin/libllama-server-impl.dylib
[100%] Built target llama-server-impl
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
- ----- common params -----
-h, --help, --usage print usage and exit
--version show version and build info
-cl, --cache-list show list of models in cache
--completion-bash print source-able bash completion script for llama.cpp
-t, --threads N number of CPU threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : low(-1), normal(0), medium(1), high(2),
realtime(3) (default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
--swa-full use full-size SWA cache (default: false)
(more info)
(env: LLAMA_ARG_SWA_FULL)
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
--perf, --no-perf whether to enable internal libllama performance timings (default:
false)
(env: LLAMA_ARG_PERF)
-e, --escape, --no-escape whether to process escapes sequences (\n, \r, \t, ', ", \)
(default: true)
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from
model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: -1.00)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: -1.00)
(env: LLAMA_ARG_YARN_BETA_FAST)
-kvo, --kv-offload, -nkvo, --no-kv-offload
whether to enable KV cache offloading (default: enabled)
(env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack whether to enable weight repacking (default: enabled)
(env: LLAMA_ARG_REPACK)
--no-host bypass host buffer allowing extra buffers to be used
(env: LLAMA_ARG_NO_HOST)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (DEPRECATED)
(env: LLAMA_ARG_DEFRAG_THOLD)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap whether to memory-map model. (if mmap disabled, slower load but may
reduce pageouts if not using mlock) (default: enabled)
(env: LLAMA_ARG_MMAP)
-dio, --direct-io, -ndio, --no-direct-io
use DirectIO if available. (default: disabled)
(env: LLAMA_ARG_DIO)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see #1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't
offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
-ot, --override-tensor =,...
override tensor buffer type
(env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe keep all Mixture of Experts (MoE) weights in the CPU
(env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N keep the Mixture of Experts (MoE) weights of the first N layers in the
CPU
(env: LLAMA_ARG_N_CPU_MOE)
-ngl, --gpu-layers, --n-gpu-layers N max. number of layers to store in VRAM, either an exact number,
'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row,tensor}
how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs (pipelined)
- row: split weight across GPUs by rows (parallelized)
- tensor: split weights and KV across GPUs (parallelized,
EXPERIMENTAL)
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
-fit, --fit [on|off] whether to adjust unset arguments to fit in device memory ('on' or
'off', default: 'on')
(env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
target margin per device for --fit, comma-separated list of values,
single value is broadcast across all devices, default: 1024
(env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N minimum ctx size that can be set by --fit option, default: 4096
(env: LLAMA_ARG_FIT_CTX)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,... advanced option to override model metadata by key. to specify multiple
overrides, either use comma-separated values.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload whether to offload host tensor operations to device (default: true)
--lora FNAME path to LoRA adapter (use comma-separated values to load multiple
adapters)
--lora-scaled FNAME:SCALE,... path to LoRA adapter with user defined scaling (format:
FNAME:SCALE,...)
note: use comma-separated values
--control-vector FNAME add a control vector
note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
add a control vector with user defined scaling SCALE
note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path to load
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-dr, --docker-repo [/][:quant]
Docker Hub model repository. repo is optional, default to ai/. quant
is optional, default to :latest.
example: gemma3
(default: unused)
(env: LLAMA_ARG_DOCKER_REPO)
-hf, -hfr, --hf-repo /[:quant]
Hugging Face model repository; quant is optional, case-insensitive,
default to Q4_K_M, or falls back to the first file in the repo if
Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add
--no-mmproj
example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in
--hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v /[:quant]
Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
(env: LLAMA_ARG_LOG_FILE)
--log-colors [on|off|auto] Set colored logging ('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
(env: LLAMA_ARG_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for
debugging)
--offline Offline mode: forces use of cache, prevents network access
(env: LLAMA_ARG_OFFLINE)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be
ignored. Values:
- 0: generic output
- 1: error
- 2: warning
- 3: info
- 4: trace (more info)
- 5: debug
(default: 3)
--log-prefix, --no-log-prefix Enable prefix in log messages
(env: LLAMA_ARG_LOG_PREFIX)
--log-timestamps, --no-log-timestamps Enable timestamps in log messages
(env: LLAMA_ARG_LOG_TIMESTAMPS)
--spec-draft-type-k, -ctkd, --cache-type-k-draft TYPE
KV cache data type for K for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_K)
--spec-draft-type-v, -ctvd, --cache-type-v-draft TYPE
KV cache data type for V for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_V)
----- sampling params -----
--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default:
penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
simplified sequence for samplers that will be used (default:
edskypmxt)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--temp, --temperature N temperature (default: 0.80)
--top-k N top-k sampling (default: 40, 0 = disabled)
(env: LLAMA_ARG_TOP_K)
--top-p N top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers
('\n', ':', '"', '*') in the process; use "none" to not use any
sequence breakers
--adaptive-target N adaptive-p: select tokens near this probability (valid range 0.0 to
1.0; negative = disabled) (default: -1.00)
(more info)
--adaptive-decay N adaptive-p: decay rate for target adaptation over time. lower values
are more reactive, higher values are more stable.
(valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.00)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.00)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e.
--logit-bias 15043+1 to increase likelihood of token ' Hello', or
--logit-bias 15043-1 to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir)
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
{} for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations
(https://json-schema.org/), e.g.
{} for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-bs, --backend-sampling enable backend sampling (experimental) (default: disabled)
(env: LLAMA_ARG_BACKEND_SAMPLING)
----- speculative params -----
--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft /[:quant]
Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_SPEC_DRAFT_HF_REPO)
--spec-draft-threads, -td, --threads-draft N
number of threads to use during generation (default: same as
--threads)
--spec-draft-threads-batch, -tbd, --threads-batch-draft N
number of threads to use during batch and prompt processing (default:
same as --threads-draft)
--spec-draft-cpu-mask, -Cd, --cpu-mask-draft M
Draft model CPU affinity mask. Complements cpu-range-draft (default:
same as --cpu-mask)
--spec-draft-cpu-range, -Crd, --cpu-range-draft lo-hi
Ranges of CPUs for affinity. Complements --cpu-mask-draft
--spec-draft-cpu-strict, --cpu-strict-draft <0|1>
Use strict CPU placement for draft model (default: same as
--cpu-strict)
--spec-draft-prio, --prio-draft N set draft process/thread priority : 0-normal, 1-medium, 2-high,
3-realtime (default: 0)
--spec-draft-poll, --poll-draft <0|1> Use polling to wait for draft model work (default: same as --poll)
--spec-draft-cpu-mask-batch, -Cbd, --cpu-mask-batch-draft M
Draft model CPU affinity mask. Complements cpu-range-draft (default:
same as --cpu-mask)
--spec-draft-cpu-strict-batch, --cpu-strict-batch-draft <0|1>
Use strict CPU placement for draft model (default: --cpu-strict-draft)
--spec-draft-prio-batch, --prio-batch-draft N
set draft process/thread priority : 0-normal, 1-medium, 2-high,
3-realtime (default: 0)
--spec-draft-poll-batch, --poll-batch-draft <0|1>
Use polling to wait for draft model work (default: --poll-draft)
--spec-draft-override-tensor, -otd, --override-tensor-draft =,...
override tensor buffer type for draft model
--spec-draft-cpu-moe, -cmoed, --cpu-moe-draft
keep all Mixture of Experts (MoE) weights in the CPU for the draft
model
(env: LLAMA_ARG_SPEC_DRAFT_CPU_MOE)
--spec-draft-n-cpu-moe, --spec-draft-ncmoe, -ncmoed, --n-cpu-moe-draft N
keep the Mixture of Experts (MoE) weights of the first N layers in the
CPU for the draft model
(env: LLAMA_ARG_SPEC_DRAFT_N_CPU_MOE)
--spec-draft-n-max N number of tokens to draft for speculative decoding (default: 3)
(env: LLAMA_ARG_SPEC_DRAFT_N_MAX)
--spec-draft-n-min N minimum number of draft tokens to use for speculative decoding
(default: 0)
(env: LLAMA_ARG_SPEC_DRAFT_N_MIN)
--spec-draft-p-split, --draft-p-split P
speculative decoding split probability (default: 0.10)
(env: LLAMA_ARG_SPEC_DRAFT_P_SPLIT)
--spec-draft-p-min, --draft-p-min P minimum speculative decoding probability (greedy) (default: 0.00)
(env: LLAMA_ARG_SPEC_DRAFT_P_MIN)
--spec-draft-backend-sampling, --no-spec-draft-backend-sampling
offload draft sampling to the backend (default: enabled)
(env: LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING)
--spec-draft-device, -devd, --device-draft <dev1,dev2,..>
comma-separated list of devices to use for offloading the draft model
(none = don't offload)
use --list-devices to see a list of available devices
--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
max. number of draft model layers to store in VRAM, either an exact
number, 'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
--spec-draft-model, -md, --model-draft FNAME
draft model for speculative decoding (default: unused)
(env: LLAMA_ARG_SPEC_DRAFT_MODEL)
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
comma-separated list of types of speculative decoding to use (default:
none)
--spec-ngram-mod-n-min N minimum number of ngram tokens to use for ngram-based speculative
decoding (default: 48)
--spec-ngram-mod-n-max N maximum number of ngram tokens to use for ngram-based speculative
decoding (default: 64)
--spec-ngram-mod-n-match N ngram-mod lookup length (default: 24)
--spec-ngram-simple-size-n N ngram size N for ngram-simple speculative decoding, length of lookup
n-gram (default: 12)
--spec-ngram-simple-size-m N ngram size M for ngram-simple speculative decoding, length of draft
m-gram (default: 48)
--spec-ngram-simple-min-hits N minimum hits for ngram-simple speculative decoding (default: 1)
--spec-ngram-map-k-size-n N ngram size N for ngram-map-k speculative decoding, length of lookup
n-gram (default: 12)
--spec-ngram-map-k-size-m N ngram size M for ngram-map-k speculative decoding, length of draft
m-gram (default: 48)
--spec-ngram-map-k-min-hits N minimum hits for ngram-map-k speculative decoding (default: 1)
--spec-ngram-map-k4v-size-n N ngram size N for ngram-map-k4v speculative decoding, length of lookup
n-gram (default: 12)
--spec-ngram-map-k4v-size-m N ngram size M for ngram-map-k4v speculative decoding, length of draft
m-gram (default: 48)
--spec-ngram-map-k4v-min-hits N minimum hits for ngram-map-k4v speculative decoding (default: 1)
--draft, --draft-n, --draft-max N the argument has been removed. use --spec-draft-n-max or
--spec-ngram-mod-n-max
(env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N the argument has been removed. use --spec-draft-n-min or
--spec-ngram-mod-n-min
(env: LLAMA_ARG_DRAFT_MIN)
--spec-ngram-size-n N the argument has been removed. use the respective
--spec-ngram-*-size-n or --spec-ngram-mod-n-match
--spec-ngram-size-m N the argument has been removed. use the respective
--spec-ngram-*-size-m
--spec-ngram-min-hits N the argument has been removed. use the respective
--spec-ngram-*-min-hits
----- example-specific params -----
-lcs, --lookup-cache-static FNAME path to static lookup cache to use for lookup decoding (not updated by
generation)
-lcd, --lookup-cache-dynamic FNAME path to dynamic lookup cache to use for lookup decoding (updated by
generation)
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
max number of context checkpoints to create per slot (default:
32) (more info)
(env: LLAMA_ARG_CTX_CHECKPOINTS)
-cms, --checkpoint-min-step N minimum spacing between context checkpoints in tokens (default: 256, 0
= no minimum)
(env: LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT)
-cram, --cache-ram N set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
disable) (more info)
(env: LLAMA_ARG_CACHE_RAM)
-kvu, --kv-unified, -no-kvu, --no-kv-unified
use single unified KV buffer shared across all sequences (default:
enabled if number of slots is auto)
(env: LLAMA_ARG_KV_UNIFIED)
--cache-idle-slots, --no-cache-idle-slots
save and clear idle slots on new task (default: enabled, requires
unified KV and cache-ram)
(env: LLAMA_ARG_CACHE_IDLE_SLOTS)
--context-shift, --no-context-shift whether to use context shift on infinite text generation (default:
disabled)
(env: LLAMA_ARG_CONTEXT_SHIFT)
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-sp, --special special tokens output enabled (default: false)
--warmup, --no-warmup whether to perform warmup with an empty run (default: enabled)
--spm-infill use Suffix/Prefix/Middle pattern for infill (instead of
Prefix/Suffix/Middle) as some models prefer this. (default: disabled)
--pooling {none,mean,cls,last,rank} pooling type for embeddings, use model default if unspecified
(env: LLAMA_ARG_POOLING)
-np, --parallel N number of server slots (default: -1, -1 = auto)
(env: LLAMA_ARG_N_PARALLEL)
-cb, --cont-batching, -nocb, --no-cont-batching
whether to enable continuous batching (a.k.a dynamic batching)
(default: enabled)
(env: LLAMA_ARG_CONT_BATCHING)
-mm, --mmproj FILE path to a multimodal projector file. see tools/mtmd/README.md
note: if -hf is used, this argument can be omitted
(env: LLAMA_ARG_MMPROJ)
-mmu, --mmproj-url URL URL to a multimodal projector file. see tools/mtmd/README.md
(env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
whether to use multimodal projector file (if available), useful when
using -hf (default: enabled)
(env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload whether to enable GPU offloading for multimodal projector (default:
enabled)
(env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image-min-tokens N minimum number of tokens each image can take, only used by vision
models with dynamic resolution (default: read from model)
(env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N maximum number of tokens each image can take, only used by vision
models with dynamic resolution (default: read from model)
(env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-a, --alias STRING set model name aliases, comma-separated (to be used by API)
(env: LLAMA_ARG_ALIAS)
--tags STRING set model tags, comma-separated (informational, not used for routing)
(env: LLAMA_ARG_TAGS)
--embd-normalize N normalisation for embeddings (default: 2) (-1=none, 0=max absolute
int16, 1=taxicab, 2=euclidean, >2=p-norm)
--host HOST ip address to listen, or bind to an UNIX socket if the address ends
with .sock (default: 127.0.0.1)
(env: LLAMA_ARG_HOST)
--port PORT port to listen (default: 8080)
(env: LLAMA_ARG_PORT)
--reuse-port allow multiple sockets to bind to the same port (default: disabled)
(env: LLAMA_ARG_REUSE_PORT)
--path PATH path to serve static files from (default: )
(env: LLAMA_ARG_STATIC_PATH)
--api-prefix PREFIX prefix path the server serves from, without the trailing slash
(default: )
(env: LLAMA_ARG_API_PREFIX)
--webui-config JSON [DEPRECATED: use --ui-config] JSON that provides default WebUI
settings (overrides WebUI defaults)
(env: LLAMA_ARG_WEBUI_CONFIG)
--ui-config JSON JSON that provides default UI settings (overrides UI defaults)
(env: LLAMA_ARG_UI_CONFIG)
--webui-config-file PATH [DEPRECATED: use --ui-config-file] JSON file that provides default
WebUI settings (overrides WebUI defaults)
(env: LLAMA_ARG_WEBUI_CONFIG_FILE)
--ui-config-file PATH JSON file that provides default UI settings (overrides UI defaults)
(env: LLAMA_ARG_UI_CONFIG_FILE)
--webui-mcp-proxy, --no-webui-mcp-proxy
[DEPRECATED: use --ui-mcp-proxy/--no-ui-mcp-proxy] experimental:
whether to enable MCP CORS proxy
(env: LLAMA_ARG_WEBUI_MCP_PROXY)
--ui-mcp-proxy, --no-ui-mcp-proxy experimental: whether to enable MCP CORS proxy - do not enable in
untrusted environments (default: disabled)
(env: LLAMA_ARG_UI_MCP_PROXY)
--tools TOOL1,TOOL2,... experimental: whether to enable built-in tools for AI agents - do not
enable in untrusted environments (default: no tools)
specify "all" to enable all tools
available tools: read_file, file_glob_search, grep_search,
exec_shell_command, write_file, edit_file, apply_diff, get_datetime
(env: LLAMA_ARG_TOOLS)
--webui, --no-webui [DEPRECATED: use --ui/--no-ui] whether to enable the Web UI
(env: LLAMA_ARG_WEBUI)
--ui, --no-ui whether to enable the Web UI (default: enabled)
(env: LLAMA_ARG_UI)
--embedding, --embeddings restrict to only support embedding use case; use only with dedicated
embedding models (default: disabled)
(env: LLAMA_ARG_EMBEDDINGS)
--rerank, --reranking enable reranking endpoint on server (default: disabled)
(env: LLAMA_ARG_RERANKING)
--api-key KEY API key to use for authentication, multiple keys can be provided as a
comma-separated list (default: none)
(env: LLAMA_API_KEY)
--api-key-file FNAME path to file containing API keys (default: none)
(env: LLAMA_ARG_API_KEY_FILE)
--ssl-key-file FNAME path to file a PEM-encoded SSL private key
(env: LLAMA_ARG_SSL_KEY_FILE)
--ssl-cert-file FNAME path to file a PEM-encoded SSL certificate
(env: LLAMA_ARG_SSL_CERT_FILE)
--chat-template-kwargs STRING sets additional params for the json template parser, must be a valid
json object string, e.g. '{"key1":"value1","key2":"value2"}'
(env: LLAMA_ARG_CHAT_TEMPLATE_KWARGS)
-to, --timeout N server read/write timeout in seconds (default: 3600)
(env: LLAMA_ARG_TIMEOUT)
--threads-http N number of threads used to process HTTP requests (default: -1)
(env: LLAMA_ARG_THREADS_HTTP)
--cache-prompt, --no-cache-prompt whether to enable prompt caching (default: enabled)
(env: LLAMA_ARG_CACHE_PROMPT)
--cache-reuse N min chunk size to attempt reusing from the cache via KV shifting,
requires prompt caching to be enabled (default: 0)
(env: LLAMA_ARG_CACHE_REUSE)
--metrics enable prometheus compatible metrics endpoint (default: disabled)
(env: LLAMA_ARG_ENDPOINT_METRICS)
--props enable changing global properties via POST /props (default: disabled)
(env: LLAMA_ARG_ENDPOINT_PROPS)
--slots, --no-slots expose slots monitoring endpoint (default: enabled)
(env: LLAMA_ARG_ENDPOINT_SLOTS)
--slot-save-path PATH path to save slot kv cache (default: disabled)
--media-path PATH directory for loading local media files; files can be accessed via
file:// URLs using relative paths (default: disabled)
--models-dir PATH directory containing models for the router server (default: disabled)
(env: LLAMA_ARG_MODELS_DIR)
--models-preset PATH path to INI file containing model presets for the router server
(default: disabled)
(env: LLAMA_ARG_MODELS_PRESET)
--models-max N for router server, maximum number of models to load simultaneously
(default: 4, 0 = unlimited)
(env: LLAMA_ARG_MODELS_MAX)
--models-autoload, --no-models-autoload
for router server, whether to automatically load models (default:
enabled)
(env: LLAMA_ARG_MODELS_AUTOLOAD)
--jinja, --no-jinja whether to use jinja template engine for chat (default: enabled)
(env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT controls whether thought tags are allowed and/or extracted from the
response, and in which format they're returned; one of:
- none: leaves thoughts unparsed in `message.content`
- deepseek: puts thoughts in `message.reasoning_content`
- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`
(default: auto)
(env: LLAMA_ARG_THINK)
-rea, --reasoning [on|off|auto] Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
'auto' (detect from template))
(env: LLAMA_ARG_REASONING)
--reasoning-budget N token budget for thinking: -1 for unrestricted, 0 for immediate end,
N>0 for token budget (default: -1)
(env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE message injected before the end-of-thinking tag when reasoning budget
is exhausted (default: none)
(env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's
metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
granite-4.0, granite-4.1, grok-2, hunyuan-dense, hunyuan-moe,
hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch,
openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss,
smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
granite-4.0, granite-4.1, grok-2, hunyuan-dense, hunyuan-moe,
hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch,
openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss,
smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
force a pure content parser, even if a Jinja template is specified;
model will output everything in the content section, including any
reasoning and/or tool calls (default: disabled)
(env: LLAMA_ARG_SKIP_CHAT_PARSING)
--prefill-assistant, --no-prefill-assistant
whether to prefill the assistant's response if the last message is an
assistant message (default: prefill enabled)
when this flag is set, if the last message is an assistant message
then it will be treated as a full message and not prefilled
-sps, --slot-prompt-similarity SIMILARITY
how much the prompt of a request must match the prompt of a slot in
order to use that slot (default: 0.10, 0.0 = disabled)
--lora-init-without-apply load LoRA adapters without applying them (apply later via POST
/lora-adapters) (default: disabled)
--sleep-idle-seconds SECONDS number of seconds of idleness after which the server will sleep
(default: -1; -1 = disabled)
-mv, --model-vocoder FNAME vocoder model for audio generation (default: unused)
--tts-use-guide-tokens Use guide tokens to improve TTS word recall
--embd-gemma-default use default EmbeddingGemma model (note: can download weights from the
internet)
--fim-qwen-1.5b-default use default Qwen 2.5 Coder 1.5B (note: can download weights from the
internet)
--fim-qwen-3b-default use default Qwen 2.5 Coder 3B (note: can download weights from the
internet)
--fim-qwen-7b-default use default Qwen 2.5 Coder 7B (note: can download weights from the
internet)
--fim-qwen-7b-spec use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can
download weights from the internet)
--fim-qwen-14b-spec use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note:
can download weights from the internet)
--fim-qwen-30b-default use default Qwen 3 Coder 30B A3B Instruct (note: can download weights
from the internet)
--gpt-oss-20b-default use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default enable default speculative decoding config

Fixes #23931