
Conversation

ngxson (Collaborator) commented on Aug 20, 2024

Motivation

When deploying to HF Inference Endpoints, we only have control over the environment variables that can be passed to Docker. That's why, currently, we need to build a custom container and specify these variables via LLAMACPP_ARGS (ref: #9041).

This PR adds environment-variable equivalents for some server-related arguments (see the full list in server/README.md).

Variables are prefixed with LLAMA_ARG_ to distinguish them from compile-time variables like LLAMA_CURL.

Example

```sh
LLAMA_ARG_MODEL=../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
LLAMA_ARG_CTX_SIZE=1024 \
LLAMA_ARG_N_PARALLEL=2 \
LLAMA_ARG_ENDPOINT_METRICS=1 \
./llama-server
```
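Since the motivation is Docker-based deployment, here is a sketch of passing the same variables to a containerized server. The image tag and mount path are illustrative assumptions, not part of this PR:

```sh
# Illustrative sketch only: the image tag and mount path are assumptions,
# not something this PR defines. You may also need to make the server bind
# to 0.0.0.0 inside the container so the published port is reachable.
docker run -p 8080:8080 -v ./models:/models \
  -e LLAMA_ARG_MODEL=/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -e LLAMA_ARG_CTX_SIZE=1024 \
  -e LLAMA_ARG_N_PARALLEL=2 \
  -e LLAMA_ARG_ENDPOINT_METRICS=1 \
  ghcr.io/ggerganov/llama.cpp:server
```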

If the same option is specified both as an environment variable and as a command-line argument, the environment variable takes priority:

```sh
LLAMA_ARG_MODEL=my_model.gguf ./llama-server -m another_model.gguf
# Expected behavior: my_model.gguf is loaded
# (in other words, "-m another_model.gguf" is ignored)
```
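For reference, the precedence rule boils down to a check like the minimal C++ sketch below. The helper name is hypothetical; the actual implementation lives in the argument-parsing code, and this only illustrates the described behavior:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper illustrating the precedence rule described above:
// if the environment variable is set, it overrides the CLI value.
static std::string resolve_arg(const char * env_name, const std::string & cli_value) {
    if (const char * env_val = std::getenv(env_name)) {
        return env_val;   // env var wins
    }
    return cli_value;     // fall back to the command-line argument
}
```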

On HF Inference Endpoints, these variables can be set from the "Settings" tab. (In the near future, these variables will be exposed as pre-defined input fields in the UI.)

[screenshot: environment variables in the HF Inference Endpoints "Settings" tab]

ngxson merged commit fc54ef0 into master on Aug 21, 2024
ngxson mentioned this pull request on Sep 5, 2024
ngxson deleted the xsn/server_env_var branch on September 10, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024

server : support reading arguments from environment variables (ggml-org#9105)

* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024

server : support reading arguments from environment variables (ggml-org#9105)

* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request on Feb 25, 2025

server : support reading arguments from environment variables (ggml-org#9105)

* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
server : handle models with missing EOS token (ggml-org#8997)

server : fix segfault on long system prompt (ggml-org#8987)
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment

server : init stop and error fields of the result struct (ggml-org#9026)

server : fix duplicated n_predict key in the generation_settings (ggml-org#8994)

server : support reading arguments from environment variables (ggml-org#9105)
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var

server : add some missing env variables (ggml-org#9116)
* server : add some missing env variables
* add LLAMA_ARG_HOST to server dockerfile
* also add LLAMA_ARG_CONT_BATCHING

Credits go to the respective authors.
Not a single merge conflict occurred.
Compiled, then tested with no bugs found.
