
Bug: Difficulties Using LLaMa.cpp Server and --prompt-cache [FNAME] (not supported?) #9135

Closed
darien-schettler opened this issue Aug 22, 2024 · 16 comments
Labels
bug-unconfirmed · high severity (used to report high severity bugs in llama.cpp; malfunctioning hinders important workflows) · stale

Comments

@darien-schettler

darien-schettler commented Aug 22, 2024

What happened?

As seen here:

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

The llama.cpp server should support --prompt-cache [FNAME].

I have not been able to get this feature to work.
I have tried workarounds such as using llama-cli to generate the prompt cache and then specifying that file when launching llama-server.

Is there a minimal, reproducible code snippet that shows this feature working? Is it implemented?

Thanks in advance.

Name and Version

CLI call used to generate the prompt cache:

version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

$ ./llama-cli -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-all --file "/.../prompt_files/pirate_prompt.txt"

Server call (after generating prompt_cache.bin with llama-cli; this prompt file is the same as the one above, minus the final user input, which will be sent via the request):

version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

$ ./llama-server -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" --host 0.0.0.0 --port 8080 -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-ro --keep -1 -f "/.../prompt_files/pirate_prompt_server.txt"

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

@darien-schettler added the bug-unconfirmed and high severity labels on Aug 22, 2024
@ggerganov
Member

The argument is ignored by llama-server. It would be nice to implement, but it's not very clear how, since it has to consider multiple parallel slots. Or, at the very least, assert that -np 1 is used.

@ngxson
Collaborator

ngxson commented Aug 23, 2024

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint, which allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

@darien-schettler
Author

darien-schettler commented Aug 23, 2024

Ah, I see. So the documentation I linked to is just out of date (or it is correct in saying the argument is supported, but the functionality is not).

I can update that if you’d like. Thank you for clarifying.

As a follow-up: on a powerful CPU-only machine, if I have a pipeline of 3-4 steps (each requiring different system prompts and few-shot examples (message history)), is there a way to cache all of those ahead of time?

This pipeline is run end-to-end as a single call by many users, but it's not a continuous conversation.

I don’t want to pay the prompt-processing latency for each step (as it takes an inordinate amount of time even on a powerful machine). Ideally, I would have one model with the ability to dynamically determine which cached prompt is the best match and go from there.

If that’s not possible, do you have any suggestions for workarounds that minimize latency?

@darien-schettler
Author

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint, which allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

I will experiment with this today and see if I can make it work. I will report back if you’d like.

@ngxson
Collaborator

ngxson commented Aug 23, 2024

In this case, you don't even need to write the prompt cache to disk. You can use the cache_prompt option:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "You are you"}
  ],
  "cache_prompt": true
}

The first time, all tokens will be processed and kept in cache (so it takes time).

The second time, it will reuse the cached tokens:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "This is another question"}
  ],
  "cache_prompt": true
}

This time, only "This is another question" will be processed; the system prompt is already cached.
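
For reference, a minimal sketch of sending such a body to the server's OpenAI-compatible chat endpoint with curl might look like this (the [IP], [PORT], and prompt text are illustrative placeholders, not values from this thread):

curl --request POST \
  --url 'http://[IP]:[PORT]/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your task is ..."},
      {"role": "user", "content": "This is another question"}
    ],
    "cache_prompt": true
  }'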

@darien-schettler
Author

In this case, you don't even need to write the prompt cache to disk. You can use the cache_prompt option:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "You are you"}
  ],
  "cache_prompt": true
}

The first time, all tokens will be processed and kept in cache (so it takes time).

The second time, it will reuse the cached tokens:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. You task is......."},
    {"role": "user", "content": "This is another question"}
  ],
  "cache_prompt": true
}

This time, only "This is another question" will be processed; the system prompt is already cached.

I will experiment with this too. Thanks in advance.

If any of the solutions above fixes things, I will close this issue with a comment detailing my experience.

@Devbrat-Dev

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint, which allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

This works for me, as my system only has a CPU, and processing longer prompts takes too much time.

I start llama-server with the --slot-save-path PATH option to specify the path for saving the slot KV cache.

Before terminating the llama-server, I save the KV cache by sending an API request using curl:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

To restore the KV cache when starting llama-server next time, send an API request using curl:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=restore' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

@darien-schettler
Author

Hi all, I've been able to get the initial setup working by simply relying on cache_prompt=true and triggering all 20 known prompts (for the various tools/endpoints).

@Devbrat-Dev - I will try the method you mentioned and report back. After that I will close this issue.

Thanks for all the support!

@darien-schettler
Author

darien-schettler commented Sep 5, 2024

EDIT: It appears I needed to create the cache file myself first. I found the error in the logging output (I missed it initially because the verbosity flag I passed was overwhelming the logs). To fix it, I just ran touch [PATH INCLUDING FILENAME cache.bin] before starting the server. It works now! Thanks! Closing this issue.


@Devbrat-Dev - I tried this but unfortunately I'm getting n_written: 0 ... and the file is not being created.

server launch:

 ./llama-server -m [FILEPATH] --verbose --host [HOST] --port [PORT] -c 8192 --mlock -t $(nproc) --slot-save-path [path/like/this/to/dir/]

chat message:

  • Has a system message and a chat history (system message + few-shot prompt)
  • Includes the new chat question
  • Sends cache_prompt as an extra_body parameter
  • Returns the assistant response after a long processing time
  • Runs much faster when issuing similar follow-ups that use the same base (system + few-shot)

request:

curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
}'

response:

{
  "id_slot": 0,
  "filename": "cache.bin",
  "n_saved": 243,
  "n_written": 0,
  "timings": {"save_ms": 0.068}
}

The n_saved value updates, but unfortunately the file isn't created or appended to. My prompt requests use the OpenAI-compatible server endpoint, and I'm passing cache_prompt: true as an extra_body parameter.

@darien-schettler
Author

Thanks for the support. The tl;dr is that currently the argument is ignored by llama-server.

That said, you can work around it by using cache_prompt in requests to the server, paired with --slot-save-path when starting the server, and then leveraging POST requests to the /slots endpoint to save, restore, or delete a slot's KV cache.

With only 1 slot this should be equivalent to what I was trying to achieve initially.

Thanks for the support everyone.
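
For anyone landing here later, a rough end-to-end sketch of that workaround (the model path, cache directory, [IP], and [PORT] are illustrative placeholders, not values from this thread):

# 1. Start the server with a directory for slot snapshots.
./llama-server -m model.gguf -c 8192 --slot-save-path /path/to/slot_cache/

# 2. Send the long prompt once with "cache_prompt": true so slot 0's KV cache is populated.
curl --request POST \
  --url 'http://[IP]:[PORT]/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}], "cache_prompt": true}'

# 3. Persist slot 0's KV cache to disk.
curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{"filename": "cache.bin"}'

# 4. After a server restart (with the same --slot-save-path), load it back before sending requests.
curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=restore' \
  --header 'Content-Type: application/json' \
  --data '{"filename": "cache.bin"}'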

@dhandhalyabhavik

The --prompt-cache option is not directly supported by the server. You can use prompt caching with the /slots endpoint, which allows you to save and load the KV cache for each slot.

See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file

Hey @ngxson

Please provide an example of KV caching working with slots.
For me, the REST queries execute without any issue, but after restoration, when I check the content of the slots with a GET call to /slots, I can see only the last executed prompt. I don't know what is happening with the restoration; technically, it should replace the old prompt with the .bin file's cached prompt.

@darien-schettler
Author

Reopening as I am encountering issues with this method.

When attempting to serialize all of the starting prompts, the slots fill up and begin to evict older content. I didn’t notice this initially.

I am OK with having a massive file on disk (and a slightly slower load), but I need everything to be persistent (all interactions should be cached and checked against).

Any help would be greatly appreciated.

@darien-schettler
Author

Basically, is there any way to do the following:

(1) Cache an arbitrarily large number of system prompts (dozens, if not hundreds, of 1-15k-token prompts) that can be searched and restored with minimal latency (i.e. what can be done with llama-cli and --prompt-cache).

(2) Update this cache with any/all inputs to the server (all user inputs should be shared, cached, and appended to the existing system prompts, if that makes sense). Every interaction a user has with our LLM(s) is stateless and rigidly defined, and as such we want to be able to share the cache across all historical conversations.

(3) Be able to save/load the cache from disk.


(1) and (3) are critical, and (2) is something we will need sooner or later. I feel like this is all possible with llama.cpp, so it is simply an implementation issue. I'm open to workarounds as well (using the llama.cpp server), or, if it's possible without too much effort to make our own server that wraps llama-cli functionality (without introducing overhead), I'd be open to that too.


What is the recommended path forward for this situation?

@dhandhalyabhavik

I have solved cases (1) and (3) using slot saving and restoring. Let's say you have 100 long prompts. You can cache them using slot save as .bin files: L1.bin for the first long prompt, L2.bin for the second, and so on.

If you want to continue from the prompt cached in L1.bin, just restore it into slot 0/1 (whichever is available) and continue sending your queries.
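
A rough sketch of that warm-up loop is below; it assumes the server runs a single slot, was started with --slot-save-path, and that jq is available to build the JSON bodies ([IP], [PORT], the prompts/ directory, and the L<N>.bin naming are illustrative, not from this thread):

# For each long prompt file: process it once so the slot's KV cache is built,
# then snapshot slot 0 to its own .bin file (L1.bin, L2.bin, ...).
i=1
for f in prompts/*.txt; do
  # n_predict is kept tiny; the goal is only to force prompt processing into the cache.
  curl -s -X POST "http://[IP]:[PORT]/completion" \
    -H 'Content-Type: application/json' \
    -d "$(jq -n --rawfile p "$f" '{prompt: $p, n_predict: 1, cache_prompt: true}')"
  curl -s -X POST "http://[IP]:[PORT]/slots/0?action=save" \
    -H 'Content-Type: application/json' \
    -d "{\"filename\": \"L${i}.bin\"}"
  i=$((i+1))
done

# Later, before querying against the first long prompt again, restore its snapshot:
curl -s -X POST "http://[IP]:[PORT]/slots/0?action=restore" \
  -H 'Content-Type: application/json' \
  -d '{"filename": "L1.bin"}'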

@darien-schettler
Author

@dhandhalyabhavik - I appreciate the response. Quick question.

Would your method not require knowing ahead of time what prompt comes next and then loading that slot cache ahead of the LLM call?

While this would likely be faster than processing the whole prompt, it adds a manual step (new logic) where we have to load the cache into a slot, and the slot loading itself takes some amount of time.

Regardless of the above, this sounds like a possible (if non-ideal) workaround. I’ll investigate further.

🙏 Thank you!

@github-actions bot added the stale label on Nov 17, 2024
Contributor

github-actions bot commented Dec 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
