Bug: Difficulties Using LLaMa.cpp Server and --prompt-cache [FNAME] (not supported?) #9135
The `--prompt-cache` argument is ignored by llama-server.
The server handles prompt caching through the slot save/restore endpoints instead. See usage in the docs: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-slotsid_slotactionsave-save-the-prompt-cache-of-the-specified-slot-to-a-file
Ah I see. So the documentation I linked to is just out-of-date (or it is correct in saying the argument is supported, but the functionality is not). I can update that if you'd like. Thank you for clarifying.

As a follow-up: on a powerful CPU-only machine, if I have a pipeline of 3-4 steps (each requiring different system prompts and few-shot examples in the message history), is there a way to cache all of those ahead of time? This pipeline is run end-to-end as a single call by many users, but it's not a continuous conversation. I don't want to pay the prompt-processing latency for each step, as it takes an inordinate amount of time even on a powerful machine. Ideally I'd have one model with the ability to determine dynamically which cached prompt is the best match and go from there. If that's not possible, do you have any suggestions for workarounds that minimize latency?
I will experiment with this today and see if I can make it work. I will report back if you'd like.
In this case, you don't even need to write the prompt cache to disk. You can set `cache_prompt: true` in the request. The first time, all tokens will be processed and kept in the cache (so it takes time). The second time, the cached tokens are reused and only the new tokens at the end of the prompt are processed.
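For anyone following along, here is a minimal sketch of what such a request could look like, assuming a server listening on localhost:8080 and the `/completion` endpoint; the endpoint name and the `cache_prompt` field come from the server README rather than from this thread, and the prompt text is a placeholder:

```bash
# First request: the full prompt is processed and kept in the slot's KV cache (slow).
curl --request POST \
  --url 'http://localhost:8080/completion' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "<long system prompt + few-shot examples>",
    "n_predict": 64,
    "cache_prompt": true
  }'

# Second request sharing the same prefix: cached tokens are reused,
# only the new suffix is processed.
curl --request POST \
  --url 'http://localhost:8080/completion' \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "<long system prompt + few-shot examples><new user message>",
    "n_predict": 64,
    "cache_prompt": true
  }'
```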
I will experiment with this too. Thanks in advance. If any of the solutions above fixes things, I will close this issue with a comment detailing my experience.
This works for me, as my system only has a CPU and processing longer prompts takes too much time. I start the llama-server, and before terminating it I save the KV cache of slot 0 to a file:

```bash
curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=save' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
  }'
```

To restore the KV cache when starting the server again:

```bash
curl --request POST \
  --url 'http://[IP]:[PORT]/slots/0?action=restore' \
  --header 'Content-Type: application/json' \
  --data '{
    "filename": "cache.bin"
  }'
```
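Note that, if I read the server README correctly, these save/restore endpoints only take effect when the server was launched with `--slot-save-path`. A hedged sketch of such a launch (model path and cache directory are placeholders):

```bash
# --slot-save-path enables /slots/{id}?action=save|restore; the "filename" in the
# requests above is resolved relative to this directory.
./llama-server -m "/path/to/model.gguf" \
  --host 0.0.0.0 --port 8080 \
  -c 4096 -np 1 \
  --slot-save-path "/path/to/slot_cache/"
```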
Hi all, I've been able to get the initial setup working by simply relying on the `cache_prompt` approach suggested above. @Devbrat-Dev - I will try the method you mentioned and report back; after that I will close this issue. Thanks for all the support!
EDIT: It appears I needed to create the cache file myself first. I found the error in the logging output (I missed it initially because the verbosity flag I passed made the output overwhelming). To fix it, I simply created the cache file ahead of time.

@Devbrat-Dev - I tried this, but unfortunately I'm running into a problem (server launch command, chat message request, and response logs were attached here but are omitted). The n_saved value updates, but unfortunately the file isn't created or being added to. My prompt requests use the OpenAI-compatible server endpoint and I'm passing the `cache_prompt` parameter suggested above.
Thanks for the support. The tl;dr is that currently the `--prompt-cache` argument is ignored by llama-server. That said, you can work around it by using the slot save/restore endpoints. With only 1 slot this should be equivalent to what I was trying to achieve initially. Thanks everyone.
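To make that workaround concrete, a rough end-to-end sketch under the same assumptions as above (single slot via `-np 1`, `--slot-save-path` set, localhost:8080, placeholder prompt); this is an illustration rather than the exact commands used in this thread:

```bash
# 1. Warm the single slot once with the long prompt, keeping it in the KV cache.
curl -s -X POST 'http://localhost:8080/completion' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "<long system prompt>", "n_predict": 1, "cache_prompt": true}'

# 2. Persist slot 0's KV cache to disk before shutting the server down.
curl -s -X POST 'http://localhost:8080/slots/0?action=save' \
  -H 'Content-Type: application/json' \
  -d '{"filename": "cache.bin"}'

# 3. After a restart, restore the slot and keep sending requests that share the prefix.
curl -s -X POST 'http://localhost:8080/slots/0?action=restore' \
  -H 'Content-Type: application/json' \
  -d '{"filename": "cache.bin"}'
```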
Hey @ngxson, please provide an example of KV caching working with slots.
Reopening, as I am encountering issues with this method. When attempting to serialize all of the starting prompts, the slots fill up and begin to evict older content, which I didn't notice initially. I am OK with having a massive file on disk (and a slightly slower load), but I need everything to be persistent (all interactions should be cached and checked against). Any help would be greatly appreciated.
Basically, is there any way to do the following:

(1) Cache an arbitrarily large number of system prompts (dozens if not hundreds of 1-15k-token prompts) that can be searched and restored with minimal latency (i.e. what can already be done with llama-cli and --prompt-cache).

(2) Update this cache with any/all inputs to the server (all user inputs should be shared, cached, and appended to the existing system prompts, if that makes sense). Every interaction a user has with our LLM(s) is stateless and rigidly defined, so we want to be able to share the cache across all historical conversations.

(3) Save/load the cache from disk.

(1) and (3) are critical, and (2) is something we need sooner or later. I feel like this is all possible with llama.cpp, so it is simply an implementation issue. I'm open to workarounds (using llama-server), or, if it's possible without too much effort, to building our own server that wraps llama-cli functionality (without introducing overhead). What is the recommended path forward for this situation?
I have solved cases 1 and 3 using slot saving and restoring. Let's say you have 100 long prompts. You can cache them using slot save as .bin files: L1.bin for the first long prompt, L2.bin for the second, and so on. If you want to continue from the prompt in L1.bin, just restore it on slot 0 or 1 (whichever is available) and continue sending your queries.
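A rough bash sketch of that scheme, assuming one prompt per file under prompts/, jq available for JSON-escaping the prompt text, and a server started with a single slot and `--slot-save-path` (all file and path names here are made up for illustration):

```bash
#!/usr/bin/env bash
SERVER='http://localhost:8080'

# Cache each long prompt once: process it into slot 0, then save that slot as L<i>.bin.
i=1
for f in prompts/*.txt; do
  jq -Rs '{prompt: ., n_predict: 1, cache_prompt: true}' "$f" |
    curl -s -X POST "$SERVER/completion" \
      -H 'Content-Type: application/json' --data @-
  curl -s -X POST "$SERVER/slots/0?action=save" \
    -H 'Content-Type: application/json' \
    -d "{\"filename\": \"L${i}.bin\"}"
  i=$((i + 1))
done

# Later: restore whichever prompt you want to continue from, then query as usual.
curl -s -X POST "$SERVER/slots/0?action=restore" \
  -H 'Content-Type: application/json' \
  -d '{"filename": "L1.bin"}'
```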
@dhandhalyabhavik - I appreciate the response. Quick question: would your method not require knowing ahead of time which prompt comes next, and then loading that slot cache before the LLM call? While this would likely be faster than processing the whole prompt, it adds a manual step (new logic) where we have to load the cache into a slot, and the slot loading itself takes some amount of time. Regardless of the above, this sounds like a possible (non-ideal) workaround. I'll investigate further. 🙏 Thank you!
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
As seen here:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
The llama.cpp server should support --prompt-cache [FNAME]
I have not been able to get this feature to work.
I have tried workarounds such as using llama-cli to generate the prompt cache and then specifying this file for llama-server.
Is there some minimally reproducible code snippet that shows this feature working? Is it implemented?
Thanks in advance.
Name and Version
CLI Call to generate prompt cache.
version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
$ ./llama-cli -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-all --file "/.../prompt_files/pirate_prompt.txt"
Server call (after generating prompt_cache.bin with llama-cli). The prompt file here is the same as the one above, minus the final user input, which will be sent via the request.
version: 3613 (fc54ef0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
$ ./llama-server -m "/.../Meta-Llama-3.1-8B-Instruct-Q6_K.gguf" --host 0.0.0.0 --port 8080 -c 4096 --verbose-prompt -co --mlock -t $(nproc) --prompt-cache "/.../prompt_cache/prompt_cache.bin" --prompt-cache-ro --keep -1 -f "/.../prompt_files/pirate_prompt_server.txt"
What operating system are you seeing the problem on?
Linux
Relevant log output
No response