Create a C-style API similar to whisper.cpp #77
Conversation
Force-pushed from fb6a512 to bb0600c
In my fork I added this struct to bundle up all the relevant data:

```cpp
struct llama_state {
    gpt_vocab vocab;
    llama_model model;

    struct {
        int64_t t_load_us    = -1;
        int64_t t_sample_us  = -1;
        int64_t t_predict_us = -1;
    } timing;
};
```
Yes, this is a step in the right direction, but the exposed things are not the right ones.
The llama_layer and llama_model should not be publicly visible.
You have to wrap them in llama_context or llama_state, which is only forward-declared in the llama.h file and defined in llama.cpp.
See the whisper.cpp C-style API for doing it the correct way:
https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h
If you give it another try, make sure to start from latest master since things are changing there.
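For reference, a minimal sketch of the opaque-handle pattern being described here, modeled on the whisper.cpp header linked above (the function names and the exact contents of the context are illustrative, not the final API):

```cpp
// llama.h -- public header: only a forward declaration, so the layout of
// llama_context (and the llama_model / llama_layer inside it) stays private.
struct llama_context;  // opaque to API users

struct llama_context * llama_init_from_file(const char * path_model);
void                   llama_free(struct llama_context * ctx);

// llama.cpp -- private implementation: the full definition lives here only.
// (gpt_vocab and llama_model are the existing types from utils.h / the model code.)
struct llama_context {
    gpt_vocab   vocab;
    llama_model model;       // never exposed through the public header
    int64_t     t_load_us    = -1;
    int64_t     t_sample_us  = -1;
    int64_t     t_predict_us = -1;
};
```

Callers then only ever hold a `llama_context *`, so the internal structs can change without breaking the public API.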
Force-pushed from e463b4f to 3a561bb
@ggerganov I have made the changes. Please let me know what you think.
Force-pushed from c9904e5 to 6ff3e64
CMakeLists.txt (outdated)
llamalib already contains llama.cpp, utils.cpp and utils.h
Updated.
CMakeLists.txt (outdated)
missing llama.h, utils.cpp and utils.h
Updated CMakeLists.
j-f1 left a comment:
Some feedback on the API:
llama.h (outdated)
Why does llama_init_context_with_prompt take a llama_context& while llama_init_from_params returns a llama_context*? Can you make these have a similar API, or rename them to clarify how they differ?
Removed this confusion in my second pass of refactoring. I feel it is a lot cleaner now. Please take a look.
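(For illustration, a consistent pointer-based shape for these two entry points might look like the sketch below; llama_context_params and llama_ingest_prompt are hypothetical names for this example, not necessarily the ones used in the PR.)

```cpp
struct llama_context;         // opaque handle, as discussed above
struct llama_context_params;  // hypothetical parameter struct

// Both entry points return a pointer, so creation and ownership read the
// same way for every caller.
struct llama_context * llama_init_from_params(const struct llama_context_params * params);

// Feeding a prompt becomes an operation on an already-created context rather
// than a second "init" function that takes a llama_context& by reference.
bool llama_ingest_prompt(struct llama_context * ctx, const char * prompt);
```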
Force-pushed from 6ff3e64 to bd4476d
@j-f1 @Green-Sky @ggerganov I have done another pass at refactoring and also fixed a few logical bugs that left interactive mode broken in my original version (among other things). I have verified that interactive mode now works as intended and inference remains just as fast as before. I have also rebased onto the latest master branch. Please take another look. Thanks!
Force-pushed from 41b6af6 to 71f75c1
@thomasantony For now, leave it like this and let me apply the necessary changes on top of yours to demonstrate what I have in mind - probably tomorrow or the day after.
Okay, thanks. In the meantime, I will rebase the new changes from the master branch onto this branch.
Force-pushed from f609ff4 to 5a5d552
Force-pushed from 1cb574c to f0aea33
- Also single token converter
- executable is now "main" and library is "llama"
Force-pushed from f0aea33 to 5195fed
Superseded by #370
Update README.md: format output samples
* Adding q6_0 - basics + AVX2/Zen4 working
* Adding q6_0: CUDA dequantize works, but not mmvq
* Adding q6_0: CUDA mmvq works
* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache
* Add q6_0 to CPU flash attention

  Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache gives about the same PPL as q8_0 K-cache and q4_0 V-cache, while needing the exact same RAM. I.e., what was the point?
* q6_0: slightly better kv-cache result

  Better than q8_0+q4_0, but not as good as q8_0+iq4_nl
* q6_0: works on ARM_NEON
* q6_0: dequantize works on Metal, but not vector dot product
* q6_0: it now works on Metal

  Outperforms q5_0 by a significant margin. E.g.

  | model         | size     | params | backend | ngl | threads | test  | t/s           |
  | ------------- | -------: | -----: | ------- | --: | ------: | ----- | ------------: |
  | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  44.02 ± 0.08 |
  | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  40.13 ± 0.12 |
  | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 500.55 ± 0.32 |
  | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 448.02 ± 0.27 |
* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
This change makes it easier to use this code as a library, say, to build Python bindings on top of it. It extracts the following functions out into llama.cpp:

- llama_model_load
- llama_eval
- llama_model_quantize

It also moves the relevant struct definitions to llama.h. This, for example, helps avoid redefinition of llama_hparams in quantize.cpp. Please let me know if you have any suggestions to improve this. See here for an example of this library structure in use.
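For context, here is a rough sketch of the kind of shared header the description implies, assuming signatures close to the original main.cpp / quantize.cpp code; the exact declarations in the PR may differ:

```cpp
// llama.h (sketch) -- shared declarations so main.cpp and quantize.cpp both
// include one definition of llama_hparams instead of each redefining it.
#pragma once

#include <cstdint>
#include <string>
#include <vector>

struct gpt_vocab;    // tokenizer vocabulary, defined in utils.h
struct llama_model;  // full model weights, defined with the implementation

struct llama_hparams {
    int32_t n_vocab = 32000;
    int32_t n_ctx   = 512;
    int32_t n_embd  = 4096;
    // ... remaining hyperparameters (n_head, n_layer, ...)
};

// Functions extracted into llama.cpp (signatures are approximate):
bool llama_model_load(const std::string & fname, llama_model & model,
                      gpt_vocab & vocab, int n_ctx);

bool llama_eval(const llama_model & model, int n_threads, int n_past,
                const std::vector<int32_t> & embd_inp,   // input token ids
                std::vector<float> & embd_w, size_t & mem_per_token);

bool llama_model_quantize(const std::string & fname_inp,
                          const std::string & fname_out, int itype);
```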