Create a C-style API similar to whisper.cpp #77
Conversation
Force-pushed from fb6a512 to bb0600c
In my fork I added this struct to bundle up all the relevant data:

```cpp
struct llama_state {
    gpt_vocab vocab;
    llama_model model;

    struct {
        int64_t t_load_us    = -1;
        int64_t t_sample_us  = -1;
        int64_t t_predict_us = -1;
    } timing;
};
```
Yes, this is a step in the right direction, but the exposed things are not the right ones.
The llama_layer and llama_model should not be publicly visible.
You have to wrap them in llama_context or llama_state, which is only forward-declared in the llama.h file and defined in llama.cpp.
See the whisper.cpp C-style API for doing it the correct way:
https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h
If you give it another try, make sure to start from latest master since things are changing there.
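For reference, a minimal sketch of the opaque-handle pattern being described here, modeled on the whisper.cpp header linked above (the function names and the exact contents of the context are illustrative, not the final API):

```cpp
// llama.h -- public header: only a forward declaration, so the layout of
// llama_context (and the llama_model / llama_layer inside it) stays private.
struct llama_context;  // opaque to API users

struct llama_context * llama_init_from_file(const char * path_model);
void                   llama_free(struct llama_context * ctx);

// llama.cpp -- private implementation: the full definition lives here only.
// (gpt_vocab and llama_model are the existing types from utils.h / the model code.)
struct llama_context {
    gpt_vocab   vocab;
    llama_model model;       // never exposed through the public header
    int64_t     t_load_us    = -1;
    int64_t     t_sample_us  = -1;
    int64_t     t_predict_us = -1;
};
```

Callers then only ever hold a `llama_context *`, so the internal structs can change without breaking the public API.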
Force-pushed from e463b4f to 3a561bb
@ggerganov I have made the changes. Please let me know what you think.
Force-pushed from c9904e5 to 6ff3e64
CMakeLists.txt (outdated)
llamalib already contains llama.cpp, utils.cpp and utils.h
Updated.
CMakeLists.txt (outdated)
missing llama.h, utils.cpp and utils.h
Updated CMakeLists.
j-f1 left a comment:
Some feedback on the API:
llama.h (outdated)
Why does llama_init_context_with_prompt take a llama_context& while llama_init_from_params returns a llama_context*? Can you make these have a similar API, or rename them to clarify how they differ?
Removed this confusion in my second pass of refactoring. I feel it is a lot cleaner now. Please take a look.
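(For illustration, a consistent pointer-based shape for these two entry points might look like the sketch below; llama_context_params and llama_ingest_prompt are hypothetical names for this example, not necessarily the ones used in the PR.)

```cpp
struct llama_context;         // opaque handle, as discussed above
struct llama_context_params;  // hypothetical parameter struct

// Both entry points return a pointer, so creation and ownership read the
// same way for every caller.
struct llama_context * llama_init_from_params(const struct llama_context_params * params);

// Feeding a prompt becomes an operation on an already-created context rather
// than a second "init" function that takes a llama_context& by reference.
bool llama_ingest_prompt(struct llama_context * ctx, const char * prompt);
```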
Force-pushed from 6ff3e64 to bd4476d
@j-f1 @Green-Sky @ggerganov I have done another pass at refactoring and also fixed a few logical bugs that left interactive mode broken in my original version (among other things). I have verified that interactive mode now works as intended and inference remains just as fast as before. I have also rebased onto the latest master branch. Please take another look. Thanks!
Force-pushed from 41b6af6 to 71f75c1
@thomasantony For now, leave it like this and let me apply the necessary changes on top of yours to demonstrate what I have in mind - probably tomorrow or the day after.
Okay, thanks. In the meantime, I will rebase the new changes from the master branch onto this branch.
Force-pushed from f609ff4 to 5a5d552
Force-pushed from 1cb574c to f0aea33
- Also single token converter
- executable is now "main" and library is "llama"
Force-pushed from f0aea33 to 5195fed
Superseded by #370
Update README.md: format output samples
* Adding q6_0 - basics + AVX2/Zen4 working
* Adding q6_0: CUDA dequantize works, but not mmvq
* Adding q6_0: CUDA mmvq works
* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache
* Add q6_0 to CPU flash attention

  Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache gives about the same PPL as q8_0 K-cache and q4_0 V-cache, while needing the exact same RAM. I.e., what was the point?
* q6_0: slightly better kv-cache result

  Better than q8_0+q4_0, but not as good as q8_0+iq4_nl
* q6_0: works on ARM_NEON
* q6_0: dequantize works on Metal, but not vector dot product
* q6_0: it now works on Metal

  Outperforms q5_0 by a significant margin. E.g.

  | model         | size     | params | backend | ngl | threads | test  | t/s           |
  | ------------- | -------: | -----: | ------- | --: | ------: | ----- | ------------: |
  | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  44.02 ± 0.08 |
  | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | tg128 |  40.13 ± 0.12 |
  | llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 500.55 ± 0.32 |
  | llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal   | 100 |       4 | pp512 | 448.02 ± 0.27 |
* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
This change makes it easier to use this code as a library, say, to build Python bindings on top of it. It extracts the following functions out into llama.cpp:

- llama_model_load
- llama_eval
- llama_model_quantize

It also moves the relevant struct definitions to llama.h. This, for example, helps avoid redefinition of llama_hparams in quantize.cpp. Please let me know if you have any suggestions to improve this. See here for an example of this library structure in use.
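For context, here is a rough sketch of the kind of shared header the description implies, assuming signatures close to the original main.cpp / quantize.cpp code; the exact declarations in the PR may differ:

```cpp
// llama.h (sketch) -- shared declarations so main.cpp and quantize.cpp both
// include one definition of llama_hparams instead of each redefining it.
#pragma once

#include <cstdint>
#include <string>
#include <vector>

struct gpt_vocab;    // tokenizer vocabulary, defined in utils.h
struct llama_model;  // full model weights, defined with the implementation

struct llama_hparams {
    int32_t n_vocab = 32000;
    int32_t n_ctx   = 512;
    int32_t n_embd  = 4096;
    // ... remaining hyperparameters (n_head, n_layer, ...)
};

// Functions extracted into llama.cpp (signatures are approximate):
bool llama_model_load(const std::string & fname, llama_model & model,
                      gpt_vocab & vocab, int n_ctx);

bool llama_eval(const llama_model & model, int n_threads, int n_past,
                const std::vector<int32_t> & embd_inp,   // input token ids
                std::vector<float> & embd_w, size_t & mem_per_token);

bool llama_model_quantize(const std::string & fname_inp,
                          const std::string & fname_out, int itype);
```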