
sampling : refactor init to use llama_sampling_params #3696


Merged: 6 commits merged into master on Oct 20, 2023

Conversation


@ggerganov ggerganov commented Oct 20, 2023

Change llama_sampling_init()

A step towards integrating the common/sampling module into llama.cpp

// old
struct llama_sampling_context * llama_sampling_init(const struct gpt_params & params);

// new
struct llama_sampling_context * llama_sampling_init(const struct llama_sampling_params & params);
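For context, a minimal call-site sketch of the migration (illustrative only; it assumes the sampling parameters are carried in a gpt_params::sparams member, as in the common library around this change):

// illustrative migration sketch - not part of the diff
#include "common.h"
#include "sampling.h"

void example_init(const gpt_params & params) {
    // before: llama_sampling_init(params);
    // after:  only the sampling sub-struct is passed
    struct llama_sampling_context * ctx_sampling = llama_sampling_init(params.sparams);

    // ... sample with llama_sampling_sample() / llama_sampling_accept() ...

    llama_sampling_free(ctx_sampling);
}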

Change repetition penalty API

This is more efficient, since the repetition, frequency and presence penalties are now applied in a single call instead of two.

// old
    LLAMA_API void llama_sample_repetition_penalty(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   last_tokens_size,
                           float   penalty);

    LLAMA_API void llama_sample_frequency_and_presence_penalties(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   last_tokens_size,
                           float   alpha_frequency,
                           float   alpha_presence);

// new
    LLAMA_API void llama_sample_repetition_penalties(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   penalty_last_n,
                           float   penalty_repeat,
                           float   penalty_freq,
                           float   penalty_present);
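As a rough call-site sketch (the surrounding variables and penalty values are assumed, not taken from the diff):

// illustrative: one call now covers what previously required
// llama_sample_repetition_penalty() followed by
// llama_sample_frequency_and_presence_penalties()
#include <algorithm>
#include <vector>
#include "llama.h"

static void apply_penalties_example(
        struct llama_context           * ctx,
        llama_token_data_array         * candidates,
        const std::vector<llama_token> & last_tokens) {
    const size_t penalty_last_n = std::min(last_tokens.size(), (size_t) 64);

    llama_sample_repetition_penalties(
            ctx,
            candidates,
            last_tokens.data() + last_tokens.size() - penalty_last_n,
            penalty_last_n,
            1.10f,   // penalty_repeat
            0.00f,   // penalty_freq
            0.00f);  // penalty_present
}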

Sampling context keeps fewer previous tokens

The prev array used to have a size of n_ctx. Since it is only used for applying repetition penalties and for checking antiprompts, it does not have to be larger than params.penalty_last_n (a.k.a. repeat_last_n).
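A minimal sketch of the bounded-history idea (illustrative, not the PR's exact implementation):

// illustrative only: a fixed-capacity window of recently accepted tokens
#include <vector>
#include "llama.h"

struct token_history {
    size_t                   capacity; // e.g. the penalty window size
    std::vector<llama_token> prev;

    void accept(llama_token id) {
        if (!prev.empty() && prev.size() >= capacity) {
            prev.erase(prev.begin()); // drop the oldest token
        }
        prev.push_back(id);
    }
};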

Delete obsolete examples

  • embd-input
  • gptneox-wip

Other changes

  • Moved the grammar string from gpt_params to llama_sampling_params
  • Renamed repetition, frequency and presence penalty params in the code. API names are unchanged
  • Hid the prev vector in llama_sampling_context behind API calls for future-proofing (see the sketch after this list)
  • Applied the server patch from #3661 ("Fixed segfault when prompt was too long") (bbc0b7c)
  • OCD indentations and renaming as usual
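A consumer-side sketch of the hidden prev vector (the accessor names llama_sampling_last and llama_sampling_prev_str are assumed from common/sampling at the time; treat them as illustrative):

// illustrative: callers query the sampling context instead of touching prev directly
#include <cstdio>
#include <string>
#include "sampling.h"

void log_recent_tokens(llama_sampling_context * ctx_sampling, llama_context * ctx) {
    const llama_token last   = llama_sampling_last(ctx_sampling);              // last accepted token
    const std::string recent = llama_sampling_prev_str(ctx_sampling, ctx, 32); // up to 32 recent tokens as text

    printf("last token: %d, recent text: %s\n", last, recent.c_str());
}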

🤖 Generated by Copilot at 56ba00b

Summary

🗑️🛠️✨

This pull request refactors the sampling module of the llama library to use a new struct for the sampling parameters, to simplify the interface and the penalty logic, and to make the grammar rules optional. It also removes the outdated and incompatible embd-input example and its related files.

We've cleaned up the code and made it run faster
We've dropped embd-input and sorted the examples
We've refactored the sampling and the gpt_params
So heave away, me hearties, heave away on three


@ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 20, 2023
@ggerganov merged commit d1031cf into master on Oct 20, 2023
ggerganov added a commit that referenced this pull request Oct 22, 2023
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from server ui that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompts longer than ctx as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>
getnamo added a commit to getnamo/Llama-Unreal that referenced this pull request Dec 24, 2023
- repetition penalty update - ggml-org/llama.cpp#3696
- get model wrappers
- temperature -> temp
- threading set separately
- still using eval for now