
sampling : refactor init to use llama_sampling_params #3696


Merged: 6 commits merged into master on Oct 20, 2023

Conversation


@ggerganov ggerganov commented Oct 20, 2023

Change llama_sampling_init()

A step towards integrating the common/sampling module into llama.cpp

// old
struct llama_sampling_context * llama_sampling_init(const struct gpt_params & params);

// new
struct llama_sampling_context * llama_sampling_init(const struct llama_sampling_params & params);
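For context, a minimal call-site sketch of the migration (illustrative only; it assumes the sampling parameters are carried in a gpt_params::sparams member, as in the common library around this change):

// illustrative migration sketch - not part of the diff
#include "common.h"
#include "sampling.h"

void example_init(const gpt_params & params) {
    // before: llama_sampling_init(params);
    // after:  only the sampling sub-struct is passed
    struct llama_sampling_context * ctx_sampling = llama_sampling_init(params.sparams);

    // ... sample with llama_sampling_sample() / llama_sampling_accept() ...

    llama_sampling_free(ctx_sampling);
}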

Change repetition penalty API

This is more efficient, since the repetition, frequency and presence penalties are now applied in a single call instead of two.

// old
    LLAMA_API void llama_sample_repetition_penalty(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   last_tokens_size,
                           float   penalty);

    LLAMA_API void llama_sample_frequency_and_presence_penalties(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   last_tokens_size,
                           float   alpha_frequency,
                           float   alpha_presence);

// new
    LLAMA_API void llama_sample_repetition_penalties(
            struct llama_context * ctx,
          llama_token_data_array * candidates,
               const llama_token * last_tokens,
                          size_t   penalty_last_n,
                           float   penalty_repeat,
                           float   penalty_freq,
                           float   penalty_present);
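As a rough call-site sketch (the surrounding variables and penalty values are assumed, not taken from the diff):

// illustrative: one call now covers what previously required
// llama_sample_repetition_penalty() followed by
// llama_sample_frequency_and_presence_penalties()
#include <algorithm>
#include <vector>
#include "llama.h"

static void apply_penalties_example(
        struct llama_context           * ctx,
        llama_token_data_array         * candidates,
        const std::vector<llama_token> & last_tokens) {
    const size_t penalty_last_n = std::min(last_tokens.size(), (size_t) 64);

    llama_sample_repetition_penalties(
            ctx,
            candidates,
            last_tokens.data() + last_tokens.size() - penalty_last_n,
            penalty_last_n,
            1.10f,   // penalty_repeat
            0.00f,   // penalty_freq
            0.00f);  // penalty_present
}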

Sampling context keeps fewer previous tokens

The prev array used to have a size of n_ctx. Since it is only used for applying repetition penalties and for checking antiprompts, it does not have to be larger than params.penalty_last_n (a.k.a. repeat_last_n).
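A minimal sketch of the bounded-history idea (illustrative, not the PR's exact implementation):

// illustrative only: a fixed-capacity window of recently accepted tokens
#include <vector>
#include "llama.h"

struct token_history {
    size_t                   capacity; // e.g. the penalty window size
    std::vector<llama_token> prev;

    void accept(llama_token id) {
        if (!prev.empty() && prev.size() >= capacity) {
            prev.erase(prev.begin()); // drop the oldest token
        }
        prev.push_back(id);
    }
};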

Delete obsolete examples

  • embd-input
  • gptneox-wip

Other changes

  • Moved the grammar string from gpt_params to llama_sampling_params
  • Renamed repetition, frequency and presence penalty params in the code. API names are unchanged
  • Hid the prev vector in llama_sampling_context behind API calls for future-proofing (see the sketch after this list)
  • Applied the server patch from #3661 ("Fixed segfault when prompt was too long") (bbc0b7c)
  • OCD indentations and renaming as usual
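A consumer-side sketch of the hidden prev vector (the accessor names llama_sampling_last and llama_sampling_prev_str are assumed from common/sampling at the time; treat them as illustrative):

// illustrative: callers query the sampling context instead of touching prev directly
#include <cstdio>
#include <string>
#include "sampling.h"

void log_recent_tokens(llama_sampling_context * ctx_sampling, llama_context * ctx) {
    const llama_token last   = llama_sampling_last(ctx_sampling);              // last accepted token
    const std::string recent = llama_sampling_prev_str(ctx_sampling, ctx, 32); // up to 32 recent tokens as text

    printf("last token: %d, recent text: %s\n", last, recent.c_str());
}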

🤖 Generated by Copilot at 56ba00b

Summary

🗑️🛠️✨

This pull request refactors the sampling module of the llama library to use a new struct for the sampling parameters, to simplify the interface and the penalty logic, and to make the grammar rules optional. It also removes the outdated and incompatible embd-input example and its related files.

We've cleaned up the code and made it run faster
We've dropped embd-input and sorted the examples
We've refactored the sampling and the gpt_params
So heave away, me hearties, heave away on three


@ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 20, 2023
@ggerganov merged commit d1031cf into master on Oct 20, 2023
ggerganov added a commit that referenced this pull request Oct 22, 2023
* implementing parallel decoding in server example

* crash fixed

* save dev progress

* refactored sampling function

* completion endpoint working

* multiple client support

* grammar + no stream completion

* cached prompt support

* chat.mjs support cached prompt + some fixes

* server ui now support multiple clients

* unused change reverted

* fixed timings per slot

* add context swap

* add changes to README.md

* llava multimodal integration

* fixed tokens probs

* add multimodal input - alpha

* refactor code + remove unused comments + improved README.md

* fix compilation errors with llvm

* notify the user from server ui that multimodality is unavailable

* some ci fixes

* fix ci make build undefined ref errors

* fix prompts longer than ctx as proposed in #3639

* fixed premature end due to stop word

* context shift fixed

* fix llava implementation

* sync README.md changes

* readme change

* update api like OpenAI

* multimodal support enabled by default

* fix make build errors

* fix multiple clients

* fix zig build

* new sampling API

* latest changes of sampling API

* server : coding-style normalization

* server : coding-style normalization (part 2)

* server : remove beam-search functionality

* server : bug fix in ingest_images

n_tokens is incremented internally by llama_batch_add

* server : use refs + use llama_batch_clear()

* server : snake case

* server : minor sync

* added thread safe pipeline

* server : batch has to be allocated for n_parallel sequences

* server : no need for atomic int - already using mutex

* server : logs + minor code style

* server : fix multibyte handle in partial response (#3706)

* fix image load + view image in chat

* make : silence stb warnings

* clip : link to ggml, not to llama

* server : fix switch fallthrough

* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)

* server : refactor ctx_sampling init + n_ctx + names

* server : bug fix for prompt caching

* Do not save/load image_data to localStorage

* editorconfig : new line in index.html

* server : completion requests remember slot_id

* Update readme to document multimodal in server

* server : minor style

* Update readme to document multimodal in server

* server : hide ctx_sampling->prev behind API (#3696)

* server : apply fix from #3722

* server : fix slot reuse

* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>
getnamo added a commit to getnamo/Llama-Unreal that referenced this pull request Dec 24, 2023
- repetition penalty update - ggml-org/llama.cpp#3696
- get model wrappers
- temperature -> temp
- threading set separately
- still using eval for now