sampling : refactor init to use llama_sampling_params #3696
Merged
Force-pushed from 88a17d4 to bbc0b7c
Force-pushed from bbc0b7c to 56ba00b
ggerganov added a commit that referenced this pull request on Oct 22, 2023.

ggerganov added a commit that referenced this pull request on Oct 22, 2023:
* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now supports multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alpha
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavailable
* some ci fixes
* fix ci make build undefined ref errors
* fix prompts longer than ctx, as proposed in #3639
* fixed premature end due to stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api like OpenAI
* multimodal support enabled by default
* fix make build errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images (n_tokens is incremented internally by llama_batch_add)
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : batch has to be allocated for n_parallel sequences
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handle in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool

---------

Co-authored-by: FSSRepo <[email protected]>
Co-authored-by: Damian Stewart <[email protected]>
Co-authored-by: Steward Garcia <[email protected]>
Co-authored-by: Jhen-Jie Hong <[email protected]>
Co-authored-by: M. Yusuf Sarıgöz <[email protected]>
getnamo added a commit to getnamo/Llama-Unreal that referenced this pull request on Dec 24, 2023:
- repetition penalty update - ggml-org/llama.cpp#3696
- get model wrappers
- temperature -> temp
- threading set separately
- still using eval for now
Change `llama_sampling_init()`

Step towards integrating `common/sampling` into `llama.cpp`.
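For orientation, here is a minimal sketch of what the refactored init looks like to a caller. The names (`llama_sampling_init`, `llama_sampling_free`, the `penalty_*` fields) follow my reading of the merged `common/sampling.h`; treat them as hedged rather than authoritative:

```cpp
#include "common/sampling.h"  // llama_sampling_params, llama_sampling_context

void example() {
    // build the sampling params struct directly, instead of passing gpt_params
    llama_sampling_params sparams;
    sparams.penalty_last_n = 64;     // history window used for penalties
    sparams.penalty_repeat = 1.10f;  // repetition penalty

    llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);

    // ... sample tokens with ctx_sampling ...

    llama_sampling_free(ctx_sampling);
}
```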
Change repetition penalty API

This is more efficient since the repetition, frequency and presence penalties are applied in one go instead of two.
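As a sketch, the merged entry point in `llama.h` replaces the previous pair (`llama_sample_repetition_penalty` and `llama_sample_frequency_and_presence_penalties`). The parameter names below follow the merged header as I recall it, and `ctx`, `candidates`, and `prev` (a `llama_context *`, a `llama_token_data_array`, and a `std::vector<llama_token>` of sampled history) are assumed to be in scope:

```cpp
// one pass applies repetition, frequency and presence penalties together;
// assumes prev holds at least penalty_last_n previously sampled tokens
const size_t penalty_last_n = 64;  // illustrative value

llama_sample_repetition_penalties(
        ctx, &candidates,
        prev.data() + prev.size() - penalty_last_n,  // recent tokens only
        penalty_last_n,
        1.10f,    // penalty_repeat  (repetition)
        0.00f,    // penalty_freq    (frequency)
        0.00f);   // penalty_present (presence)
```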
Sampling context keeps fewer previous tokens

The `prev` array used to have a size of `n_ctx`. Since it is used just for applying repetition penalties and checking for antiprompts, it does not have to be larger than `params.penalty_last_n` (a.k.a. `repeat_last_n`).
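The bookkeeping this implies is small; here is a minimal sketch of the idea (not the PR's actual code, which keeps `prev` inside `llama_sampling_context`):

```cpp
#include <deque>
#include "llama.h"  // llama_token

// keep only as many tokens as the penalties and antiprompt checks can use
struct recent_tokens {
    size_t capacity;                // = params.penalty_last_n, not n_ctx
    std::deque<llama_token> buf;

    void push(llama_token id) {
        buf.push_back(id);
        if (buf.size() > capacity) {
            buf.pop_front();        // older tokens no longer affect sampling
        }
    }
};
```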
Delete obsolete examples

- `embd-input`
- `gptneox-wip`
Other changes

- Moved the `grammar` string from `gpt_params` to `llama_sampling_params`
- Hid the `prev` vector in `llama_sampling_context` behind API calls for future-proofness (a sketch of the accessors follows this list)
- Applied the server patch from "Fixed segfault when prompt was too long" #3661 (bbc0b7c)
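For the `prev` point, a hedged sketch of what going through the API looks like; `llama_sampling_last` and `llama_sampling_prev_str` are the accessors `common/sampling.h` gained around this change, assuming I recall their shapes correctly:

```cpp
// read recent history through accessors instead of ctx_sampling->prev
llama_token last = llama_sampling_last(ctx_sampling);

// last 32 sampled tokens as text, e.g. for antiprompt checks in the server
std::string recent = llama_sampling_prev_str(ctx_sampling, ctx, 32);
```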
🤖 Generated by Copilot at 56ba00b
Summary

🗑️🛠️✨ This pull request refactors the sampling module of the llama library to use a new struct for the sampling parameters, to simplify the interface and the penalty logic, and to make the grammar rules optional. It also removes the outdated and incompatible `embd-input` example and its related files.

Walkthrough
- Removed the `embd-input` example and its related files, as it was no longer needed
- Refactored the `llama_sampling_params` and `gpt_params` structs, and updated the functions that used them, to improve clarity and consistency
- Reworked the `common/sampling.cpp` and `common/sampling.h` files, to provide more functionality and flexibility for the sampling process
- Edited the `common/sampling.cpp` and `examples/CMakeLists.txt` files, as minor stylistic changes
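To illustrate the "grammar rules optional" point from the summary: with the `grammar` string living in `llama_sampling_params`, a sampling context can be built with or without constraints. A hedged sketch (field name per the merged header; the GBNF string shown is purely illustrative):

```cpp
llama_sampling_params sparams;
sparams.grammar = "";  // empty string: no grammar constraints are applied

// a GBNF grammar string enables constrained sampling instead, e.g.:
// sparams.grammar = R"(root ::= "yes" | "no")";

llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);
```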