Description
The program does not currently allow for true batch processing when creating embeddings, which would greatly speed up embedding generation. Upstream llama.cpp supports batching, but Kobold doesn't implement it. Here's an outline that's hopefully helpful:
- llama.h allows the batching of multiple sequences of tokens.
- common.cpp adds each token of a prompt to the "llama_batch" and associates each token with a sequence id
- llama_encode is used; see ggml-org/llama.cpp#13108 ("context : allow cache-less context for embeddings") for more detail
- embeddings are retrieved using "llama_get_embeddings_seq" or "llama_get_embeddings_ith" (a rough sketch of the whole flow follows this list)
- "n_batch" within llama.cpp refers to the maximum number of tokens that can be sent to "llama_batch". It is not traditional. For example, with pytorch and sentence transformers each "batch" consists of multiple "sequences" (i.e. chunks of text to be embedded) and each one must not exceed the context limit of a particular embedding model. Further each "batch" is padded to properly create a "tensor." However, with llama.cpp, each "batch" (i.e. a sequence/chunk of text to be embedded) becomes part of a flat token stream and different sequences are identified with a "seq_id" and, therefore, padding is unnecessary. However, this means that all sequences must be within "n_batch" and, more importantly, "n_batch" must never exceed an embedding model's maximum context limit. Again, this is different than pytorch and sentence transformers.
- Historically, embedding models had a maximum context limit of 512 (and many still do), which made true batch processing with llama.cpp unattractive. However, many embedding models now have limits of 8192 or more...
Batch processing would greatly speed up embedding large documents.
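To make the above concrete, here is a minimal sketch of the flow using the llama.cpp C API. It assumes the context was created with embeddings enabled and a pooling type set, takes already-tokenized chunks as input, and uses an illustrative helper name (embed_chunks); exact field and function names can vary between llama.cpp versions, so treat it as an outline rather than working Kobold code.

```cpp
#include "llama.h"
#include <vector>

// Sketch: pack several pre-tokenized chunks into one llama_batch, one seq_id
// per chunk, encode them in a single call, then read back one pooled
// embedding per sequence. The caller must ensure the total token count fits
// in n_batch and that each chunk fits in the model's context limit.
static std::vector<std::vector<float>> embed_chunks(
        llama_context * ctx, int n_embd,
        const std::vector<std::vector<llama_token>> & chunks) {
    size_t n_total = 0;
    for (const auto & c : chunks) n_total += c.size();

    llama_batch batch = llama_batch_init((int32_t) n_total, 0, (int32_t) chunks.size());

    // Flat token stream: every token carries its own position and seq_id,
    // so no padding is needed.
    for (size_t s = 0; s < chunks.size(); ++s) {
        for (size_t i = 0; i < chunks[s].size(); ++i) {
            const int idx = batch.n_tokens++;
            batch.token   [idx]    = chunks[s][i];
            batch.pos     [idx]    = (llama_pos) i;      // positions restart per sequence
            batch.n_seq_id[idx]    = 1;
            batch.seq_id  [idx][0] = (llama_seq_id) s;
            batch.logits  [idx]    = true;               // request output for pooling
        }
    }

    // One encode call processes all sequences together.
    std::vector<std::vector<float>> out(chunks.size());
    if (llama_encode(ctx, batch) == 0) {
        for (size_t s = 0; s < chunks.size(); ++s) {
            const float * emb = llama_get_embeddings_seq(ctx, (llama_seq_id) s);
            if (emb) out[s].assign(emb, emb + n_embd);
        }
    }

    llama_batch_free(batch);
    return out;
}
```

This mirrors, roughly, what upstream's embedding example does, just without any per-chunk round trips.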
Within Kobold, embeddingstype_generate() processes a single prompt, although it's structured to allow future expansion.
Further, koboldcpp.py loops over user inputs and passes only one prompt at a time to the native embeddings_generate() C++ function:
```python
for prompt in prompts:
    inputs = embeddings_generation_inputs()
    inputs.prompt = prompt.encode()
    ret = handle.embeddings_generate(inputs)
```
It then aggregates the results into an OpenAI-style response.
Basically, multiple prompts are processed one at a time at the Python layer, and internally "embeddings_generate" only receives and batches a single prompt per call.
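For illustration only, a batched native interface could look something like the following. These struct and field names are hypothetical (they do not exist in KoboldCpp today); the point is simply that all prompts cross the ctypes boundary in one call, so the C++ side can pack them into a single llama_batch as sketched above.

```cpp
// Hypothetical only: none of these names exist in KoboldCpp today.
struct embeddings_generation_inputs_batched {
    int          prompt_count;  // number of chunks to embed in this call
    const char **prompts;       // UTF-8 encoded chunks, one sequence each
    bool         normalize;     // whether to L2-normalize each embedding
};

struct embeddings_generation_outputs_batched {
    int    status;              // 0 on success
    int    embedding_dim;       // n_embd of the loaded model
    float *embeddings;          // prompt_count * embedding_dim floats, row-major
};
```

The Python layer would then build one input struct per request instead of looping, and the existing OpenAI-style aggregation code could stay largely the same.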