Server allow /completion and /embedding #3815


Closed
4 tasks done
christianwengert opened this issue Oct 27, 2023 · 13 comments · Fixed by #3876
Labels
enhancement New feature or request

Comments

@christianwengert

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

When I start the server as follows:

./server -m wizardlm-70b-v1.0.q4_K_S.gguf --threads 8 -ngl 100  -c 4096 --embedding

and make a request to /embedding

curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"content": "Building a website can be done in 10 simple steps:"}'

I get back, as expected, the vector of embeddings. Now if I make a request to /completion as follows:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

I'd expect the normal completion to still work, but all I get is the embedding of the prompt (I tested it with the above examples, and the same vector is returned in both cases).
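To illustrate, both requests come back with an embedding-style payload of roughly this shape (values truncated and invented here for illustration):

{"embedding": [0.0123, -0.0456, 0.0789, ...]}

rather than the completion endpoint's usual response containing the generated text.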

Motivation

I guess having both normal completion and the possibility to just get embeddings makes sense in a lot of applications that use the server.

@christianwengert christianwengert added the enhancement New feature or request label Oct 27, 2023
@elohmeier

This is a regression; it still worked in b1359.

@a-h
Contributor

a-h commented Oct 31, 2023

It looks like the issue was introduced in commit 438c2ca, which references pull request #3677.

Before the change

Previously, the /embedding endpoint in server.cpp did this:

        const json data = format_embedding_response(llama);
        return res.set_content(data.dump(), "application/json"); });

Which went through to:

static json format_embedding_response(llama_server_context &llama)
{
    return json{
        {"embedding", llama.getEmbedding()},
    };
}

Which went to:

std::vector<float> getEmbedding()
    {
        static const int n_embd = llama_n_embd(model);
        if (!params.embedding)
        {
            LOG_WARNING("embedding disabled", {
                                                  {"params.embedding", params.embedding},
                                              });
            return std::vector<float>(n_embd, 0.0f);
        }
        const float *data = llama_get_embeddings(ctx);
        std::vector<float> embedding(data, data + n_embd);
        return embedding;
    }
};

That's what I'd expect, since it matches the behaviour of the command line in calling llama_get_embeddings - https://github.com/ggerganov/llama.cpp/blob/207b51900e15cc7f89763a3bb1c565fe11cbb45d/examples/embedding/embedding.cpp#L91C31-L91C51

After the commit

What it appears to be doing is adding a task to a queue by calling request_completion, but not specifying that the task should be an embedding task.

https://github.com/ggerganov/llama.cpp/blob/07178c98e1b61a5e2af39d347add12e7eb9e08e1/examples/server/server.cpp#L2436-L2438

The change accidentally makes the server always respond with embeddings if it is started with the --embedding param:


                // prompt evaluated for embedding
                if (params.embedding)
                {
                    send_embedding(slot);
                    slot.release();
                    slot.i_batch = -1;
                    return true;
                }

What I think should have happened is that:

  • The task_server struct needs an additional field called embedding_mode which is set to true by the /embedding endpoint, and defaults to false by the others.
  • The if (params.embedding) check needs changing to use slot.embedding_mode instead (see the sketch below).
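A minimal sketch of that idea (field name and placement are illustrative, not the actual patch):

// Illustrative only: carry an explicit per-request flag instead of relying on
// the server-wide params.embedding setting.
struct task_server
{
    // ... existing fields ...
    bool embedding_mode = false; // set to true only by the /embedding handler
};

// In the slot-processing loop, decide per slot rather than per server:
if (slot.embedding_mode)
{
    send_embedding(slot);
    slot.release();
    slot.i_batch = -1;
    return true;
}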

Possible accidental removal of feature?

It looks to me like the feature was accidentally removed in the refactor.

@ggerganov
Member

Yes, very likely it was accidentally removed.
PRs welcome to restore the behavior

@a-h
Contributor

a-h commented Oct 31, 2023

Thanks, the Nix flake made it very easy to get a development environment and build setup together.

@qiisziilbash

qiisziilbash commented Feb 26, 2024

Where can I see which embedding model is being used? Or how can I pass a different embedding model (different from the chat completion one)?
I read the README files about the --embedding flag but could not find answers to my questions.
Thank you

@phymbert
Collaborator

phymbert commented Feb 26, 2024

Where can I see what embedding model is being used? or how can I pass a different embedding model? I did read the Readme files to use --embedding flag but could not find answers to my questions. Thank you

The server only supports one model, which is used both for embeddings and completions. If you need two different models, you need two servers. Feel free to open an enhancement issue if you would like to have different models for embeddings and completions.
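For example, a rough sketch of that two-server setup (model file names and ports are placeholders):

./server -m chat-model.gguf --port 8080
./server -m embedding-model.gguf --embedding --port 8081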

@temetski

Hi, I'm running 01245f5 right now and I get the error

{"error":{"code":501,
    "message":"This server does not support completions. Start it without `--embeddings`",
    "type":"not_supported_error"}
}

when starting llama-server with --embeddings. Is this another regression or has there been a shift in llama.cpp's workflow?

@a-h
Contributor

a-h commented Jul 30, 2024

Looks like an accident, based on reading the commit where it was changed: c3ebcfa#diff-87355a1a297a9f0fdc86af5e2a59cae153290f58d68822cd10c30fee4f7f7076L2005

Looks like it's supposed to check that all the requests in a batch are either embedding or completion, not to prevent use of embeddings and completion within the same server process.

I haven't looked into it more than that, but it doesn't look right at first glance.

@a-h
Contributor

a-h commented Jul 30, 2024

Although, reading #8420, it looks like the workflow has changed: to enable both embedding and completion, you must now omit the --embedding flag, i.e. --embedding now means "embedding only".

Reading through the rest of the comments, it seems the docs are being updated by @okigan.

@temetski

Thanks @a-h! You're right, the embeddings endpoint does work without --embeddings. I think it's just a matter of updating the documentation for how the flag works.

@edwin0cheng

Could anyone tell me how to enable embedding AND completion at the same time in the current code base?
I tried:

  1. With --embedding, only /embedding works, but /completion fails with a 501 code.
  2. Without --embedding, only /completion works, but /embedding fails with a 501 code.

@ggerganov
Member

@edwin0cheng After #10135, running without the --embedding flag should enable both the /completion and /embedding endpoints.
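For example, something like the following should then work against a single server (model path and prompt are placeholders):

./llama-server -m model.gguf -c 4096

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 16}'

curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"content": "Building a website can be done in 10 simple steps:"}'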

@edwin0cheng
Copy link

I just tested it and it works now, thank you.
