[Core][V0] Add guidance backend for structured output #14589
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5244b7f to 2db1a0e
Hey @russellb this looks fantastic! On point 3 -- is it possible to share an example of a workload where TPOT is reduced vs. xgrammar? I don't typically see this in JSON schemas, so it'd be a helpful case to look for performance improvements.
I'm using these changes to the benchmark suite on top of this PR: #14567

I'm running the following command, changing the structured output backend each run:

```
python3 benchmarks/benchmark_serving_structured_output.py \
    --port 8432 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset json-unique \
    --structured-output-ratio 1.0 \
    --structured-output-backend guidance \
    --output-len 300 \
    --num-prompts 90 \
    --request-rate 1
```
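For anyone reproducing this client-side rather than through the benchmark script, here is a minimal sketch, assuming the server started above is listening on port 8432 and that the V0 OpenAI-compatible server accepts the `guided_json` / `guided_decoding_backend` extra-body fields:

```python
# Minimal client-side sketch (assumptions: the server above is on port 8432,
# and the V0 server accepts the guided_json / guided_decoding_backend
# extra-body fields for per-request structured output).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8432/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Invent a plausible user record."}],
    extra_body={
        "guided_json": schema,
        # Swap in "xgrammar" or "outlines" to compare backends per request.
        "guided_decoding_backend": "guidance",
    },
)
print(resp.choices[0].message.content)
```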
Force-pushed from 2db1a0e to d66d439
This has been merged and I rebased this branch on top of it.
Testing results via @joerunde are positive. I think this is worth including to help affected users get by a little longer while people transition to V1.
Just a quick thought, but we can get this one in given that it is already being used in prod for v0.
This pull request has merge conflicts that must be resolved before it can be merged.
Let's get this in for v0
Nice work keeping it minimal, just a few questions
Is this really a blocking issue for v0.8.0? It looks like a feature that is nice to have?
yeah, I don't think this would block 0.8.0. Nice to have.
Can you fix the merge conflict?
```python
import os
from typing import Any, List, Type, Union

import llguidance  # type: ignore[import-untyped]
```
We can probably remove this `type: ignore` since guidance-ai/llguidance#139
once that makes it into a release at least
The failure in buildkite/ci/pr/entrypoints-test is unrelated.
This is the V1 integration for [guidance](https://github.com/guidance-ai/llguidance) as a backend for structured output. There is a V0 integration in vllm-project#14589.

This backend provides some key benefits to V1:

* Broader jsonschema support
* Quick startup performance for large schemas

Instead of precomputing the masks for all states, this is done on the fly. We see very fast request startup times, even for large schemas. This should make V1 roughly feature equivalent to V0 in terms of the types of schemas it can support.

More technical details are available in the llguidance git repo.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
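To make the "on the fly" point concrete, here is a toy pure-Python illustration (deliberately not the llguidance API, whose details differ): the legal-token mask is computed only for the state decoding actually reaches, instead of being precompiled for every reachable state at request startup.

```python
# Toy illustration of on-the-fly masking (NOT the llguidance API):
# the mask is built lazily per decoding step, rather than precomputed
# for every state of the grammar automaton up front.
VOCAB = ["{", "}", '"name"', ":", '"x"']

LEGAL = {  # tiny hand-written "grammar" for: { "name" : "x" }
    "start": {"{"},
    "key": {'"name"'},
    "colon": {":"},
    "value": {'"x"'},
    "end": {"}"},
}
NEXT_STATE = {"start": "key", "key": "colon", "colon": "value", "value": "end"}

state = "start"
while True:
    # Lazy, per-step mask: computed only when this state is reached.
    mask = [tok in LEGAL[state] for tok in VOCAB]
    print(f"{state:>6}: allowed = {[t for t, ok in zip(VOCAB, mask) if ok]}")
    if state == "end":
        break
    state = NEXT_STATE[state]
```

For a large JSON schema the number of reachable states can be enormous, which is why lazy masking avoids the startup cost that ahead-of-time compilation pays.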
@russellb I have a simplified version here: russellb/vllm@llguidance-v0-integration...mmoskal:llg_v0_matcher. Will hopefully be able to test it later today.
Entrypoints tests are failing consistently, PTAL.
Seems like an outlines issue with `json_object`, which is not related to this PR. But I think Michal wants to merge a different implementation later.
Yes - I'm going to push a PR to fix it today.
That's fine, though it could also be a follow-up PR if this is a working iteration to start with and build on.
For outlines, we actually fall back to …
The tests should work now -- I made …
requirements/common.txt

```diff
@@ -18,6 +18,7 @@ pillow # Required for image processing
 prometheus-fastapi-instrumentator >= 7.0.0
 tiktoken >= 0.6.0 # Required for DBRX tokenizer
 lm-format-enforcer >= 0.10.11, < 0.11
+llguidance>=0.6.15; platform_machine == "x86_64" or platform_machine == "arm64"
```
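Side note on the environment marker: pip evaluates it per host, so `llguidance` is only installed where the marker holds. A quick way to check what it evaluates to locally, using the `packaging` library:

```python
# Check how pip would evaluate the requirement's environment marker on this
# host (same marker syntax as in requirements/common.txt above).
from packaging.markers import Marker

marker = Marker('platform_machine == "x86_64" or platform_machine == "arm64"')
# Note: Linux ARM64 hosts typically report "aarch64", not "arm64", so as
# written the marker matches macOS Apple Silicon but not Linux ARM64.
print(marker.evaluate())
```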
This commit is based on the PR vllm-project#10217. It is updated to be compatible with `main`. It has also been significantly updated and simplified to take advantage of changes to llguidance since the original PR was written. Many thanks to the llguidance team for the help on this.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Co-authored-by: Michal Moskal <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Force-pushed from ff4b448 to 2631863
LGTM great work folks
This is the V1 integration for [guidance](https://github.com/guidance-ai/llguidance) as a backend for structured output. There is a V0 integration in vllm-project#14589.

This backend provides some key benefits to V1:

* Broader jsonschema support
* Quick startup performance for large schemas

Instead of precomputing the masks for all states, this is done on the fly. We see very fast request startup times, even for large schemas. This should make V1 roughly feature equivalent to V0 in terms of the types of schemas it can support.

An `auto` mode is also included, which provides opinionated fallback behavior based on our current understanding of varying feature support and performance characteristics for different scenarios.

More technical details are available in the llguidance git repo.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Co-authored-by: Michal Moskal <[email protected]>
I started looking at this after talking to @joerunde about some performance issues observed in production. While the ultimate goal is to get everyone to V1 where we expect to provide more drastic improvements, I wanted to see if we could do something to help in V0 in the meantime.
In both my own testing and that of the team that reported the problem, guidance resolved the performance issues. The problem was outlines taking too long up front to compile complex json schemas.
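For a sense of what "complex" means here, a hypothetical example (not the reporter's actual schema) of the shape that tends to be expensive to precompile: pattern-constrained strings and enums multiply the states an ahead-of-time compiler has to enumerate, while a lazy matcher only pays for states it actually visits.

```python
# Hypothetical schema illustrating the expensive shape: regex patterns and
# enum alternatives inflate the compiled automaton, slowing request startup
# for ahead-of-time backends like outlines.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "pattern": "^[a-f0-9]{32}$"},
        "status": {"enum": ["queued", "running", "succeeded", "failed"]},
        "tags": {
            "type": "array",
            "items": {"type": "string", "pattern": "^[A-Za-z0-9_-]{1,64}$"},
        },
    },
    "required": ["id", "status"],
}
```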
I think this is worth including to help people get by in V0 until they're ready for the full transition to V1.
Closes #14151