Tighten compilation cache invariants around eagle #17662
Conversation
vllm/compilation/backends.py (outdated):

```python
# calls in a single model, please open an issue and let's discuss.
speculative_config = self.vllm_config.speculative_config
assert speculative_config is not None
assert speculative_config.method in ("eagle", "eagle3")
```
nit:

```diff
-assert speculative_config.method in ("eagle", "eagle3")
+assert speculative_config.use_eagle()
```
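For context, a `use_eagle()` helper along the lines of the suggestion could be sketched like this (a minimal stand-in for illustration, not vLLM's actual `SpeculativeConfig`):

```python
from dataclasses import dataclass


@dataclass
class SpeculativeConfig:
    """Minimal stand-in for vLLM's SpeculativeConfig (sketch only)."""
    method: str

    def use_eagle(self) -> bool:
        # Treat both eagle variants uniformly, as the suggestion intends,
        # so call sites don't enumerate method strings themselves.
        return self.method in ("eagle", "eagle3")
```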
vllm/compilation/backends.py (outdated):

```python
# The eagle head does not need its own hash; we assume
# the hash of the original model entirely determines the config of
# the eagle head.
```
One small concern is that the eagle3 head often has a different hidden size than the original model.
For example, the hidden size of Llama 3.3 70B is 8192, while the hidden size of its eagle3 head (from the eagle3 authors) is 6144 (https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B).
So, technically, an eagle3 head can define its own hidden size.
There's only one public eagle3 head per model, so this assumption works for those public heads. I'm a little bit concerned this might not be the case for internal models/heads.
What do you mean by "internal models/heads"? Internal to vLLM or Meta or something else?
@zou3519 Internal to Meta or other companies.
@zou3519 I don't mean to block this PR. I think this PR should be shipped (once the CI passes). I just wanted to give a heads up about the edge case. Sorry for the confusion!
> @zou3519 Internal to Meta or other companies.

That makes sense to me.

> @zou3519 I don't mean to block this PR. I think this PR should be shipped (once the CI passes). I just wanted to give a heads up about the edge case. Sorry for the confusion!
Oh, I was asking so that I can drop in some more comments here about the current state. I'll update this PR with your comments. Thanks for the discussion!
I'm recording down my understanding of how eagle and the compilation cache work after discussing vllm-project#17211 with @luyuzhe111 and @WoosukKwon.

In the future we likely will have a situation where we want to torch.compile multiple pieces of code (e.g. decoder and encoder separately), and then we'll need to refactor the system to support it (each compiled region needs its own cache directory with its own hash). But until then the current design seems fine.

Signed-off-by: rzou <[email protected]>
Looks like the assumptions are wrong (the asserts are triggering in the tests), so we need some fixes. I have some idea of how to do this; it'll be a bigger refactor.

```
[2025-05-09T20:34:59Z] if compilation_counter.num_graphs_seen > 0:
```
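One way the triggering invariant might look, reconstructed as a hypothetical sketch (only `compilation_counter.num_graphs_seen` appears in the log above; the function and its exception message are illustrative, not vLLM code):

```python
def check_single_graph_invariant(num_graphs_seen: int, uses_eagle: bool) -> None:
    """Hypothetical invariant from the discussion: a second torch.compile'd
    graph is only expected when an eagle drafter head is present."""
    if num_graphs_seen > 0 and not uses_eagle:
        raise AssertionError(
            "Multiple torch.compile calls in a single model; "
            "please open an issue and let's discuss.")
```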
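The per-region cache layout described above (each compiled region with its own cache directory keyed by its own hash) could be sketched as follows. This is hypothetical; names like `cache_dir_for_region` are illustrative, not vLLM's implementation:

```python
import hashlib
import os


def cache_dir_for_region(base_dir: str, region_name: str, region_hash: str) -> str:
    """Sketch of the future design: each compiled region (e.g. 'decoder',
    'encoder') gets its own cache directory, keyed by its own hash rather
    than sharing the target model's hash."""
    digest = hashlib.sha256(f"{region_name}:{region_hash}".encode()).hexdigest()[:16]
    return os.path.join(base_dir, region_name, digest)
```

Under this layout, recompiling the eagle head (or any one region) would invalidate only that region's directory, instead of relying on the target model's hash to determine everything.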