Update prefix scorer to report cached prefix length in tokens #2053

k8s-ci-robot merged 12 commits into kubernetes-sigs:main
Conversation
ahg-g left a comment:
Can you please add more context in the description on why this is needed?
/ok-to-test
  // The input prompt is broken into sizes of BlockSizeTokens to calculate block hashes. Requests
  // with length shorter than the block size will be ignored.
- BlockSize int `json:"blockSize"`
+ BlockSizeTokens int `json:"blockSize"`
We need to add to the description that this PR introduces a user-facing change (the user here being the one who deploys the epp): this PR removes a config variable and adds a new one with different semantics.
In fact, we should keep the old variable, mark it as deprecated, and fail to instantiate the plugin if it is set, with an error message instructing the user to migrate to the new parameter and its new semantics.
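A minimal sketch of the suggested deprecation guard. The `config` struct and `validate` function below are hypothetical, not the epp plugin's actual configuration code; they only illustrate failing instantiation with a migration message when the legacy parameter is set.

```go
package main

import (
	"errors"
	"fmt"
)

// config mirrors the plugin parameters; field names here are illustrative,
// not the actual epp configuration schema.
type config struct {
	BlockSize       int // deprecated: size in characters
	BlockSizeTokens int // replacement: size in tokens
}

// validate fails plugin instantiation when the deprecated parameter is set,
// pointing the user at the new token-based parameter.
func validate(c config) error {
	if c.BlockSize != 0 {
		return errors.New("blockSize is deprecated: use blockSizeTokens (measured in tokens) instead")
	}
	if c.BlockSizeTokens <= 0 {
		return errors.New("blockSizeTokens must be a positive number of tokens")
	}
	return nil
}

func main() {
	fmt.Println(validate(config{BlockSize: 64}))       // deprecated parameter set -> error
	fmt.Println(validate(config{BlockSizeTokens: 16})) // valid config -> <nil>
}
```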
Force-pushed 40656e7 to b301782.
  state := &SchedulingContextState{
  	PrefixHashes: hashes,
- 	PrefixCacheServers: p.matchLongestPrefix(ctx, hashes),
+ 	PrefixCacheServers: p.matchLongestPrefix(ctx, hashes, blockSize),
do we not need to set the blockSize parameter here like we do in Score?
  // A map of server to its longest prefix cache match length.
  PrefixCacheServers map[ServerID]int
  // Size of a block in tokens
  BlockSize int
should we be consistent and also name this BlockSizeTokens?
- // Update servers with their longest prefix match.
- res[server]++
+ // Update servers with their longest prefix match; prefix length is in tokens.
+ res[server] += blockSize
why do we need to report the longest prefix in tokens, isn't it enough to track it in terms of number of blocks? The less the number of places where we make the blockSize a factor the better, right?
The reason we are switching to tokens here is that this value is used by llm-d-scheduler (in disaggregated PD); if it were defined as a number of blocks, the scheduler would also need to know the block size, since PD's decision about disaggregation is based on the absolute length of the non-cached suffix (measured in tokens).
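The two accountings discussed above differ only by a factor of the block size. A sketch of that conversion, assuming a per-server matched-block count (the `matchedTokens` helper and `ServerID` type here are illustrative, not the plugin's actual `matchLongestPrefix` implementation):

```go
package main

import "fmt"

type ServerID string

// matchedTokens converts a per-server count of matched prefix blocks into
// a prefix length in tokens, which is what a downstream consumer such as
// llm-d-scheduler can use without knowing the block size.
func matchedTokens(blocks map[ServerID]int, blockSizeTokens int) map[ServerID]int {
	res := make(map[ServerID]int, len(blocks))
	for server, n := range blocks {
		res[server] = n * blockSizeTokens
	}
	return res
}

func main() {
	blocks := map[ServerID]int{"pod-a": 3, "pod-b": 1}
	// With blockSizeTokens=16, pod-a has 48 cached prefix tokens.
	fmt.Println(matchedTokens(blocks, 16))
}
```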
- total := len(state.PrefixHashes)
+ // total prefix length in tokens
+ total := len(state.PrefixHashes) * blockSize
If matchLongestPrefix reports the number of matched blocks, then we don't need to multiply by blockSize here, right? Maybe I am missing something, but if we do that, wouldn't we restrict the relevance and use of blockSizeTokens to the function that computes the hashes?
You're right: if we return to block units in matchLongestPrefix, we can keep this without multiplying by the block length. But matchLongestPrefix is used in PrepareRequestData, where we want to report the prefix length in tokens.
Maybe the right approach is to use blocks in all places and multiply by block size only in PrepareRequestData, what do you think?
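For the score itself, either unit works, since the block-size factor cancels in the ratio. A sketch under the assumption that the score is matched length over total length (`prefixScore` is a hypothetical name, not the plugin's actual function):

```go
package main

import "fmt"

// prefixScore returns the fraction of the request prefix already cached on a
// server. Measuring both numerator and denominator in tokens (or both in
// blocks) yields the same ratio, since blockSizeTokens cancels out.
func prefixScore(matchedBlocks, totalBlocks, blockSizeTokens int) float64 {
	if totalBlocks == 0 {
		return 0
	}
	matchedTokens := matchedBlocks * blockSizeTokens
	totalTokens := totalBlocks * blockSizeTokens
	return float64(matchedTokens) / float64(totalTokens)
}

func main() {
	// 3 of 4 blocks cached -> score 0.75 regardless of the block size.
	fmt.Println(prefixScore(3, 4, 16))
	fmt.Println(prefixScore(3, 4, 128))
}
```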
Force-pushed b301782 to c1cea68.
Force-pushed c1cea68 to 30d4ffd.
Sorry for the delayed response. Somehow I missed it in my inbox. @ahg-g
First of all, this is mitigated by the "autoTune" feature, where we can automatically set these knobs. In cases where users need to provide this config (e.g., a model server that doesn't provide such metrics), it's true that users need to understand the "magic number"; however, I think this is by design, because the user needs to know that this is just a magic number and IGW doesn't have the true token count.
I think renaming this to
For the UX gaps: with "autoTune", and future enhancements such as tracking the exact token count in the response metrics and tuning the LRU capacity dynamically, we can completely hide this from users. So to sum it up, I support adding that internal "pseudo token count" but not changing the user-facing config.
autoTune is a great UX, but that is not the focus of the change; it is about the other parameter that exists in the API right now. However, looking at the implementation of autoTune, the block_size metric we read from the model server is in tokens, so that makes it doubly important to change this parameter to tokens, to better align with the autoTune feature.

Good UX is one with the minimum API surface. In the case where this parameter is set and one can't use autoTune, forcing the user to think about the magic number serves no value to the user: they can't change it, since it must align with our internal assumption on how to translate tokens to chars in terms of size. It is also not intuitive, given that the value the user configures on the model server is block_size in tokens.
Thinking a bit more here... Let's take a step back and forget the autoTune feature for a moment. The scorer has two config knobs: the block size, which is how it partitions the input text, and LRUCapacityPerServer, which is how many blocks to keep in the lookup table for each pod. An important note is that the blockSize doesn't necessarily need to match what vLLM has, though matching is preferred; a naive case is blockSize=1, where we simply do a full prefix match. The vLLM blockSize and the magic number come into play when users need to estimate LRUCapacityPerServer. Users don't have to use the "IGW magic number" if they have a better number. So far I think the semantics are clear, though not a great UX. Now if we change it to
So I think we have these options:
Sorry for the back and forth; this just shows the current UX is not straightforward. I am supportive of option 2, to align these with vLLM, but let's make it consistent.
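To illustrate the estimation these knobs imply (all numbers and the `lruCapacityBlocks` helper are hypothetical, not code from the repository): sizing the lookup table from a pod's cache capacity requires the block size only when the capacity knob is block-based.

```go
package main

import "fmt"

// lruCapacityBlocks estimates how many lookup-table entries cover a pod's
// KV cache, given its approximate capacity in tokens. A token-based capacity
// knob would need no such conversion; a block-based one needs the block size.
func lruCapacityBlocks(cacheCapacityTokens, blockSizeTokens int) int {
	if blockSizeTokens <= 0 {
		return 0
	}
	return cacheCapacityTokens / blockSizeTokens
}

func main() {
	// A pod caching ~500k tokens with 16-token blocks -> 31250 LRU entries.
	fmt.Println(lruCapacityBlocks(500_000, 16))
}
```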
Changing both to be token-based sounds good to me as well.
Force-pushed b8d7d46 to 68be9bb.
@mayabar Sorry for the back and forth. Can you also change LRUCapacityPerServer to tokens, as discussed in #2053 (comment)? Will lgtm once that's done. Thanks again for your patience!
@liu-cong @ahg-g |
I strongly believe we should make both unit changes in one PR; otherwise the middle state just adds confusion. Can you do a quick follow-up PR to change the other? If so, I can lgtm this. Thanks for your understanding.
RE: MaxPrefixBlocksToMatch
The
/lgtm Thanks! I can follow up on updating the docs and changing the other config knobs.
…etes-sigs/gateway-api-inference-extension#2053)

* matchLongestPrefix returns cached prefix length in characters instead of tokens
* Change data stored for prefix cache plugin in the prepareData step contains length values in tokens. Add block size to SchedulingContextState of the prefix cache plugin. Tests partial updates
* fix merge problem
* fixes
* rename BlockSize to BlockSizeTokens in prefix plugin
* In the prefix plugin, keep both block size parameters: the legacy one defined in chars and the new one defined in tokens, update tests accordingly
* fix documentation and test
* change prefix plugin implementation to work in block units, in data stored in PrepareRequestData add size of block in tokens
* cosmetic changes
* typo
* rename according the PR review
* fix merge

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
What this PR does / why we need it:
Currently, the prefix length stored in the prefix cache plugin is measured in blocks.
As part of enabling easy configuration for disaggregated PD support in the inference scheduler, all configuration field units will use tokens. This involves converting from characters to tokens using the average token length constant.
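The character-to-token conversion mentioned above can be sketched as follows. The constant value of 4 characters per token and the `charsToTokens` helper are assumptions for illustration; the repository's actual constant and names may differ.

```go
package main

import "fmt"

// avgTokenLengthChars is an assumed average characters-per-token constant;
// the repository's actual constant may differ.
const avgTokenLengthChars = 4

// charsToTokens converts a character-based size to an approximate token
// count, rounding up so short inputs still map to at least one token.
func charsToTokens(chars int) int {
	return (chars + avgTokenLengthChars - 1) / avgTokenLengthChars
}

func main() {
	// A legacy blockSize of 256 characters corresponds to ~64 tokens.
	fmt.Println(charsToTokens(256))
}
```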
Which issue(s) this PR fixes:
Fixes #2068
Does this PR introduce a user-facing change?: