
Update prefix scorer to report cached prefix length in tokens#2053

Merged
k8s-ci-robot merged 12 commits into kubernetes-sigs:main from mayabar:update-prefix-scorer
Jan 27, 2026

Conversation

@mayabar
Contributor

@mayabar mayabar commented Jan 5, 2026

What this PR does / why we need it:
Currently, the prefix length stored in the prefix cache plugin is measured in blocks.

As part of enabling easy configuration for disaggregated PD support in the inference scheduler, all configuration field units will use tokens. This involves converting from characters to tokens using the average token length constant.
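
The character-to-token conversion mentioned above can be sketched as follows. This is a minimal illustration only: the constant name and the value 4 are assumptions, not the repository's actual identifiers or values.

```go
package main

import "fmt"

// avgCharsPerToken is an assumed average-token-length constant;
// the actual name and value in the plugin may differ.
const avgCharsPerToken = 4

// charsToTokens converts a length measured in characters to an
// estimated length in tokens using the average token length constant.
func charsToTokens(chars int) int {
	return chars / avgCharsPerToken
}

func main() {
	// A 64-character prefix corresponds to an estimated 16 tokens.
	fmt.Println(charsToTokens(64))
}
```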

Which issue(s) this PR fixes:
Fixes #2068

Does this PR introduce a user-facing change?:

Prefix Plugin Changes
- New parameter: Added `blockSizeTokens` to prefix plugin configuration, defining cache block length in tokens (replacing character-based sizing).
- Deprecation notice: The legacy `blockSize` parameter is deprecated. Instantiating the prefix plugin will fail if `blockSize` is defined without also specifying `blockSizeTokens`.
- Data unit update: Changed data stored in `PrepareRequestData` in the prefix plugin from blocks to tokens.
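
The deprecation behavior in the release note above can be sketched like this. Only the two block-size field names come from the PR; the struct name, error text, and validate helper are illustrative assumptions, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// prefixConfig sketches the prefix plugin parameters described above.
type prefixConfig struct {
	// Deprecated: block length measured in characters.
	BlockSize int `json:"blockSize"`
	// Block length measured in tokens (the replacement parameter).
	BlockSizeTokens int `json:"blockSizeTokens"`
}

// validate mirrors the documented behavior: instantiation fails when
// the legacy blockSize is set without also setting blockSizeTokens.
func (c *prefixConfig) validate() error {
	if c.BlockSize != 0 && c.BlockSizeTokens == 0 {
		return errors.New("blockSize is deprecated, please migrate to blockSizeTokens (measured in tokens)")
	}
	return nil
}

func main() {
	legacyOnly := &prefixConfig{BlockSize: 256}
	fmt.Println(legacyOnly.validate() != nil) // legacy-only config is rejected

	migrated := &prefixConfig{BlockSizeTokens: 64}
	fmt.Println(migrated.validate() == nil) // token-based config is accepted
}
```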

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
@netlify

netlify Bot commented Jan 5, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit d3ad1e6
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6976028fd6146200084f9254
😎 Deploy Preview https://deploy-preview-2053--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2026
@k8s-ci-robot
Contributor

Hi @mayabar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2026
@mayabar mayabar changed the title WIP: Update prefix scorer to report cached prefix length in tokens Update prefix scorer to report cached prefix length in tokens Jan 5, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
Contributor

@ahg-g ahg-g left a comment

Can you please add more context in the description why this is needed.

/ok-to-test

Comment thread pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go Outdated
Comment thread pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go Outdated
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 5, 2026
@mayabar mayabar requested a review from ahg-g January 5, 2026 14:03
// The input prompt is broken into sizes of BlockSizeTokens to calculate block hashes. Requests
// with length shorter than the block size will be ignored.
BlockSize int `json:"blockSize"`
BlockSizeTokens int `json:"blockSize"`
Contributor

@ahg-g ahg-g Jan 5, 2026

We need to add to the description that this PR introduces a user-facing change (the user here being the one who deploys the epp); This PR removes a config variable and adds a new one with different semantics.

In fact, we should keep the old variable, mark it as deprecated and fail to instantiate the plugin if set with an error message to instruct the user to migrate to the new parameter with its new semantics.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2026
@mayabar mayabar force-pushed the update-prefix-scorer branch from 40656e7 to b301782 Compare January 6, 2026 12:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2026
state := &SchedulingContextState{
PrefixHashes: hashes,
PrefixCacheServers: p.matchLongestPrefix(ctx, hashes),
PrefixCacheServers: p.matchLongestPrefix(ctx, hashes, blockSize),
Contributor

do we not need to set the blockSize parameter here like we do in Score?

// A map of server to its longest prefix cache match length.
PrefixCacheServers map[ServerID]int
// Size of a block in tokens
BlockSize int
Contributor

should we be consistent and also name this BlockSizeTokens?

// Update servers with their longest prefix match.
res[server]++
// Update servers with their longest prefix match, prefix length is in tokens.
res[server] += blockSize
Contributor

why do we need to report the longest prefix in tokens, isn't it enough to track it in terms of number of blocks? The less the number of places where we make the blockSize a factor the better, right?

Contributor Author

The reason we are switching to tokens here is that this value is used by llm-d-scheduler (in disaggregated PD). If it were defined as a number of blocks, the scheduler would also need to know the block size, since PD's decision about disaggregation is based on the absolute length of the non-cached suffix (measured in tokens).
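
The point above can be sketched as a decision function. All names and the threshold parameter are illustrative assumptions about the llm-d-scheduler's logic, not its actual API.

```go
package main

import "fmt"

// shouldDisaggregate sketches the PD decision described above: it
// depends on the absolute length of the non-cached suffix in tokens.
// If the cached prefix were reported in blocks, the scheduler would
// also need the block size to recover this value.
func shouldDisaggregate(promptTokens, cachedPrefixTokens, suffixThresholdTokens int) bool {
	nonCachedSuffix := promptTokens - cachedPrefixTokens
	return nonCachedSuffix >= suffixThresholdTokens
}

func main() {
	// A 1024-token prompt with 896 cached tokens leaves a 128-token
	// suffix, below an assumed 256-token threshold: keep aggregated.
	fmt.Println(shouldDisaggregate(1024, 896, 256))
}
```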


total := len(state.PrefixHashes)
// total prefix length in tokens
total := len(state.PrefixHashes) * blockSize
Contributor

If matchLongestPrefix reports the number of matched blocks, then we don't need to multiply by blockSize here, right? May be I am missing something, but If we do that, wouldn't we restrict the relevance and use of the blockSizeTokens to the function that computes the hashes.

Contributor Author

You're right: if we return to block units in matchLongestPrefix, we can skip the multiplication by block size here. But matchLongestPrefix is also used in PrepareRequestData, where we want to report the prefix length in tokens.
Maybe the right approach is to use blocks everywhere and multiply by block size only in PrepareRequestData, what do you think?
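
The approach floated above, keeping block units everywhere and converting only where tokens are reported, can be sketched like this (function names are illustrative, not the plugin's actual code):

```go
package main

import "fmt"

// scoreFromBlocks works purely in block units: the ratio of matched
// blocks to total blocks needs no block size at all.
func scoreFromBlocks(matchedBlocks, totalBlocks int) float64 {
	if totalBlocks == 0 {
		return 0
	}
	return float64(matchedBlocks) / float64(totalBlocks)
}

// cachedPrefixTokens converts to tokens only at the reporting
// boundary (PrepareRequestData), keeping blockSizeTokens out of the
// rest of the plugin.
func cachedPrefixTokens(matchedBlocks, blockSizeTokens int) int {
	return matchedBlocks * blockSizeTokens
}

func main() {
	fmt.Println(scoreFromBlocks(3, 4))     // 0.75
	fmt.Println(cachedPrefixTokens(3, 64)) // 192
}
```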

@mayabar mayabar requested a review from ahg-g January 6, 2026 13:26
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2026
@mayabar mayabar force-pushed the update-prefix-scorer branch from b301782 to c1cea68 Compare January 8, 2026 08:14
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 8, 2026
@mayabar mayabar force-pushed the update-prefix-scorer branch from c1cea68 to 30d4ffd Compare January 12, 2026 07:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2026
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2026
@liu-cong
Contributor

Sorry for the delayed response. Somehow I missed it in my inbox. @ahg-g

The current config API is not ideal, IIUC it forces the user to make an assumption about the average characters per token and align that with what IGW assumes.

First of all, this is mitigated by the "autoTune" feature, where we can automatically set these knobs. In the cases where users need to provide this config (e.g., a model server that doesn't provide such metrics), it's true that users need to understand the "magic number". However, I think this is by design, because the user needs to know that this is just a magic number and IGW doesn't have the true token count.

If that is correct, then I think the change this PR makes to the API is better than what we have right now.

I think renaming this to blockSizeTokens gives an impression that the scorer understands tokens, whereas it doesn't. This is misleading. Also, implementation-wise, the scorer internally takes a fixed block size that does not change with different input tokens. So it's the blockSize of the input (characters), not tokens.

For the UX gaps, with "autoTune", and future enhancements such as tracking the exact token count in the response metrics and tune the LRU capacity dynamically, we can completely hide this from the users.

So to sum up, I support adding that internal "pseudo token count" but not changing the user-facing config.

@ahg-g
Contributor

ahg-g commented Jan 21, 2026

autoTune is a great UX, but that is not the focus of the change, it is about the other parameter that exists in the API right now. However, looking at the implementation of autoTune, the block_size metric we read from the model server is in tokens, so that makes it doubly important to change this parameter to tokens to better align with the autoTune feature.

Good UX is one with the minimum API surface, in the case where this parameter is set and one can't use autoTune, forcing the user to think about the magic number serves no value to the user as they can't change it since it must align with our internal assumption on how to translate tokens to chars in terms of size. It is also not intuitive given that the value the user configures on the model server is block_size in tokens.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 21, 2026
@liu-cong
Contributor

Thinking a bit more here...

Let's take a step back and forget the autoTune feature for a moment. The scorer has two config knobs: the block size, which is how it partitions the input text; and the LRUCapacityPerServer, which is how many blocks to keep in the lookup table for each pod. An important note is that the blockSize doesn't necessarily need to match what vLLM has, though matching is preferred. A naive case is blockSize=1, where we simply do a full prefix match. The vLLM blockSize and the magic number come into play when users need to estimate the LRUCapacityPerServer. Users don't have to use the "IGW magic number" if they have a better one. So far I think the semantics are clear, though not a great UX.
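
A minimal sketch of the character-block partitioning described above. The chained hashing and the use of FNV are assumptions about the scheme, not the plugin's actual implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockHashes partitions the prompt into fixed-size character blocks
// and hashes each block chained with the previous block's hash, so a
// block hash only matches when its whole prefix matches too. A
// trailing partial block is ignored.
func blockHashes(prompt string, blockSize int) []uint64 {
	var hashes []uint64
	prev := uint64(0)
	for start := 0; start+blockSize <= len(prompt); start += blockSize {
		h := fnv.New64a()
		var buf [8]byte
		binary.BigEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		h.Write([]byte(prompt[start : start+blockSize]))
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

func main() {
	// A 10-character prompt with blockSize=4 yields 2 block hashes;
	// the last 2 characters do not fill a block and are dropped.
	fmt.Println(len(blockHashes("abcdefghij", 4)))
}
```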

Now if we change it to blockSizeTokens, I am unclear what it means. Does it mean a) the block size in tokens that the scorer uses to hash the request, or b) the block size of the model server? The former is not true because the scorer doesn't have access to tokens. The latter could work, and the UX is simpler. If we choose this approach, we should also change the semantics of LRUCapacityPerServer to something like tokenCapacityPerServer. I don't think changing only blockSizeTokens makes sense; it adds more confusion without also changing the capacity config.

So I think we have these options:

  1. Keep existing config. They are semantically sound and flexible (not tied to the magic number). UX is not great.
  2. Change both configs to token based: the scorer takes these hints and applies the magic number (and we can make this dynamic later on). UX is simpler.

Sorry for the back and forth. This just shows the current UX is not straightforward. I am supportive of option 2 to align these with vLLM, but let's make it consistent.

@ahg-g
Contributor

ahg-g commented Jan 22, 2026

Changing both to be token based sounds good to me as well.

…d of tokens

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
…contains length values in tokens.

- Add block size to SchedulingContextState of the prefix cache plugin.
- Tests partial updates

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
… defined in chars and the new one defined in tokens, update tests accordingly

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
…tored in PrepareRequestData add size of block in tokens

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@mayabar mayabar force-pushed the update-prefix-scorer branch from b8d7d46 to 68be9bb Compare January 25, 2026 09:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2026
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
@liu-cong
Contributor

@mayabar Sorry for the back and forth. Can you also change the LRUCapacityPerServer to tokens as discussed in #2053 (comment)? Will lgtm once that's done. Thanks again for your patience!

@mayabar
Contributor Author

mayabar commented Jan 26, 2026

@liu-cong @ahg-g
Moving to consistent units is a great idea, but changing the LRUCapacityPerServer parameter is out of scope for this PR. I suggest opening an issue for it and making that change in a separate PR. In addition, there's another parameter, MaxPrefixBlocksToMatch, that might also make sense to convert to tokens. What do you think?

@liu-cong
Contributor

@mayabar

MaxPrefixBlocksToMatch is a number of blocks and is independent of the unit (token or character), so we are fine there.

I strongly believe we should make both unit changes in one PR; otherwise the middle state just adds confusion. Can you do a quick follow-up PR to change the other? If so I can lgtm this. Thanks for your understanding.

@mayabar
Contributor Author

mayabar commented Jan 26, 2026

MaxPrefixBlocksToMatch defines the maximum prefix length to match. While it's currently measured in blocks, I don't see a reason not to align it with the other configuration fields and define it in tokens. This is similar to LRUCapacityPerServer, which is also currently defined in blocks and will have its units changed to tokens soon.

@liu-cong
Contributor

RE: MaxPrefixBlocksToMatch

The blockSize and LRUCapacityPerServer parameters are "soft" knobs - we are trying to mimic the model server but not necessarily match it exactly. MaxPrefixBlocksToMatch, however, is a "hard" one. We won't be able to respect it without access to the exact token count. Therefore I suggest not changing it. In fact, vLLM uses `num_gpu_blocks` as the cache capacity metric. So we can just keep blockSize in tokens and the others in blocks.

@liu-cong
Contributor

/lgtm
/unhold

Thanks! I can follow up on updating docs and changing other config knobs

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 27, 2026
@k8s-ci-robot k8s-ci-robot merged commit 5df0acd into kubernetes-sigs:main Jan 27, 2026
11 checks passed
elevran pushed a commit to llm-d/llm-d-inference-scheduler that referenced this pull request Apr 23, 2026
…etes-sigs/gateway-api-inference-extension#2053)

* matchLongestPrefix returns cached prefix length in characters instead of tokens

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* - Change data stored for prefix cache plugin in the prepareData step contains length values in tokens.
- Add block size to SchedulingContextState of the prefix cache plugin.
- Tests partial updates

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* fix merge problem

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* fixes

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* rename BlockSize to BlockSizeTokens in prefix plugin

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* In the prefix plugin, keep both block size parameters: the legacy one defined in chars and the new one defined in tokens, update tests accordingly

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* fix documentation and test

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* change prefix plugin implementation to work in block units, in data stored in PrepareRequestData add size of block in tokens

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* cosmetic changes

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* typo

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* rename according the PR review

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

* fix merge

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

---------

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


Successfully merging this pull request may close these issues.

Change measuring units in prefix plugin

4 participants