Update prefix scorer to report cached prefix length in tokens #2053

k8s-ci-robot merged 12 commits into kubernetes-sigs:main
Conversation
ahg-g left a comment:
Can you please add more context in the description on why this is needed?
/ok-to-test
  // The input prompt is broken into sizes of BlockSizeTokens to calculate block hashes. Requests
  // with length shorter than the block size will be ignored.
- BlockSize int `json:"blockSize"`
+ BlockSizeTokens int `json:"blockSize"`
We need to add to the description that this PR introduces a user-facing change (the user here being the one who deploys the epp): this PR removes a config variable and adds a new one with different semantics.
In fact, we should keep the old variable, mark it as deprecated, and fail to instantiate the plugin if it is set, with an error message instructing the user to migrate to the new parameter and its new semantics.
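A minimal sketch of the suggested deprecation guard. The `config` struct and `validate` function below are hypothetical, not the epp plugin's actual configuration code; they only illustrate failing instantiation with a migration message when the legacy parameter is set.

```go
package main

import (
	"errors"
	"fmt"
)

// config mirrors the plugin parameters; field names here are illustrative,
// not the actual epp configuration schema.
type config struct {
	BlockSize       int // deprecated: size in characters
	BlockSizeTokens int // replacement: size in tokens
}

// validate fails plugin instantiation when the deprecated parameter is set,
// pointing the user at the new token-based parameter.
func validate(c config) error {
	if c.BlockSize != 0 {
		return errors.New("blockSize is deprecated: use blockSizeTokens (measured in tokens) instead")
	}
	if c.BlockSizeTokens <= 0 {
		return errors.New("blockSizeTokens must be a positive number of tokens")
	}
	return nil
}

func main() {
	fmt.Println(validate(config{BlockSize: 64}))       // deprecated parameter set -> error
	fmt.Println(validate(config{BlockSizeTokens: 16})) // valid config -> <nil>
}
```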
Force-pushed 40656e7 to b301782.
  state := &SchedulingContextState{
  	PrefixHashes: hashes,
- 	PrefixCacheServers: p.matchLongestPrefix(ctx, hashes),
+ 	PrefixCacheServers: p.matchLongestPrefix(ctx, hashes, blockSize),
do we not need to set the blockSize parameter here like we do in Score?
  // A map of server to its longest prefix cache match length.
  PrefixCacheServers map[ServerID]int
  // Size of a block in tokens
  BlockSize int
should we be consistent and also name this BlockSizeTokens?
- // Update servers with their longest prefix match.
- res[server]++
+ // Update servers with their longest prefix match; prefix length is in tokens.
+ res[server] += blockSize
why do we need to report the longest prefix in tokens, isn't it enough to track it in terms of number of blocks? The less the number of places where we make the blockSize a factor the better, right?
The reason we are switching to tokens here is that this value is used by llm-d-scheduler (in disaggregated PD); if it were defined as a number of blocks, the scheduler would also need to know the block size, since PD's decision about disaggregation is based on the absolute length of the non-cached suffix (measured in tokens).
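The two accountings discussed above differ only by a factor of the block size. A sketch of that conversion, assuming a per-server matched-block count (the `matchedTokens` helper and `ServerID` type here are illustrative, not the plugin's actual `matchLongestPrefix` implementation):

```go
package main

import "fmt"

type ServerID string

// matchedTokens converts a per-server count of matched prefix blocks into
// a prefix length in tokens, which is what a downstream consumer such as
// llm-d-scheduler can use without knowing the block size.
func matchedTokens(blocks map[ServerID]int, blockSizeTokens int) map[ServerID]int {
	res := make(map[ServerID]int, len(blocks))
	for server, n := range blocks {
		res[server] = n * blockSizeTokens
	}
	return res
}

func main() {
	blocks := map[ServerID]int{"pod-a": 3, "pod-b": 1}
	// With blockSizeTokens=16, pod-a has 48 cached prefix tokens.
	fmt.Println(matchedTokens(blocks, 16))
}
```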
- total := len(state.PrefixHashes)
+ // total prefix length in tokens
+ total := len(state.PrefixHashes) * blockSize
If matchLongestPrefix reports the number of matched blocks, then we don't need to multiply by blockSize here, right? Maybe I am missing something, but if we do that, wouldn't we restrict the relevance and use of blockSizeTokens to the function that computes the hashes?
You're right: if we return to block units in matchLongestPrefix, we can keep this without multiplying by the block length. But matchLongestPrefix is used in PrepareRequestData, where we want to report the prefix length in tokens.
Maybe the right approach is to use blocks in all places and multiply by block size only in PrepareRequestData, what do you think?
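For the score itself, either unit works, since the block-size factor cancels in the ratio. A sketch under the assumption that the score is matched length over total length (`prefixScore` is a hypothetical name, not the plugin's actual function):

```go
package main

import "fmt"

// prefixScore returns the fraction of the request prefix already cached on a
// server. Measuring both numerator and denominator in tokens (or both in
// blocks) yields the same ratio, since blockSizeTokens cancels out.
func prefixScore(matchedBlocks, totalBlocks, blockSizeTokens int) float64 {
	if totalBlocks == 0 {
		return 0
	}
	matchedTokens := matchedBlocks * blockSizeTokens
	totalTokens := totalBlocks * blockSizeTokens
	return float64(matchedTokens) / float64(totalTokens)
}

func main() {
	// 3 of 4 blocks cached -> score 0.75 regardless of the block size.
	fmt.Println(prefixScore(3, 4, 16))
	fmt.Println(prefixScore(3, 4, 128))
}
```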
Force-pushed b301782 to c1cea68.
Force-pushed c1cea68 to 30d4ffd.
Sorry for the delayed response. Somehow I missed it in my inbox. @ahg-g
First of all, this is mitigated by the "autoTune" feature, where we can automatically set these knobs. In cases where users need to provide this config (e.g., a model server that doesn't provide such metrics), it's true that users need to understand the "magic number"; however, I think this is by design, because the user needs to know that this is just a magic number and IGW doesn't have the true token count.
I think renaming this to
For the UX gaps: with "autoTune", and future enhancements such as tracking the exact token count in the response metrics and tuning the LRU capacity dynamically, we can completely hide this from users. So to sum it up, I support adding that internal "pseudo token count" but not changing the user-facing config.
autoTune is a great UX, but that is not the focus of the change; it is about the other parameter that exists in the API right now. However, looking at the implementation of autoTune, the block_size metric we read from the model server is in tokens, so that makes it doubly important to change this parameter to tokens, to better align with the autoTune feature.

Good UX is one with the minimum API surface. In the case where this parameter is set and one can't use autoTune, forcing the user to think about the magic number serves no value to the user: they can't change it, since it must align with our internal assumption on how to translate tokens to chars in terms of size. It is also not intuitive, given that the value the user configures on the model server is block_size in tokens.
Thinking a bit more here... Let's take a step back and forget the autoTune feature for a moment. The scorer has two config knobs: the block size, which is how it partitions the input text, and LRUCapacityPerServer, which is how many blocks to keep in the lookup table for each pod. An important note is that the blockSize doesn't necessarily need to match what vLLM has, though matching is preferred; a naive case is blockSize=1, where we simply do a full prefix match. The vLLM blockSize and the magic number come into play when users need to estimate LRUCapacityPerServer. Users don't have to use the "IGW magic number" if they have a better number. So far I think the semantics are clear, though not a great UX. Now if we change it to
So I think we have these options:
Sorry for the back and forth; this just shows the current UX is not straightforward. I am supportive of option 2, to align these with vLLM, but let's make it consistent.
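To illustrate the estimation these knobs imply (all numbers and the `lruCapacityBlocks` helper are hypothetical, not code from the repository): sizing the lookup table from a pod's cache capacity requires the block size only when the capacity knob is block-based.

```go
package main

import "fmt"

// lruCapacityBlocks estimates how many lookup-table entries cover a pod's
// KV cache, given its approximate capacity in tokens. A token-based capacity
// knob would need no such conversion; a block-based one needs the block size.
func lruCapacityBlocks(cacheCapacityTokens, blockSizeTokens int) int {
	if blockSizeTokens <= 0 {
		return 0
	}
	return cacheCapacityTokens / blockSizeTokens
}

func main() {
	// A pod caching ~500k tokens with 16-token blocks -> 31250 LRU entries.
	fmt.Println(lruCapacityBlocks(500_000, 16))
}
```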
Changing both to be token-based sounds good to me as well.
Force-pushed b8d7d46 to 68be9bb.
@mayabar Sorry for the back and forth. Can you also change LRUCapacityPerServer to tokens, as discussed in #2053 (comment)? Will lgtm once that's done. Thanks again for your patience!
@liu-cong @ahg-g |
I strongly believe we should make both unit changes in one PR; otherwise the middle state just adds confusion. Can you do a quick follow-up PR to change the other? If so, I can lgtm this. Thanks for your understanding.
RE: MaxPrefixBlocksToMatch
The
/lgtm Thanks! I can follow up on updating the docs and changing the other config knobs.
…etes-sigs/gateway-api-inference-extension#2053)

* matchLongestPrefix returns cached prefix length in characters instead of tokens
* Change data stored for prefix cache plugin in the prepareData step contains length values in tokens. Add block size to SchedulingContextState of the prefix cache plugin. Tests partial updates
* fix merge problem
* fixes
* rename BlockSize to BlockSizeTokens in prefix plugin
* In the prefix plugin, keep both block size parameters: the legacy one defined in chars and the new one defined in tokens, update tests accordingly
* fix documentation and test
* change prefix plugin implementation to work in block units, in data stored in PrepareRequestData add size of block in tokens
* cosmetic changes
* typo
* rename according the PR review
* fix merge

Signed-off-by: Maya Barnea <mayab@il.ibm.com>
What this PR does / why we need it:
Currently, the prefix length stored in the prefix cache plugin is measured in blocks.
As part of enabling easy configuration for disaggregated PD support in the inference scheduler, all configuration field units will use tokens. This involves converting from characters to tokens using the average token length constant.
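The character-to-token conversion mentioned above can be sketched as follows. The constant value of 4 characters per token and the `charsToTokens` helper are assumptions for illustration; the repository's actual constant and names may differ.

```go
package main

import "fmt"

// avgTokenLengthChars is an assumed average characters-per-token constant;
// the repository's actual constant may differ.
const avgTokenLengthChars = 4

// charsToTokens converts a character-based size to an approximate token
// count, rounding up so short inputs still map to at least one token.
func charsToTokens(chars int) int {
	return (chars + avgTokenLengthChars - 1) / avgTokenLengthChars
}

func main() {
	// A legacy blockSize of 256 characters corresponds to ~64 tokens.
	fmt.Println(charsToTokens(256))
}
```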
Which issue(s) this PR fixes:
Fixes #2068
Does this PR introduce a user-facing change?: