POC for Opaque config in ResourceClaim status instead of Individual Mode #165

Draft
pravk03 wants to merge 4 commits into kubernetes-sigs:main from pravk03:opaque-poc

@pravk03 pravk03 commented Jun 4, 2026

POC for #164

pravk03 added 4 commits June 4, 2026 16:22
Allows bypassing the automatic topology-packed CPU allocator in
GROUP_BY_MACHINE mode by specifying a custom "cpuset" parameter
in the claim's opaque configurations.
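
For illustration only, here is a minimal sketch of how such an override could be read. It assumes the opaque parameters arrive as a JSON object whose "cpuset" value is a Linux-style CPU list; the JSON shape and the names `opaqueCPUConfig`, `parseCPUList` and `cpusetFromOpaque` are hypothetical and not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// opaqueCPUConfig is a hypothetical shape for the opaque parameters.
// Only the "cpuset" key comes from the commit message; the rest is assumed.
type opaqueCPUConfig struct {
	CPUSet string `json:"cpuset"`
}

// parseCPUList expands a CPU list such as "0-3,8" into sorted, de-duplicated IDs.
func parseCPUList(s string) ([]int, error) {
	seen := map[int]bool{}
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty element in cpuset %q", s)
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU %q: %w", lo, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid CPU %q: %w", hi, err)
			}
		}
		if start < 0 || end < start {
			return nil, fmt.Errorf("invalid range %q", part)
		}
		for c := start; c <= end; c++ {
			seen[c] = true
		}
	}
	cpus := make([]int, 0, len(seen))
	for c := range seen {
		cpus = append(cpus, c)
	}
	sort.Ints(cpus)
	return cpus, nil
}

// cpusetFromOpaque returns the requested CPUs and whether an override was present.
func cpusetFromOpaque(raw []byte) ([]int, bool, error) {
	var cfg opaqueCPUConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, false, fmt.Errorf("decoding opaque config: %w", err)
	}
	if cfg.CPUSet == "" {
		return nil, false, nil // no override: the caller decides what happens next
	}
	cpus, err := parseCPUList(cfg.CPUSet)
	return cpus, err == nil, err
}

func main() {
	cpus, ok, err := cpusetFromOpaque([]byte(`{"cpuset": "0-3,8"}`))
	fmt.Println(cpus, ok, err)
}
```

Returning a separate "present" flag keeps the missing-override case distinct from a malformed one, so the caller can choose between failing and falling back.
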
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pravk03

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 4, 2026

@ffromani ffromani left a comment

thanks for the PoC. It seems to me this is already in a pretty good shape. The main blockers I'm seeing atm are

  1. decision on having the opaque data mandatory or optional/recommended
  2. the schema of the opaque data. It's very implicit. I wonder how we can make it explicit, besides docs.
  3. some e2e tests

Comment thread pkg/driver/dra_hooks.go
})
}
case GROUP_BY_MACHINE:
deviceName := fmt.Sprintf("%s%03d", cpuDeviceMachineGroupedPrefix, 0)

nit: should be a constant, we will always have a single device

Comment thread pkg/driver/dra_hooks.go
allocatableCPUs := allCPUs.Difference(cp.reservedCPUs)
availableCPUs := int64(allocatableCPUs.Size())

if allocatableCPUs.Size() > 0 {

this should probably be a generic pre-flight check in driver.Start() or, in general, should be done earlier I believe

Comment thread pkg/driver/dra_hooks.go
Comment on lines +376 to +379
} else {
availableCPUsForDevice := sharedCPUs.Difference(cpuAssignment)
logger.V(4).Info("no opaque cpuset config override found, falling back to topology-packed allocator", "device", alloc.Device, "availableCPUs", availableCPUsForDevice.String())
cur, err = cpumanager.TakeByTopologyNUMAPacked(logger, topo, availableCPUsForDevice, int(claimCPUCount), cpumanager.CPUSortingStrategyPacked, true)

this is probably the main/only open point in the PoC/proposal: if we do not get a cpuset in the opaque data, should we hard fail or do a best-effort allocation?
If the driver focuses on exclusive CPU allocation, I don't see much value in giving X random exclusive CPUs.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2026
@k8s-ci-robot

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pravk03

pravk03 commented Jun 6, 2026

  1. decision on having the opaque data mandatory or optional/recommended

The current allocation logic attempts to pack CPUs within a single NUMA node and spills over across NUMA domains only if it cannot satisfy the request on a single node. So, this already provides a soft guarantee.
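
To make the "pack first, spill only if needed" behavior concrete, here is a simplified sketch. It is not the code behind `cpumanager.TakeByTopologyNUMAPacked`; `takePacked` and the per-node free lists are hypothetical stand-ins that ignore cores, sockets and sorting strategies.

```go
package main

import (
	"fmt"
	"sort"
)

// takePacked picks n CPUs, preferring a single NUMA node and spilling across
// nodes only when no single node can satisfy the request.
func takePacked(freeByNode map[int][]int, n int) ([]int, error) {
	nodes := make([]int, 0, len(freeByNode))
	total := 0
	for id, cpus := range freeByNode {
		nodes = append(nodes, id)
		total += len(cpus)
	}
	if n <= 0 || n > total {
		return nil, fmt.Errorf("cannot allocate %d CPUs, %d free", n, total)
	}
	sort.Ints(nodes) // deterministic node order

	// Pass 1: the request fits on one node, so keep it there.
	for _, id := range nodes {
		if len(freeByNode[id]) >= n {
			return append([]int(nil), freeByNode[id][:n]...), nil
		}
	}

	// Pass 2: spill over, taking from the nodes with the most free CPUs first
	// so that as few NUMA domains as possible are touched.
	sort.SliceStable(nodes, func(i, j int) bool {
		return len(freeByNode[nodes[i]]) > len(freeByNode[nodes[j]])
	})
	out := make([]int, 0, n)
	for _, id := range nodes {
		for _, c := range freeByNode[id] {
			if len(out) == n {
				return out, nil
			}
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	free := map[int][]int{0: {0, 1}, 1: {4, 5, 6, 7}}
	fmt.Println(takePacked(free, 3)) // fits on node 1
	fmt.Println(takePacked(free, 5)) // spills from node 1 into node 0
}
```
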

Some points for keeping the opaque parameter optional:

  • The behavior is similar to Kubelet's best-effort and restricted Topology Manager policies, so this would give us better feature parity with Kubelet.
  • Currently, with socket or numanode groupings, if a request cannot fit within this boundary, claim allocation fails. The fallback allows us to run workloads across these boundaries if they are tolerant. I am thinking of smaller machines with multiple NUMA nodes, or machines with a lot of NUMA nodes, here.

However, the point you bring up for making opaque mandatory is also valid. The fallback would result in unpredictable performance depending on how we allocate the CPUs.

All in all, I am OK with either option here. A third option is adding a flag to make it configurable, though the question then becomes what the default behavior should be. Maybe we can start with making opaque mandatory, as that is the use case at hand, while we think more about other options.

  2. the schema of the opaque data. It's very implicit. I wonder how we can make it explicit, besides docs.

Good point. I will think more and try to come up with a proposal.
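
One possible direction, sketched here purely as an assumption, is a typed and versioned struct decoded strictly so that unknown or misspelled keys are rejected rather than silently ignored. The kind name, the `apiVersion` value and the field names below are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Hypothetical identifiers for the opaque payload; not defined by this PR.
const (
	cpuConfigAPIVersion = "example.invalid/v1alpha1"
	cpuConfigKind       = "CPUAssignment"
)

// cpuAssignment is an explicit, versioned shape for the opaque data.
type cpuAssignment struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	CPUSet     string `json:"cpuset"`
}

// decodeStrict rejects unknown fields, wrong kinds/versions and empty cpusets.
func decodeStrict(raw []byte) (*cpuAssignment, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var cfg cpuAssignment
	if err := dec.Decode(&cfg); err != nil {
		return nil, fmt.Errorf("decoding opaque config: %w", err)
	}
	if cfg.APIVersion != cpuConfigAPIVersion || cfg.Kind != cpuConfigKind {
		return nil, fmt.Errorf("unsupported config %s/%s", cfg.APIVersion, cfg.Kind)
	}
	if cfg.CPUSet == "" {
		return nil, fmt.Errorf("cpuset must not be empty")
	}
	return &cfg, nil
}

func main() {
	ok := `{"apiVersion":"example.invalid/v1alpha1","kind":"CPUAssignment","cpuset":"0-3"}`
	fmt.Println(decodeStrict([]byte(ok)))
	typo := `{"apiVersion":"example.invalid/v1alpha1","kind":"CPUAssignment","cpusets":"0-3"}`
	fmt.Println(decodeStrict([]byte(typo)))
}
```
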

  3. some e2e tests

I think we need this only if we allow fallback. With opaque, it requires mocking the scheduler to write the opaque config to the claim's status. I haven't looked at how feasible this is within our E2E framework yet. Plus, we would only be testing the parsing logic, which we can easily cover with unit tests instead.

3 participants