Skip to content

expose the pcieroot device attribute by scanning the pci buses#163

Open
ffromani wants to merge 11 commits into
kubernetes-sigs:mainfrom
ffromani:expose-pcieroot-bus-scan
Open

expose the pcieroot device attribute by scanning the pci buses#163
ffromani wants to merge 11 commits into
kubernetes-sigs:mainfrom
ffromani:expose-pcieroot-bus-scan

Conversation

@ffromani
Copy link
Copy Markdown
Contributor

@ffromani ffromani commented Jun 3, 2026

Alternative take to #156 implementing the algo scketched in #156 (comment)

Expose the pcie root attribute for devices in both grouped and individual mode, leveraging the cpu locality of PCIe devices reported by the linux kernel.
The core idea: cpus don't expose their proximity to PCIe roots, but PCIe buses do expose the CPUs local to them, so we can build the reverse mapping and from there feed into attributes.

The reality turned out to be quirkier, and while the approach still works, the current representation tend to collapse towards the NUMA representation, even though on paper the pcie root locality should be finer-grained.
However, the net result is that the reported data is coarser than actually true at silicon level, but never wrong.

As for the implementation, we do that in stages:

  • add machinery to discover CPUs local to a pcieroot. There's no forward mapping CPU->devices, so we do that backwards: enumerate PCI devices, find host bridges, query the kernel for CPUs local to those
  • expose the PCIe root attribute in grouped and individual mode.

special notes for the reviewers

  1. reviewers familiar with expose the pcieroot device attribute finding the roots from devices #156 will notice almost all the changes are confined in pkg/device and in its support code (tools/gen-pcie-testdata/...)
  2. discussion is still pending around the standard pcieRoot attribute and how to handle the scalar/list situation. Until we reach agreement, we can only expose a custom dra.cpu/pcieRoots attribute. Either way, we now depend on the DRAListTypeAttributes FG being set to true.
  3. ARM64 support is intentionally deferred till we gain more experience and more hw support

ffromani added 3 commits June 3, 2026 13:25
move DRA device attribute as reusable constants.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a LLM-assisted document which explores how the data this
driver consume for PCIe locality interacts with NUMA locality.

The document provide a detailed, reference causal chain to enable
developers trace back the source of the data, with a walkthrough
of the current limitations, and how we can navigate them.

The document provides the distilled background information which
enables and justify the design decisions for the pcie root attribute
support.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a LLM-assisted document which explores how this driver
can efficiently and reliably find the PCI root buses from sysfs.

The document provides the distilled background information which
enables and justify the design decisions for the pcie root attribute
support.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from pohly June 3, 2026 11:30
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 3, 2026
@ffromani
Copy link
Copy Markdown
Contributor Author

ffromani commented Jun 3, 2026

@fmuyassarov @pravk03 @AutuSnow

as promised. I must say: turned even simpler than I anticipated

ffromani and others added 3 commits June 3, 2026 13:35
Linux kernel's sysfs reports which CPUs are local
to which PCIE root. We can leverage this feature
to group CPUs by PCIE root, which is already the
standard attribute exposed by the DRA framework.

This commit adds the scan logic which we will later
use in the DRA layer.

**IMPORTANT** we support only x86_64 initially (and not arm64)
because we don't have yet enough hardware access and expertise
for the ARM platform.

Signed-off-by: Francesco Romani <fromani@redhat.com>
add a support tool which we can run to extract the machine
data from sysfs and create fstest.MapFS data to use in unit test.
The usage is straightforward:
```
$ go run tools/gen-pcie-testdata/main.go -h
Usage of /tmp/go-build4024417054/b001/exe/main:
  -build-tag string
    	inject a //go:build constraint for the current architecture, set "" to disable (default "amd64")
  -func string
    	function name for the generated MapFS fixture (default "makeSysfsFixture")
  -pkg string
    	package name for the generated MapFS fixture (default "device")
  -sysfs-root string
    	sysfs root path (default "/sys")
```

We intentionally don't wire the tool in the Makefile (e.g. no target)
because this is meant to be a developer tool, not something users
consume.

Signed-off-by: Francesco Romani <fromani@redhat.com>
The driver is gaining support for string list attribute, so
called 'advanced features'. This triggers a lower limit
for resource slice size in the apiserver.
Therefore, we need to create smaller resource slices on our side.

Credits to Feruzjon Muyassarov (fmuyassarov) for finding and
investigating the issue.

Co-authored-by: Feruzjon Muyassarov <feruzjon.muyassarov@est.tech>
Signed-off-by: Francesco Romani <fromani@redhat.com>
@fmuyassarov
Copy link
Copy Markdown
Member

Thanks @ffromani, this is on my priority list for today.

@ffromani
Copy link
Copy Markdown
Contributor Author

ffromani commented Jun 4, 2026

Thanks @ffromani, this is on my priority list for today.

Thank you! If it helps, this is where I believe most (like 95%+) of the differences wiht #156 lie:

AI-assisted research to motivate the approach to gather HW data. Plain markdown doc.
97e8bf4

HW detection core change
4dc7f26

Simplified counterparts wrt #156. Same commit, adapted (and actually simplified) to consume the new scanning algo
7438c33
d23eeea

Copy link
Copy Markdown
Contributor

@AutuSnow AutuSnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the bus-scan approach in this PR. It is simpler than #156 and avoids the complexity of scanning all PCI devices and deduplicating roots.

Before merging, I think we should tighten a few behavior/UX points: make PCIe roots opt-in/default-off, avoid failing driver startup on PCIe sysfs discovery unless explicitly enabled, compute grouped-mode roots from allocatable CPUs rather than the full CPU set, and document the DRAListTypeAttributes / matchAttribute constraints clearly.

So overall +1 on the direction,

Comment thread pkg/driver/dra_hooks.go Outdated
Comment thread pkg/driver/dra_hooks.go Outdated
AttributeNumCPUs: {IntValue: ptr.To(availableCPUsInNUMANode)},
}
device.SetCompatibilityAttributes(deviceAttrs, int64(numaID))
cp.setPCIeRootsAttribute(deviceAttrs, numaNodeCPUSet.UnsortedList()...)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as above: I think this should use allocatableCPUs.UnsortedList() rather than numaNodeCPUSet.UnsortedList() so the attribute describes the CPUs that this DRA device can actually allocate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely, thanks for catching

Comment thread pkg/driver/driver.go Outdated
Comment thread pkg/driver/dra_hooks.go Outdated
Comment thread README.md
@ffromani
Copy link
Copy Markdown
Contributor Author

ffromani commented Jun 4, 2026

I like the bus-scan approach in this PR. It is simpler than #156 and avoids the complexity of scanning all PCI devices and deduplicating roots.  Before merging, I think we should tighten a few behavior/UX points: make PCIe roots opt-in/default-off, avoid failing driver startup on PCIe sysfs discovery unless explicitly enabled, compute grouped-mode roots from allocatable CPUs rather than the full CPU set, and document the DRAListTypeAttributes / matchAttribute constraints clearly.  So overall +1 on the direction,

Thanks for the review. Fair points, which in the current shape apply to both approaches. I also need to catch up with last comments and updates you reviewers requested in #156 .
Honestly I'm waiting for consensus about which direction should we pick so I can post updates just to one PR :)

ffromani added 3 commits June 4, 2026 13:27
We can reverse map each cpuid to the pcieroot it is local
(or reported local) to, and add the corresponding
attribute to each cpu device regardless if individual
or grouped mode.

Because how the linux kernel retrieves the data from the firmware,
and because how the ACPI data reports locality, the PCIe root mapping acts
as coarse alignment hint nd it is not consumed by the internal CPU selection
when the drivers operates in grouped mode.

Therefore, currently, the PCIe root locality data is an alias
for the NUMA locality data. It is paramount to however note that
the fact the data is an alias is a *current implementation detail*
not a semantic equivalence; actually the two signals, NUMA locality
and PCIe root locality, are designed to be different and the latter
is expected to have equal or finer resolution than the former.
IOW, the current state of things is an implementation artifact.

There's also another catch: the standard attribute is a string value,
but because the way the Linux kernel reporting is currently implemented,
a CPU core may be reported local to more than a PCIe root.

Proper reporting would require StringValues - more than a single root.
Therefore, we now need to depend on the feature gate `DRAListTypeAttributes=true`.
We add a local `dra.cpu/pcieRoots` (plural) attribute to report all the roots local to a CPU device.

There is work in the community around the KEP-5491 to make it
transparent to the consumers the fact the standard `resource.kubernetes.io/pcieRoot`
attribute can hold both a single scalar value or a list.
But still this change may break existing CEL expressions and it's
arguable a breaking change.
Until the community agrees and settles for a decision and a
recommendation, we intentionally DO NOT expose the standard attribute,
only the custom one.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Bootstrap resource slice tests to check the attribute values.
Enable DRAListTypeAttributes, and check we populate
`dra.cpu/pcieRoots` correctly and consistently.
We can't predict what will show up on CI and how, so the tests
are, out of necessity, laxer than they can be.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Refactor and incapsulate the PCIe root mapping in its own
simple helper type.
Because the community is still figuring out the full details
of the DRA List Type Attributes backward compatibility, and
because even our custom attribute depend on a alpha feature,
we make the code opt-in hoping to make our user flows
smooth as possible, including the upgrade path.

The intention is however to enable the PCIe root reporting
by default (switching the default value) once the DRA List
Type support enters beta.

Furthermore, as soon as the kubernetes community agrees
on a backward compatibility path, we will revisit and
likely add the standard `resource.kubernetes.io/pcieRoot` attribute.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani force-pushed the expose-pcieroot-bus-scan branch from aedfbb3 to 2c3fa97 Compare June 4, 2026 12:40
@ffromani
Copy link
Copy Markdown
Contributor Author

ffromani commented Jun 4, 2026

edited inline the documentation as pointed out; the opt-in behavior is implemented in the last 2 commits to minimize the changes and then the review churn.

I think that this split has merit, but I'll leave to reviewers if I should merge the changes in the parent commits in the next upload.

Now that the PCIe roots reporting is optional, we can
restore the old publishing behavior and produce again
the large possible ResourceSlices.

The 64-device limit is correct while list attributes may be
emitted, but it can be unnecessarily conservative otherwise.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani force-pushed the expose-pcieroot-bus-scan branch 2 times, most recently from 0d91c09 to 11e02d2 Compare June 4, 2026 13:03
Now that the pcie root reporting is opt-in, and will be so
till at least 1.38, we need to pass through the flag through
helm.

Since the feature is a strict superset of the old behavior,
make sure to enable it on CI.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani force-pushed the expose-pcieroot-bus-scan branch from 11e02d2 to 2e2a3bf Compare June 4, 2026 13:06
Copy link
Copy Markdown
Contributor

@pravk03 pravk03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm still catching up on this PR and the discussions in #156, but my initial impression is that the top-down scanning from /sys/devices/pci* and reading CPU locality directly from the roots looks to be simpler and faster.

Comment thread pkg/driver/driver.go
// may set list-type attributes (StringValues) such as PCIe roots.
return resourceapi.ResourceSliceMaxDevicesWithAdvancedFeatures
}
return maxDevicesPerResourceSlice
Copy link
Copy Markdown
Contributor

@pravk03 pravk03 Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use resourceapi.ResourceSliceMaxDevices here and in tests and remove maxDevicesPerResourceSlice

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! I don't have preferences. @AutuSnow pointed out that's unnecessary to restrict the slice size if we don't expose list attributes, and both approaches have pros and cons. Open to chat about this :)

@ffromani
Copy link
Copy Markdown
Contributor Author

ffromani commented Jun 4, 2026

Thanks. I'm still catching up on this PR and the discussions in #156, but my initial impression is that the top-down scanning from /sys/devices/pci* and reading CPU locality directly from the roots looks to be simpler and faster.

no problem at all. I think the list type convo will probably not be settled before the next WG device management mtg, so there's time to review. No rush, especially considering we're waist-deep the KEP season.

}

// IsPCIeRootName matches the constructions of the pcie roots in the linux kernel.
// The kernel format string is "pci%04x:%02x" (drivers/pci/probe.c:1028).
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I should mention this applies to linux 7.0.9.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants