expose the pcieroot device attribute by scanning the pci buses#163
expose the pcieroot device attribute by scanning the pci buses#163ffromani wants to merge 11 commits into
Conversation
move DRA device attribute as reusable constants. Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a LLM-assisted document which explores how the data this driver consume for PCIe locality interacts with NUMA locality. The document provide a detailed, reference causal chain to enable developers trace back the source of the data, with a walkthrough of the current limitations, and how we can navigate them. The document provides the distilled background information which enables and justify the design decisions for the pcie root attribute support. Signed-off-by: Francesco Romani <fromani@redhat.com>
Add a LLM-assisted document which explores how this driver can efficiently and reliably find the PCI root buses from sysfs. The document provides the distilled background information which enables and justify the design decisions for the pcie root attribute support. Signed-off-by: Francesco Romani <fromani@redhat.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@fmuyassarov @pravk03 @AutuSnow as promised. I must say: turned even simpler than I anticipated |
Linux kernel's sysfs reports which CPUs are local to which PCIE root. We can leverage this feature to group CPUs by PCIE root, which is already the standard attribute exposed by the DRA framework. This commit adds the scan logic which we will later use in the DRA layer. **IMPORTANT** we support only x86_64 initially (and not arm64) because we don't have yet enough hardware access and expertise for the ARM platform. Signed-off-by: Francesco Romani <fromani@redhat.com>
add a support tool which we can run to extract the machine
data from sysfs and create fstest.MapFS data to use in unit test.
The usage is straightforward:
```
$ go run tools/gen-pcie-testdata/main.go -h
Usage of /tmp/go-build4024417054/b001/exe/main:
-build-tag string
inject a //go:build constraint for the current architecture, set "" to disable (default "amd64")
-func string
function name for the generated MapFS fixture (default "makeSysfsFixture")
-pkg string
package name for the generated MapFS fixture (default "device")
-sysfs-root string
sysfs root path (default "/sys")
```
We intentionally don't wire the tool in the Makefile (e.g. no target)
because this is meant to be a developer tool, not something users
consume.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The driver is gaining support for string list attribute, so called 'advanced features'. This triggers a lower limit for resource slice size in the apiserver. Therefore, we need to create smaller resource slices on our side. Credits to Feruzjon Muyassarov (fmuyassarov) for finding and investigating the issue. Co-authored-by: Feruzjon Muyassarov <feruzjon.muyassarov@est.tech> Signed-off-by: Francesco Romani <fromani@redhat.com>
0befd73 to
aedfbb3
Compare
|
Thanks @ffromani, this is on my priority list for today. |
Thank you! If it helps, this is where I believe most (like 95%+) of the differences wiht #156 lie: AI-assisted research to motivate the approach to gather HW data. Plain markdown doc. HW detection core change Simplified counterparts wrt #156. Same commit, adapted (and actually simplified) to consume the new scanning algo |
AutuSnow
left a comment
There was a problem hiding this comment.
I like the bus-scan approach in this PR. It is simpler than #156 and avoids the complexity of scanning all PCI devices and deduplicating roots.
Before merging, I think we should tighten a few behavior/UX points: make PCIe roots opt-in/default-off, avoid failing driver startup on PCIe sysfs discovery unless explicitly enabled, compute grouped-mode roots from allocatable CPUs rather than the full CPU set, and document the DRAListTypeAttributes / matchAttribute constraints clearly.
So overall +1 on the direction,
| AttributeNumCPUs: {IntValue: ptr.To(availableCPUsInNUMANode)}, | ||
| } | ||
| device.SetCompatibilityAttributes(deviceAttrs, int64(numaID)) | ||
| cp.setPCIeRootsAttribute(deviceAttrs, numaNodeCPUSet.UnsortedList()...) |
There was a problem hiding this comment.
Same concern as above: I think this should use allocatableCPUs.UnsortedList() rather than numaNodeCPUSet.UnsortedList() so the attribute describes the CPUs that this DRA device can actually allocate.
There was a problem hiding this comment.
absolutely, thanks for catching
Thanks for the review. Fair points, which in the current shape apply to both approaches. I also need to catch up with last comments and updates you reviewers requested in #156 . |
We can reverse map each cpuid to the pcieroot it is local (or reported local) to, and add the corresponding attribute to each cpu device regardless if individual or grouped mode. Because how the linux kernel retrieves the data from the firmware, and because how the ACPI data reports locality, the PCIe root mapping acts as coarse alignment hint nd it is not consumed by the internal CPU selection when the drivers operates in grouped mode. Therefore, currently, the PCIe root locality data is an alias for the NUMA locality data. It is paramount to however note that the fact the data is an alias is a *current implementation detail* not a semantic equivalence; actually the two signals, NUMA locality and PCIe root locality, are designed to be different and the latter is expected to have equal or finer resolution than the former. IOW, the current state of things is an implementation artifact. There's also another catch: the standard attribute is a string value, but because the way the Linux kernel reporting is currently implemented, a CPU core may be reported local to more than a PCIe root. Proper reporting would require StringValues - more than a single root. Therefore, we now need to depend on the feature gate `DRAListTypeAttributes=true`. We add a local `dra.cpu/pcieRoots` (plural) attribute to report all the roots local to a CPU device. There is work in the community around the KEP-5491 to make it transparent to the consumers the fact the standard `resource.kubernetes.io/pcieRoot` attribute can hold both a single scalar value or a list. But still this change may break existing CEL expressions and it's arguable a breaking change. Until the community agrees and settles for a decision and a recommendation, we intentionally DO NOT expose the standard attribute, only the custom one. Signed-off-by: Francesco Romani <fromani@redhat.com>
Bootstrap resource slice tests to check the attribute values. Enable DRAListTypeAttributes, and check we populate `dra.cpu/pcieRoots` correctly and consistently. We can't predict what will show up on CI and how, so the tests are, out of necessity, laxer than they can be. Signed-off-by: Francesco Romani <fromani@redhat.com>
Refactor and incapsulate the PCIe root mapping in its own simple helper type. Because the community is still figuring out the full details of the DRA List Type Attributes backward compatibility, and because even our custom attribute depend on a alpha feature, we make the code opt-in hoping to make our user flows smooth as possible, including the upgrade path. The intention is however to enable the PCIe root reporting by default (switching the default value) once the DRA List Type support enters beta. Furthermore, as soon as the kubernetes community agrees on a backward compatibility path, we will revisit and likely add the standard `resource.kubernetes.io/pcieRoot` attribute. Signed-off-by: Francesco Romani <fromani@redhat.com>
aedfbb3 to
2c3fa97
Compare
|
edited inline the documentation as pointed out; the opt-in behavior is implemented in the last 2 commits to minimize the changes and then the review churn. I think that this split has merit, but I'll leave to reviewers if I should merge the changes in the parent commits in the next upload. |
Now that the PCIe roots reporting is optional, we can restore the old publishing behavior and produce again the large possible ResourceSlices. The 64-device limit is correct while list attributes may be emitted, but it can be unnecessarily conservative otherwise. Signed-off-by: Francesco Romani <fromani@redhat.com>
0d91c09 to
11e02d2
Compare
Now that the pcie root reporting is opt-in, and will be so till at least 1.38, we need to pass through the flag through helm. Since the feature is a strict superset of the old behavior, make sure to enable it on CI. Signed-off-by: Francesco Romani <fromani@redhat.com>
11e02d2 to
2e2a3bf
Compare
pravk03
left a comment
There was a problem hiding this comment.
Thanks. I'm still catching up on this PR and the discussions in #156, but my initial impression is that the top-down scanning from /sys/devices/pci* and reading CPU locality directly from the roots looks to be simpler and faster.
| // may set list-type attributes (StringValues) such as PCIe roots. | ||
| return resourceapi.ResourceSliceMaxDevicesWithAdvancedFeatures | ||
| } | ||
| return maxDevicesPerResourceSlice |
There was a problem hiding this comment.
We can use resourceapi.ResourceSliceMaxDevices here and in tests and remove maxDevicesPerResourceSlice
There was a problem hiding this comment.
True! I don't have preferences. @AutuSnow pointed out that's unnecessary to restrict the slice size if we don't expose list attributes, and both approaches have pros and cons. Open to chat about this :)
no problem at all. I think the list type convo will probably not be settled before the next WG device management mtg, so there's time to review. No rush, especially considering we're waist-deep the KEP season. |
| } | ||
|
|
||
| // IsPCIeRootName matches the constructions of the pcie roots in the linux kernel. | ||
| // The kernel format string is "pci%04x:%02x" (drivers/pci/probe.c:1028). |
There was a problem hiding this comment.
nit: I should mention this applies to linux 7.0.9.
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Alternative take to #156 implementing the algo scketched in #156 (comment)
Expose the pcie root attribute for devices in both grouped and individual mode, leveraging the cpu locality of PCIe devices reported by the linux kernel.
The core idea: cpus don't expose their proximity to PCIe roots, but PCIe buses do expose the CPUs local to them, so we can build the reverse mapping and from there feed into attributes.
The reality turned out to be quirkier, and while the approach still works, the current representation tend to collapse towards the NUMA representation, even though on paper the pcie root locality should be finer-grained.
However, the net result is that the reported data is coarser than actually true at silicon level, but never wrong.
As for the implementation, we do that in stages:
special notes for the reviewers
pkg/deviceand in its support code (tools/gen-pcie-testdata/...)pcieRootattribute and how to handle the scalar/list situation. Until we reach agreement, we can only expose a customdra.cpu/pcieRootsattribute. Either way, we now depend on theDRAListTypeAttributesFG being set to true.