3 changes: 2 additions & 1 deletion Makefile
Expand Up @@ -204,7 +204,8 @@ endif
--set image.pullPolicy=IfNotPresent \
--set args.logLevel=6 \
--set args.cpuDeviceMode=${DRACPU_E2E_CPU_DEVICE_MODE} \
- --set-string args.reservedCPUs=${DRACPU_E2E_RESERVED_CPUS}
+ --set-string args.reservedCPUs=${DRACPU_E2E_RESERVED_CPUS} \
+ --set args.exposePCIeRoots=true
hack/ci/wait-resourcelices.sh

build-test-image: ## build tests image
70 changes: 70 additions & 0 deletions README.md
Expand Up @@ -147,6 +147,76 @@ However, this is only a partial replacement of the corresponding CPU Manager opt
We hardcode the NUMA split and, unlike the cpumanager feature, it won't automatically adapt if the same claim is handled by a 1-NUMA, 2-NUMA or 4-NUMA machine;
the claim would need to be updated or recreated manually.

### Exposing PCIe roots

The DRA CPU Driver can expose the PCIe root locality of CPU devices via the driver-specific `dra.cpu/pcieRoots` attribute.
This feature is opt-in, and requires _both_ the `DRAListTypeAttributes` feature gate (see KEP-5491) to be enabled in the cluster and the `--expose-pcie-roots` command-line
flag to be set on the driver. The driver has no way to introspect the state of the cluster feature gates, so care must be taken to keep the configuration consistent.

**IMPORTANT NOTE**: the recommended way to consume the `pcieRoots` list attribute is through `matchAttribute` or [the derived attributes](https://github.com/kubernetes/enhancements/issues/6080).
Take care when consuming the attribute through CEL expression selectors, because the backward compatibility path is not yet clear
(see https://github.com/kubernetes/enhancements/pull/6081#issuecomment-4606653735 and following).

#### Implementation details

While the PCIe root locality of CPUs is not exposed directly, the reverse is available: the Linux kernel reports the CPUs local to PCIe buses and devices.
The driver scans the PCIe buses and tracks the CPU locality of the PCIe host bridges; from there, it reconstructs the CPU to PCIe root mapping and populates the attributes.

Because of how the Linux kernel retrieves the data from the firmware, and because of how the ACPI data reports locality, the PCIe root mapping acts as a coarse alignment hint,
and it is not consumed by the internal CPU selection when the driver operates in grouped mode.
While the driver can consume the PCIe root locality data in its internal CPU selection, the locality granularity depends entirely on the kernel-provided data.
These limitations are planned to be addressed (CPU selection input) and mitigated (coarse granularity) in future versions of the driver.

This is an example of a resource slice produced by a driver running in a kind CI cluster, in grouped mode, grouping by NUMA nodes:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  creationTimestamp: "2026-05-29T14:09:35Z"
  generateName: 00000-dra.cpu-dra-driver-cpu-worker-
  generation: 1
  name: 00000-dra.cpu-dra-driver-cpu-worker-v7pdl
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: dra-driver-cpu-worker
    uid: 80fbb23c-ae26-44b4-a21a-dce4037db82d
  resourceVersion: "651"
  uid: 08664794-f96b-43fd-b8ce-233c7bd172f6
spec:
  devices:
  - allowMultipleAllocations: true
    attributes:
      dra.cpu/numCPUs:
        int: 31
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/pcieRoots:
        strings:
        - pci0000:00
      dra.cpu/smtEnabled:
        bool: true
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    capacity:
      dra.cpu/cpu:
        value: "31"
    name: cpudevnuma000
  driver: dra.cpu
  nodeName: dra-driver-cpu-worker
  pool:
    generation: 1
    name: dra-driver-cpu-worker
    resourceSliceCount: 1
```

Note that the number of PCIe roots may vary: it depends both on the physical wiring of the system and on whether slots are populated.
Most firmware does not enumerate PCIe buses, and therefore does not expose PCIe roots, if no devices are connected.

## Workload Configuration Requirements

Currently, Kubernetes has two separate systems for requesting CPU resources: standard requests in pod/container fields (`pod.spec.resources` or `pod.spec.containers[].resources`) and DRA `ResourceClaim`s.
3 changes: 3 additions & 0 deletions cmd/dracpu/app.go
Expand Up @@ -52,6 +52,7 @@ var (
ready atomic.Bool
cpuDeviceMode string
groupBy string
exposePCIeRoots bool
)

type cpuDeviceModeValue struct {
Expand Up @@ -109,6 +110,7 @@ func init() {
flag.StringVar(&reservedCPUs, "reserved-cpus", "", "cpuset of CPUs to be excluded from ResourceSlice.")
flag.Var(newCPUDeviceModeValue(&cpuDeviceMode, driver.CPU_DEVICE_MODE_GROUPED), "cpu-device-mode", "Sets the mode for exposing CPU devices. 'grouped' exposes a single device per socket or numa node (based on --group-by). 'individual' exposes each CPU as a separate device.")
flag.Var(newGroupByValue(&groupBy, driver.GROUP_BY_NUMA_NODE), "group-by", "When --cpu-device-mode=grouped, sets the criteria for grouping CPUs. Can be set to 'socket' or 'numanode'.")
flag.BoolVar(&exposePCIeRoots, "expose-pcie-roots", exposePCIeRoots, "Discover and expose PCIe roots as device attributes. Requires the `DRAListTypeAttributes=true` Feature Gate in the cluster.")
}

func main() {
Expand Down Expand Up @@ -205,6 +207,7 @@ func run(logger logr.Logger) error {
ReservedCPUs: reservedCPUSet,
CPUDeviceMode: cpuDeviceMode,
CPUDeviceGroupBy: groupBy,
ExposePCIeRoots: exposePCIeRoots,
}
dracpu, asyncErr, err := driver.Start(ctx, clientset, driverConfig)
if err != nil {
1 change: 1 addition & 0 deletions deployment/helm/dra-driver-cpu/README.md
Expand Up @@ -30,6 +30,7 @@ helm install dra-driver-cpu ./deployment/helm/dra-driver-cpu -n kube-system -f m
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| args.cpuDeviceMode | string | `"grouped"` | CPU exposure mode: `grouped` (expose NUMA nodes or sockets as devices) or `individual` (expose each CPU as a device) |
| args.exposePCIeRoots | bool | `false` | Discover and expose PCIe roots as device attributes. Requires the `DRAListTypeAttributes=true` feature gate in the cluster |
| args.groupBy | string | `"numanode"` | Grouping criteria when `cpuDeviceMode=grouped`: `numanode` or `socket` |
| args.hostnameOverride | string | `""` | Override the node name the driver registers under; omitted when empty |
| args.logLevel | int | `4` | Log verbosity level passed as `--v` |
3 changes: 3 additions & 0 deletions deployment/helm/dra-driver-cpu/templates/daemonset.yaml
Expand Up @@ -64,6 +64,9 @@ spec:
{{- if .Values.args.hostnameOverride }}
- --hostname-override={{ .Values.args.hostnameOverride }}
{{- end }}
{{- if .Values.args.exposePCIeRoots }}
- --expose-pcie-roots
{{- end }}
image: {{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
4 changes: 4 additions & 0 deletions deployment/helm/dra-driver-cpu/values.schema.json
Expand Up @@ -18,6 +18,10 @@
"individual"
]
},
"exposePCIeRoots": {
"description": "Discover and expose PCIe roots as device attributes. Requires the `DRAListTypeAttributes=true` feature gate in the cluster",
"type": "boolean"
},
"groupBy": {
"description": "Grouping criteria when `cpuDeviceMode=grouped`: `numanode` or `socket`",
"type": "string",
2 changes: 2 additions & 0 deletions deployment/helm/dra-driver-cpu/values.yaml
Expand Up @@ -71,6 +71,8 @@ args:
reservedCPUs: ""
# -- Override the node name the driver registers under; omitted when empty
hostnameOverride: ""
# -- Discover and expose PCIe roots as device attributes. Requires the `DRAListTypeAttributes=true` feature gate in the cluster
exposePCIeRoots: false # @schema type:boolean

# -- Path for liveness and readiness probes
healthzPath: /healthz
92 changes: 92 additions & 0 deletions docs/dev/pci-bus-linux-sysfs.md
@@ -0,0 +1,92 @@
# Deep dive: finding and exploring PCI/PCIe root buses

Disclosure: this document was created with AI assistance.
The research was human-steered but LLM-operated; the document was then reviewed by humans
for correctness and edited extensively.

This document assumes knowledge of kernel code, architecture and hardware management.

This document tries to be architecture agnostic whenever possible, but the author can only claim sufficient
knowledge about the `x86_64` architecture.
The reader should assume that the "silicon" and hardware are `x86_64` unless clearly specified otherwise.

## Key assumptions and key omissions

- Bare-metal x86 servers (roughly 2020+ generation).
- Linux kernel behavior as in version 7.0.9.

## Summary

An analysis of the user-visible sysfs pseudo filesystem layout, from both userspace and the kernel source code,
shows two reliable ways to identify root buses in Linux:

1. **Resolve each symlink in `/sys/class/pci_bus`** and check whether the path between `devices/` and `pci_bus/`
contains only the `pciDDDD:BB` root directory (root bus) or also contains intermediate BDF device components (child bus).
1. **Enumerate `/sys/devices/pci*/`** directly — every directory matching that glob is a root complex, by construction.

While both methods are equivalent, method 2 avoids symlink resolution, which can lead to a simpler implementation
and/or data wrapping.

## How Root Buses Are Created

Root buses are registered exclusively through `pci_register_host_bridge()`.

**`drivers/pci/probe.c:987-1061`** — the relevant sequence:

```
line 1000: bus = pci_alloc_bus(NULL); // NULL parent = root bus
line 1028: dev_set_name(&bridge->dev, "pci%04x:%02x", ...) // creates "pci0000:XX"
line 1037: device_add(&bridge->dev); // registers /sys/devices/pci0000:XX/
line 1041: bus->bridge = get_device(&bridge->dev);
line 1052: bus->dev.class = &pcibus_class; // associates with /sys/class/pci_bus/
line 1053: bus->dev.parent = bus->bridge; // parent = the bridge device
line 1055: dev_set_name(&bus->dev, "%04x:%02x", ...) // bus device named "0000:XX"
line 1058: device_register(&bus->dev); // registers the bus class device
```

The call at line 1028 is the **only place** in the entire kernel that creates the `pci%04x:%02x` device name pattern
(verification: grep in `drivers/pci/`).
Therefore: every directory under `/sys/devices/` matching `pci[0-9a-f]*:[0-9a-f]*` is a root complex created by this function.

The bus class device (line 1058) gets registered with `pcibus_class`, which creates the `/sys/class/pci_bus/0000:XX` symlink.

Since `bus->dev.parent = bus->bridge` (line 1053), and `bus->bridge` is the host bridge device at `/sys/devices/pci0000:XX/`,
the class symlink target path resolves to:

```
../../devices/pci0000:XX/pci_bus/0000:XX
```

## How Secondary Buses Are Created

Secondary (`child` in Linux kernel parlance) buses are created through `pci_alloc_child_bus()`.

**`drivers/pci/probe.c:1200-1269`** — the relevant sequence:

```
line 1209: child = pci_alloc_bus(parent); // non-NULL parent = child bus
line 1213: child->parent = parent;
line 1214: child->sysdata = parent->sysdata; // inherits sysdata (including NUMA node)
line 1227: child->dev.class = &pcibus_class;
line 1228: dev_set_name(&child->dev, "%04x:%02x", ...) // same naming scheme
line 1242: child->dev.parent = child->bridge; // parent = the PCI-to-PCI bridge *device*
line 1265: device_register(&child->dev);
```

The critical difference is at line 1242: `child->bridge` points to the **PCI-to-PCI bridge device**
(a `pci_dev` with a BDF address like `0000:00:0e.0`), not the host bridge.
This device sits within the parent bus's sysfs directory. So the class symlink resolves to a deeper path:

```
../../devices/pci0000:00/0000:00:0e.0/pci_bus/0000:03
                         ^^^^^^^^^^^^
                         bridge device BDF component
```

For multi-level topologies, each intermediate bridge adds another BDF segment:

```
../../devices/pci0000:00/0000:00:0e.0/0000:03:00.0/pci_bus/0000:04
                         ^^^^^^^^^^^^ ^^^^^^^^^^^^
                         1st bridge   2nd bridge
```
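The path shapes shown above suggest a simple classifier for resolved `/sys/class/pci_bus` symlink targets: count the BDF components between the `pciDDDD:BB` directory and `pci_bus/`. The Go sketch below is one possible way to do that, not code from the driver; `bridgeDepth` and both regular expressions are hypothetical, and it assumes the target path has already been read (for example with `os.Readlink`).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// rootDir matches the host bridge directory, e.g. "pci0000:00".
	rootDir = regexp.MustCompile(`^pci[0-9a-f]{4,}:[0-9a-f]{2}$`)
	// bdfDir matches a bridge device directory, e.g. "0000:00:0e.0".
	bdfDir = regexp.MustCompile(`^[0-9a-f]{4,}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$`)
)

// bridgeDepth returns the root directory name and the number of bridge
// BDF components found before "pci_bus". A depth of 0 means a root bus.
func bridgeDepth(target string) (string, int, error) {
	parts := strings.Split(target, "/")
	start := -1
	for i, p := range parts {
		if rootDir.MatchString(p) {
			start = i
			break
		}
	}
	if start < 0 {
		return "", 0, fmt.Errorf("no PCI root directory in %q", target)
	}
	depth := 0
	for _, p := range parts[start+1:] {
		if p == "pci_bus" {
			return parts[start], depth, nil
		}
		if !bdfDir.MatchString(p) {
			return "", 0, fmt.Errorf("unexpected component %q in %q", p, target)
		}
		depth++
	}
	return "", 0, fmt.Errorf("no pci_bus component in %q", target)
}

func main() {
	for _, target := range []string{
		"../../devices/pci0000:00/pci_bus/0000:00",
		"../../devices/pci0000:00/0000:00:0e.0/pci_bus/0000:03",
		"../../devices/pci0000:00/0000:00:0e.0/0000:03:00.0/pci_bus/0000:04",
	} {
		root, depth, err := bridgeDepth(target)
		fmt.Println(root, depth, err)
	}
}
```

Searching for the first `pciDDDD:BB` component, rather than requiring it right after `devices/`, is a design choice that keeps the sketch usable if the host bridge directory is nested deeper.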