Description
What would you like to be added:
Kueue today supports preference-based placement via flavor fungibility, but these preferences are soft-only and cannot be expressed as hard scheduling constraints. In addition, MultiKueue dispatching strategies (AllAtOnce, Incremental) are race-based, meaning the first worker cluster that admits a workload wins, regardless of whether that placement is optimal.
This issue proposes extending Kueue with workload-level scheduling constraints and updating MultiKueue dispatching to be preference-aware rather than timing-driven.
Scheduling constraints must be specified per workload, at the workload level.
They should not be global defaults and should not be ClusterQueue-wide policies.
The intent is to allow different workloads sharing the same ClusterQueue to express different scheduling requirements.
This mirrors Kubernetes design patterns, where constraints are typically attached to the object being scheduled (e.g., Pods), not to the scheduler or queue globally.
Why is this needed:
Problem
Single-cluster limitations
Currently, users cannot express strict workload-specific guarantees such as:
- “This workload must not preempt other workloads.”
- “This workload must not borrow quota from a cohort.”
- “If these conditions cannot be met, keep this workload pending.”
If borrowing or preemption is enabled at the ClusterQueue level, Kueue may eventually use them for all workloads, even when a specific workload would prefer to wait.
This makes it impossible to express per-workload hard guarantees, only queue-wide soft ordering.
This limits Kueue’s usefulness for:
- SLA-sensitive workloads
- Fairness- or isolation-critical workloads
- Budget- or quota-bound workloads
- Mixed workloads sharing the same ClusterQueue
MultiKueue limitations
MultiKueue dispatching modes (AllAtOnce, Incremental) are fundamentally race-based:
- Workloads are dispatched to multiple worker clusters.
- The first cluster to admit the workload wins.
- No comparison of placement quality is performed.
This can result in:
- A cluster that admits a workload using borrowing winning over a cluster that could admit the same workload without borrowing
- A cluster that admits a workload using preemption winning over a cluster that could admit it without requiring preemption
- Non-deterministic placement driven by control-plane timing rather than placement quality
- Unnecessary preemption on a cluster where the workload never runs, because MultiKueue nominated another cluster as the winner
These semantics break the flavor fungibility mental model across clusters.
Example
Assume a workload with no-borrowing and no-preemption constraints is dispatched to three clusters:
| Cluster | Admission Result |
|---|---|
| A | Fits without borrowing or preemption |
| B | Fits with borrowing |
| C | Fits with preemption |
Today, B or C may win simply because they respond faster.
Moreover, preemption will be triggered on cluster C regardless of the final placement, even if the workload ultimately runs on a different cluster.
Desired behavior:
- The workload should only be admitted on A
- If A is unavailable, the workload should remain pending
Proposed Direction
1. Add workload-level constraint-aware scheduling to Kueue
Extend the Workload API to support hard placement constraints, evaluated per workload.
Illustrative API examples:
    spec:
      admissionConstraints:
        requireNoBorrowing: true
        requireNoPreemption: true
Or a more expressive form:
    spec:
      placementPolicy:
        borrowing: Forbidden | Allowed
        preemption: Forbidden | Allowed
Key properties:
- Constraints are evaluated per workload
- Constraints override queue-wide capabilities
- If constraints are not satisfied, the workload remains pending
2. Surface reasoned admission rejections
Instead of only reporting admitted / not admitted, Kueue should surface structured rejection reasons: "Unsatisfied Admission Constraint due to:"
- Requires borrowing
- Requires preemption
This enables higher-level scheduling logic and MultiKueue dispatching to reason about failures.
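As a rough illustration of structured rejection reasons, the sketch below returns a list of machine-readable reasons rather than a single boolean. The reason names, the `Constraints` and `Attempt` types, and the message format are assumptions made for this example, not an existing Kueue API.

```go
package main

import (
	"fmt"
	"strings"
)

// RejectionReason is a hypothetical machine-readable reason code.
type RejectionReason string

const (
	RequiresBorrowing  RejectionReason = "RequiresBorrowing"
	RequiresPreemption RejectionReason = "RequiresPreemption"
)

// Constraints and Attempt are illustrative stand-ins for the
// workload's hard constraints and one admission attempt.
type Constraints struct{ NoBorrowing, NoPreemption bool }
type Attempt struct{ NeedsBorrowing, NeedsPreemption bool }

// Evaluate returns every constraint the attempt would violate.
// An empty result means the attempt is acceptable.
func Evaluate(c Constraints, a Attempt) []RejectionReason {
	var reasons []RejectionReason
	if c.NoBorrowing && a.NeedsBorrowing {
		reasons = append(reasons, RequiresBorrowing)
	}
	if c.NoPreemption && a.NeedsPreemption {
		reasons = append(reasons, RequiresPreemption)
	}
	return reasons
}

// Message renders the reasons in the form suggested in this issue.
func Message(reasons []RejectionReason) string {
	if len(reasons) == 0 {
		return ""
	}
	parts := make([]string, len(reasons))
	for i, r := range reasons {
		parts[i] = string(r)
	}
	return "Unsatisfied Admission Constraint due to: " + strings.Join(parts, ", ")
}

func main() {
	c := Constraints{NoBorrowing: true, NoPreemption: true}
	fmt.Println(Message(Evaluate(c, Attempt{NeedsBorrowing: true, NeedsPreemption: true})))
}
```

A dispatcher could then branch on the reason codes instead of parsing free-form condition messages.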
3. Make MultiKueue dispatching preference-aware
Once workload-level constraints exist, MultiKueue dispatching can move away from races.
Instead of “first admission wins”, MultiKueue should:
    For preference tier P1:
      Try all clusters
      If none accept:
        Move to P2
        Repeat
Preference tiers are derived from workload-level constraints, not queue defaults.
This preserves flavor fungibility semantics across clusters while respecting per-workload guarantees.
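The tiered loop can be sketched in Go using the A/B/C example from above. This is a minimal illustration under stated assumptions: `Tier`, `Cluster`, and `Dispatch` are hypothetical names, each cluster's answer is modelled as static data rather than a real admission round trip, and ties within a tier are broken by list order, which is an arbitrary choice.

```go
package main

import "fmt"

// Tier is a hypothetical preference tier: what the workload is
// willing to tolerate at this step.
type Tier struct {
	Name            string
	AllowBorrowing  bool
	AllowPreemption bool
}

// Cluster models what a worker cluster would need to do to admit
// the workload (static data here, for illustration only).
type Cluster struct {
	Name            string
	Fits            bool
	NeedsBorrowing  bool
	NeedsPreemption bool
}

// accepts reports whether the cluster can admit within the tier's limits.
func accepts(t Tier, c Cluster) bool {
	return c.Fits &&
		(t.AllowBorrowing || !c.NeedsBorrowing) &&
		(t.AllowPreemption || !c.NeedsPreemption)
}

// Dispatch tries every cluster at the current tier before moving to
// the next one. It returns the chosen cluster and tier, or ok=false
// if no tier is satisfied and the workload should stay pending.
func Dispatch(tiers []Tier, clusters []Cluster) (cluster, tier string, ok bool) {
	for _, t := range tiers {
		for _, c := range clusters {
			if accepts(t, c) {
				return c.Name, t.Name, true
			}
		}
	}
	return "", "", false
}

func main() {
	clusters := []Cluster{
		{Name: "B", Fits: true, NeedsBorrowing: true},
		{Name: "C", Fits: true, NeedsPreemption: true},
		{Name: "A", Fits: true},
	}
	// A workload with hard no-borrowing, no-preemption constraints has one tier.
	strict := []Tier{{Name: "P1"}}
	fmt.Println(Dispatch(strict, clusters))
	// Without cluster A, the same workload stays pending.
	fmt.Println(Dispatch(strict, clusters[:2]))
}
```

Because every cluster is checked at P1 before P2 is considered, the order in which clusters respond no longer decides the outcome across tiers, and no preemption is triggered on C unless a tier that allows it is actually reached.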
Benefits
- Enables per-workload hard scheduling guarantees
- Allows heterogeneous workloads to safely share ClusterQueues
- Preserves flavor fungibility semantics across clusters
- Eliminates race-based placement
- Improves determinism and placement quality
- Avoids unnecessary borrowing and preemption
- Keeps a clean separation between single-cluster admission and multi-cluster dispatching
Conclusion
This issue proposes introducing a missing scheduling primitive in Kueue: workload-level hard scheduling constraints, and extending MultiKueue to be preference-aware instead of race-based.
The guiding principle is:
The scheduler should select the best feasible placement for a given workload, not the fastest one.
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.