Workload Scheduling Constraints (Workload-Level) and Preference-Aware MultiKueue Dispatching #8729

@ichekrygin

What would you like to be added:

Kueue today supports preference-based placement via flavor fungibility, but these preferences are soft-only and cannot be expressed as hard scheduling constraints. In addition, MultiKueue dispatching strategies (AllAtOnce, Incremental) are race-based, meaning the first worker cluster that admits a workload wins, regardless of whether that placement is optimal.

This issue proposes extending Kueue with workload-level scheduling constraints and updating MultiKueue dispatching to be preference-aware rather than timing-driven.

Scheduling constraints must be specified per workload, at the workload level.

They should not be global defaults and should not be ClusterQueue-wide policies.
The intent is to allow different workloads sharing the same ClusterQueue to express different scheduling requirements.

This mirrors Kubernetes design patterns, where constraints are typically attached to the object being scheduled (e.g., Pods), not to the scheduler or queue globally.

Why is this needed:

Problem

Single-cluster limitations

Currently, users cannot express strict workload-specific guarantees such as:

  • “This workload must not preempt other workloads.”
  • “This workload must not borrow quota from a cohort.”
  • “If these conditions cannot be met, keep this workload pending.”

If borrowing or preemption is enabled at the ClusterQueue level, Kueue may eventually use them for all workloads, even when a specific workload would prefer to wait.

As a result, users can express only queue-wide soft ordering, not per-workload hard guarantees.

This limits Kueue’s usefulness for:

  • SLA-sensitive workloads
  • Fairness- or isolation-critical workloads
  • Budget- or quota-bound workloads
  • Mixed workloads sharing the same ClusterQueue

MultiKueue limitations

MultiKueue dispatching modes (AllAtOnce, Incremental) are fundamentally race-based:

  • Workloads are dispatched to multiple worker clusters.
  • The first cluster to admit the workload wins.
  • No comparison of placement quality is performed.

This can result in:

  • A cluster that admits a workload using borrowing winning over a cluster that could admit the same workload without borrowing
  • A cluster that admits a workload using preemption winning over a cluster that could admit it without requiring preemption
  • Non-deterministic placement driven by control-plane timing rather than placement quality
  • Unnecessary preemption on a cluster where the workload never runs, because MultiKueue nominated another cluster as the winner

These semantics break the flavor fungibility mental model across clusters.

Example

Assume a workload with no-borrowing and no-preemption constraints is dispatched to three clusters:

Cluster   Admission result
A         Fits without borrowing or preemption
B         Fits with borrowing
C         Fits with preemption

Today, B or C may win simply because they respond faster.

Moreover, preemption will be triggered on cluster C even if the workload ultimately runs on a different cluster.

Desired behavior:

  • The workload should only be admitted on A
  • If A is unavailable, the workload should remain pending

Proposed Direction

1. Add workload-level constraint-aware scheduling to Kueue

Extend the Workload API to support hard placement constraints, evaluated per workload.

Illustrative API examples:

spec:
  admissionConstraints:
    requireNoBorrowing: true
    requireNoPreemption: true

Or a more expressive form:

spec:
  placementPolicy:
    borrowing: Forbidden    # one of: Forbidden | Allowed
    preemption: Forbidden   # one of: Forbidden | Allowed

Key properties:

  • Constraints are evaluated per workload
  • Constraints override queue-wide capabilities
  • If constraints are not satisfied, the workload remains pending
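
To make the override property concrete, here is a minimal Go sketch of how a per-workload policy could narrow what the ClusterQueue permits. All type and function names (Mode, PlacementPolicy, QueueCapabilities, Attempt, CanAdmit) are hypothetical and chosen for illustration; none of them are existing Kueue APIs. The sketch assumes an unset Mode behaves like Allowed.

```go
package main

import "fmt"

// Mode mirrors the illustrative "Forbidden | Allowed" values (hypothetical type).
type Mode string

const (
	Allowed   Mode = "Allowed"
	Forbidden Mode = "Forbidden"
)

// PlacementPolicy is a hypothetical workload-level policy, not a Kueue type.
type PlacementPolicy struct {
	Borrowing  Mode
	Preemption Mode
}

// QueueCapabilities models what the ClusterQueue enables (assumed shape).
type QueueCapabilities struct {
	BorrowingEnabled  bool
	PreemptionEnabled bool
}

// Attempt describes what a candidate admission would require.
type Attempt struct {
	NeedsBorrowing  bool
	NeedsPreemption bool
}

// CanAdmit reports whether the attempt is permitted. A mechanism is usable
// only if the queue enables it and the workload does not forbid it, so the
// workload-level constraint overrides the queue-wide capability.
func CanAdmit(q QueueCapabilities, p PlacementPolicy, a Attempt) bool {
	if a.NeedsBorrowing && (!q.BorrowingEnabled || p.Borrowing == Forbidden) {
		return false
	}
	if a.NeedsPreemption && (!q.PreemptionEnabled || p.Preemption == Forbidden) {
		return false
	}
	return true
}

func main() {
	q := QueueCapabilities{BorrowingEnabled: true, PreemptionEnabled: true}
	strict := PlacementPolicy{Borrowing: Forbidden, Preemption: Forbidden}

	// Fits within nominal quota: admitted.
	fmt.Println(CanAdmit(q, strict, Attempt{}))
	// Would need borrowing: stays pending even though the queue allows it.
	fmt.Println(CanAdmit(q, strict, Attempt{NeedsBorrowing: true}))
}
```

A workload sharing the same queue but carrying an empty policy would still be admitted through borrowing or preemption, which is the heterogeneous-sharing case this proposal targets.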

2. Surface reasoned admission rejections

Instead of reporting only admitted / not admitted, Kueue should surface structured rejection reasons, for example "Unsatisfied Admission Constraint" due to:

  • Requires borrowing
  • Requires preemption

This enables higher-level scheduling logic and MultiKueue dispatching to reason about failures.
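
As a rough illustration of what "structured" could mean here, the following Go sketch returns machine-readable reasons instead of a bare boolean. The Reason values, the Constraints fields (modelled on the requireNoBorrowing / requireNoPreemption example above), and UnsatisfiedReasons are all hypothetical names, not part of Kueue today.

```go
package main

import "fmt"

// Reason is a hypothetical machine-readable rejection reason.
type Reason string

const (
	RequiresBorrowing  Reason = "RequiresBorrowing"
	RequiresPreemption Reason = "RequiresPreemption"
)

// Constraints mirrors the illustrative admissionConstraints fields (assumed shape).
type Constraints struct {
	RequireNoBorrowing  bool
	RequireNoPreemption bool
}

// Attempt describes what a candidate admission would require.
type Attempt struct {
	NeedsBorrowing  bool
	NeedsPreemption bool
}

// UnsatisfiedReasons lists every constraint the attempt would violate.
// An empty result means the attempt satisfies all workload constraints.
func UnsatisfiedReasons(c Constraints, a Attempt) []Reason {
	var reasons []Reason
	if c.RequireNoBorrowing && a.NeedsBorrowing {
		reasons = append(reasons, RequiresBorrowing)
	}
	if c.RequireNoPreemption && a.NeedsPreemption {
		reasons = append(reasons, RequiresPreemption)
	}
	return reasons
}

func main() {
	c := Constraints{RequireNoBorrowing: true, RequireNoPreemption: true}
	a := Attempt{NeedsBorrowing: true, NeedsPreemption: true}
	for _, r := range UnsatisfiedReasons(c, a) {
		fmt.Println("Unsatisfied Admission Constraint due to:", r)
	}
}
```

A dispatcher could inspect such a list to distinguish "this cluster can never satisfy the constraints" from "this cluster has no capacity right now", rather than treating every non-admission the same way.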

3. Make MultiKueue dispatching preference-aware

Once workload-level constraints exist, MultiKueue dispatching can move away from races.

Instead of “first admission wins”, MultiKueue should:

For preference tier P1:
  Try all clusters
If none accept:
  Move to P2
Repeat

Preference tiers are derived from workload-level constraints, not queue defaults.

This preserves flavor fungibility semantics across clusters while respecting per-workload guarantees.
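
The tier loop above could look roughly like the following Go sketch. It assumes each worker cluster can report what an admission would require without actually borrowing or preempting, and that ties within a tier are broken by list order; both are assumptions made for illustration. Tier, ClusterOffer, and Dispatch are hypothetical names, not MultiKueue APIs.

```go
package main

import "fmt"

// Tier is a hypothetical preference tier derived from workload constraints.
type Tier struct {
	Name            string
	AllowBorrowing  bool
	AllowPreemption bool
}

// ClusterOffer is what a worker cluster reports it would need to admit the
// workload (assumed to be reported without side effects).
type ClusterOffer struct {
	Name            string
	Fits            bool
	NeedsBorrowing  bool
	NeedsPreemption bool
}

// accepts reports whether the offer is acceptable within the given tier.
func accepts(t Tier, o ClusterOffer) bool {
	if !o.Fits {
		return false
	}
	if o.NeedsBorrowing && !t.AllowBorrowing {
		return false
	}
	if o.NeedsPreemption && !t.AllowPreemption {
		return false
	}
	return true
}

// Dispatch tries every cluster for tier P1 before moving to P2, and so on.
// ok is false when no tier accepts any cluster, meaning the workload stays pending.
func Dispatch(tiers []Tier, offers []ClusterOffer) (cluster, tier string, ok bool) {
	for _, t := range tiers {
		for _, o := range offers {
			if accepts(t, o) {
				return o.Name, t.Name, true
			}
		}
	}
	return "", "", false
}

func main() {
	// The three clusters from the example, deliberately listed with A last.
	offers := []ClusterOffer{
		{Name: "C", Fits: true, NeedsPreemption: true},
		{Name: "B", Fits: true, NeedsBorrowing: true},
		{Name: "A", Fits: true},
	}
	// Hard no-borrowing, no-preemption constraints yield a single tier.
	strict := []Tier{{Name: "P1"}}
	fmt.Println(Dispatch(strict, offers))
}
```

With the single strict tier, only cluster A qualifies regardless of response order, and if A cannot fit the workload the function reports no placement. A workload that tolerates borrowing would add a second tier with AllowBorrowing set, which is tried only after every cluster has been rejected for P1.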

Benefits

  • Enables per-workload hard scheduling guarantees
  • Allows heterogeneous workloads to safely share ClusterQueues
  • Preserves flavor fungibility semantics across clusters
  • Eliminates race-based placement
  • Improves determinism and placement quality
  • Avoids unnecessary borrowing and preemption
  • Keeps a clean separation between single-cluster admission and multi-cluster dispatching

Conclusion

This issue proposes introducing a missing scheduling primitive in Kueue: workload-level hard scheduling constraints, and extending MultiKueue to be preference-aware instead of race-based.

The guiding principle is:

The scheduler should select the best feasible placement for a given workload, not the fastest one.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
