TAS: evict workloads which are running on nodes which become tainted #8828

@mimowo

Description

What would you like to be added:

Support for evicting workloads that run on Nodes which become tainted.

One use case for tainting nodes is to free up a dedicated set of nodes so that another, high-priority workload can run on them.

We could probably make this part of Node Hot Swap, so that when a single node is tainted we can quickly find a replacement.

Part of the task is to determine the scope:

  • do we handle only the NoExecute taint effect, or NoSchedule too?
  • do we apply a delay before evicting?
  • do we build on Node Hot Swap?

Despite the open questions, the current behavior clearly needs to improve.

Why is this needed:

Currently, the Pods of such a workload get deleted (evicted) by core Kubernetes, but the Workload continues to "run", and Kueue's TAS cache still reserves capacity for it.

This prevents the new high-priority workload from starting quickly.

Currently, we need to wait a couple of minutes, until waitForPodsReady.recoveryTimeout elapses, before such a workload is evicted.
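For reference, the knob mentioned above lives in Kueue's Configuration object; a minimal fragment might look like the following (the durations are illustrative, and the exact field set assumes the v1beta1 Configuration API):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m
  # recoveryTimeout bounds how long an admitted Workload may stay in a
  # "pods not ready" state (e.g. after taint-based pod deletions)
  # before Kueue evicts it.
  recoveryTimeout: 3m
```

Taint-driven eviction would let Kueue react as soon as the taint appears, instead of waiting out this timeout.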

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
