Description
What would you like to be added:
Support for evicting workloads that are running on Nodes which become tainted.
One use case for tainting nodes is to allow running another high priority workload on a dedicated set of nodes.
We could probably make it part of the Node Hot Swap, so that when a single node is tainted we may quickly find a replacement.
Part of the task is to determine the scope:
- do we handle only NoExecute taints, or NoSchedule taints too?
- do we apply some delay before evicting?
- do we use NodeHotSwap to find a replacement node?
Regardless of how these questions are resolved, the current behavior should be improved.
Why is this needed:
Currently, pods of such a workload get deleted (evicted) by Kubernetes core, but from Kueue's perspective the workload continues to "run", and Kueue's TAS cache still reserves capacity for it.
This prevents starting the new high priority workload quickly.
Currently, we need to wait a couple of minutes for the waitForPodsReady.recoveryTimeout to evict such a workload.
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.