Description
What would you like to be added:
When a serving workload needs to move clusters (e.g., due to a GPU type change), Multi-Kueue immediately deletes the old serving workload once quota moves, causing a service gap until the new one is ready in the target cluster.
I know other long-running jobs like RayJobs might have similar concerns around service continuity.
Why is this needed:
For serving workloads, we'd ideally wait for the replacement to be "Ready" before cleaning up the old one, to maintain service availability during cross-cluster transitions. Other replacement policies could be supported if needed.
We can define a new interface:

```go
type MultiKueueAdapterWithCleanupPolicy interface {
	MultiKueueAdapter
	CleanupPolicy() CleanupPolicy
}
```

I am planning to use this interface (or a modification of it) and to extend the elastic jobs implementation to allow for what is called a replacement workload slice.
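To make the proposal concrete, here is a minimal sketch of how the optional interface could be consumed. The `CleanupPolicy` values, the `servingAdapter` type, and the `policyFor` helper are all hypothetical names for illustration; the stripped-down `MultiKueueAdapter` stands in for the real adapter interface, and the actual shape would be settled in the design doc.

```go
package main

import "fmt"

// CleanupPolicy controls when Multi-Kueue removes the workload copy
// left behind after quota moves to another cluster. (Hypothetical
// names for illustration.)
type CleanupPolicy string

const (
	// CleanupImmediate preserves today's behavior: delete the old
	// copy as soon as quota moves.
	CleanupImmediate CleanupPolicy = "Immediate"
	// CleanupAfterReplacementReady keeps the old copy serving until
	// the replacement in the target cluster reports Ready.
	CleanupAfterReplacementReady CleanupPolicy = "AfterReplacementReady"
)

// MultiKueueAdapter is a stand-in for the existing adapter interface.
type MultiKueueAdapter interface {
	Name() string
}

// MultiKueueAdapterWithCleanupPolicy is the proposed optional extension.
type MultiKueueAdapterWithCleanupPolicy interface {
	MultiKueueAdapter
	CleanupPolicy() CleanupPolicy
}

// servingAdapter is a hypothetical adapter for serving workloads that
// opts into the delayed-cleanup behavior.
type servingAdapter struct{}

func (servingAdapter) Name() string { return "serving" }
func (servingAdapter) CleanupPolicy() CleanupPolicy {
	return CleanupAfterReplacementReady
}

// policyFor type-asserts on the optional interface and falls back to
// immediate cleanup for adapters that do not implement it, so existing
// adapters keep their current behavior unchanged.
func policyFor(a MultiKueueAdapter) CleanupPolicy {
	if p, ok := a.(MultiKueueAdapterWithCleanupPolicy); ok {
		return p.CleanupPolicy()
	}
	return CleanupImmediate
}

func main() {
	fmt.Println(policyFor(servingAdapter{}))
}
```

The optional-interface pattern keeps the change backward compatible: the Multi-Kueue controller would only check for `MultiKueueAdapterWithCleanupPolicy` via a type assertion, and adapters that don't implement it get the current delete-immediately behavior.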
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.