Description
What would you like to be added:
When a serving workload needs to move clusters (e.g., due to a GPU type change), Multi-Kueue immediately deletes the old serving workload once quota moves, causing a service gap until the new one is ready in the target cluster.
I know other long-running jobs like RayJobs might have similar concerns around service continuity.
Why is this needed:
For serving workloads, we'd ideally wait for the replacement to be "Ready" before cleaning up the old one, to maintain service availability during cross-cluster transitions. Other replacement policies could be supported if needed.
We can define a new interface:

```go
type MultiKueueAdapterWithCleanupPolicy interface {
	MultiKueueAdapter
	CleanupPolicy() CleanupPolicy
}
```

I am planning to use this interface (or a modification of it) and to extend the elastic jobs implementation to allow for what is called a replacement workload slice.
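To make the proposal concrete, here is a minimal sketch of how the optional interface could be consumed. The `CleanupPolicy` values, the `servingAdapter` type, and the `policyFor` helper are all hypothetical names for illustration; the stripped-down `MultiKueueAdapter` stands in for the real adapter interface, and the actual shape would be settled in the design doc.

```go
package main

import "fmt"

// CleanupPolicy controls when Multi-Kueue removes the workload copy
// left behind after quota moves to another cluster. (Hypothetical
// names for illustration.)
type CleanupPolicy string

const (
	// CleanupImmediate preserves today's behavior: delete the old
	// copy as soon as quota moves.
	CleanupImmediate CleanupPolicy = "Immediate"
	// CleanupAfterReplacementReady keeps the old copy serving until
	// the replacement in the target cluster reports Ready.
	CleanupAfterReplacementReady CleanupPolicy = "AfterReplacementReady"
)

// MultiKueueAdapter is a stand-in for the existing adapter interface.
type MultiKueueAdapter interface {
	Name() string
}

// MultiKueueAdapterWithCleanupPolicy is the proposed optional extension.
type MultiKueueAdapterWithCleanupPolicy interface {
	MultiKueueAdapter
	CleanupPolicy() CleanupPolicy
}

// servingAdapter is a hypothetical adapter for serving workloads that
// opts into the delayed-cleanup behavior.
type servingAdapter struct{}

func (servingAdapter) Name() string { return "serving" }
func (servingAdapter) CleanupPolicy() CleanupPolicy {
	return CleanupAfterReplacementReady
}

// policyFor type-asserts on the optional interface and falls back to
// immediate cleanup for adapters that do not implement it, so existing
// adapters keep their current behavior unchanged.
func policyFor(a MultiKueueAdapter) CleanupPolicy {
	if p, ok := a.(MultiKueueAdapterWithCleanupPolicy); ok {
		return p.CleanupPolicy()
	}
	return CleanupImmediate
}

func main() {
	fmt.Println(policyFor(servingAdapter{}))
}
```

The optional-interface pattern keeps the change backward compatible: the Multi-Kueue controller would only check for `MultiKueueAdapterWithCleanupPolicy` via a type assertion, and adapters that don't implement it get the current delete-immediately behavior.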
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.