[MultiKueue] Support Long running services #8526

@ravisantoshgudimetla

What would you like to be added:
When a serving workload needs to move between clusters (e.g., because of a GPU type change), MultiKueue deletes the old serving workload as soon as the quota moves, causing a service gap until the new one is ready in the target cluster.
I know other long-running jobs like RayJobs might have similar concerns around service continuity.

Why is this needed:
For serving workloads, we'd ideally wait for the replacement to be "Ready" before cleaning up the old one, to maintain service availability during cross-cluster transitions. Other replacement policies could be added if needed.

We can define a new interface:

type MultiKueueAdapterWithCleanupPolicy interface {
    MultiKueueAdapter
    CleanupPolicy() CleanupPolicy
}

I am planning to use this interface (or a modification of it) and extend the elasticJobs implementation to allow for what is called a replacement workload slice.
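
To make the proposal more concrete, here is a minimal sketch of how an optional cleanup-policy interface like this could be consumed. Everything beyond the interface itself is an assumption for illustration: the `CleanupPolicy` type is not defined in this issue, so the string type and its two values are hypothetical, `MultiKueueAdapter` is replaced by an empty stand-in so the snippet compiles on its own, and `servingAdapter`, `batchAdapter`, and `shouldDeleteOld` are made-up names rather than existing Kueue code.

```go
package main

import "fmt"

// CleanupPolicy is hypothetical: the issue names the type but does not define it.
type CleanupPolicy string

const (
	// CleanupImmediate mirrors the behavior described in this issue:
	// the old copy is deleted as soon as the quota moves.
	CleanupImmediate CleanupPolicy = "Immediate"
	// CleanupAfterReplacementReady (assumed name) keeps the old copy
	// until the replacement reports Ready.
	CleanupAfterReplacementReady CleanupPolicy = "AfterReplacementReady"
)

// MultiKueueAdapter is an empty stand-in for Kueue's real adapter
// interface, used only so this sketch is self-contained.
type MultiKueueAdapter interface{}

// MultiKueueAdapterWithCleanupPolicy is the interface proposed above.
type MultiKueueAdapterWithCleanupPolicy interface {
	MultiKueueAdapter
	CleanupPolicy() CleanupPolicy
}

// servingAdapter is a hypothetical adapter that opts in to delayed cleanup.
type servingAdapter struct{}

func (servingAdapter) CleanupPolicy() CleanupPolicy {
	return CleanupAfterReplacementReady
}

// batchAdapter is a hypothetical adapter that does not implement the
// optional interface and therefore keeps the default behavior.
type batchAdapter struct{}

// shouldDeleteOld reports whether the old remote copy may be deleted,
// given whether the replacement is already Ready.
func shouldDeleteOld(a MultiKueueAdapter, replacementReady bool) bool {
	policy := CleanupImmediate
	if cp, ok := a.(MultiKueueAdapterWithCleanupPolicy); ok {
		policy = cp.CleanupPolicy()
	}
	switch policy {
	case CleanupAfterReplacementReady:
		return replacementReady
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldDeleteOld(servingAdapter{}, false)) // false
	fmt.Println(shouldDeleteOld(servingAdapter{}, true))  // true
	fmt.Println(shouldDeleteOld(batchAdapter{}, false))   // true
}
```

Making the policy an optional interface discovered through a type assertion would let existing adapters keep today's behavior unchanged, while serving-style adapters opt in to waiting for the replacement.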

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Metadata

    Labels

    area/multikueue: Issues or PRs related to MultiKueue
    kind/feature: Categorizes issue or PR as related to a new feature.