This repository was archived by the owner on Apr 22, 2024. It is now read-only.

jeffreyftang

No description provided.

@jeffreyftang jeffreyftang requested a review from tgaddair February 8, 2023 22:28

linear bot commented Feb 8, 2023

PUX-912 Volcano considers PodGroups as Running prematurely

The move to the Running phase only checks for successful allocation of resources to the PodGroup:

// If there're enough allocated resource, it's running
if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
	status.Phase = scheduling.PodGroupRunning
}

This doesn't account for errors or delays (e.g., large image pulls) that can occur between pod allocation and pod startup. Because we rely on the Running status of the PodGroup as the signal that the engine is ready for use, a PodGroup that is marked Running before its pods have actually started can lead to timeouts and other failures.

This seems to just be a bug in Volcano.
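
To illustrate the gap described above, here is a minimal, self-contained sketch. It is not Volcano's code and not necessarily the change made in this PR: the `Pod` type, `PodPhase` constants, and helper functions are all hypothetical stand-ins. It only contrasts counting allocated pods (the existing check) with counting pods that have actually reached a running phase.

```go
package main

import "fmt"

// PodPhase is a simplified, hypothetical stand-in for a pod's lifecycle phase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// Pod is a hypothetical minimal view of one pod in a PodGroup.
type Pod struct {
	Name      string
	Allocated bool     // the scheduler has assigned resources to the pod
	Phase     PodPhase // the phase the pod has actually reached
}

// countAllocated mirrors the existing check: it counts pods that have
// resources assigned, whether or not their containers have started.
func countAllocated(pods []Pod) int32 {
	var n int32
	for _, p := range pods {
		if p.Allocated {
			n++
		}
	}
	return n
}

// countRunning counts only pods that have reached the Running phase.
func countRunning(pods []Pod) int32 {
	var n int32
	for _, p := range pods {
		if p.Phase == PodRunning {
			n++
		}
	}
	return n
}

// groupIsRunning is the stricter condition (an assumption, not the merged fix):
// the group counts as running only once minMember pods have started.
func groupIsRunning(pods []Pod, minMember int32) bool {
	return countRunning(pods) >= minMember
}

func main() {
	pods := []Pod{
		{Name: "worker-0", Allocated: true, Phase: PodRunning},
		{Name: "worker-1", Allocated: true, Phase: PodPending}, // e.g. still pulling a large image
		{Name: "worker-2", Allocated: true, Phase: PodRunning},
	}
	const minMember int32 = 3

	fmt.Println("allocation-based check:", countAllocated(pods) >= minMember)
	fmt.Println("running-based check:", groupIsRunning(pods, minMember))
}
```

In this example all three pods are allocated, so the allocation-based check passes, while the running-based check does not pass until `worker-1` finishes starting.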

@tgaddair tgaddair left a comment

Nice! LGTM! Could be a good one to upstream.

@jeffreyftang jeffreyftang merged commit 1b45c29 into predibase Feb 10, 2023