feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus#2802
Conversation
Pull Request Test Coverage Report for Build 18313170282Details
💛 - Coveralls |
andreyvelich
left a comment
There was a problem hiding this comment.
Thank you for this @astefanutti!
Please can you also add the integration: https://github.com/kubeflow/trainer/blob/master/test/integration/controller/trainjob_controller_test.go#L74 and e2e test: https://github.com/kubeflow/trainer/blob/master/test/e2e/e2e_test.go#L67, where we check that JobsStatus.
| } | ||
| if len(f.jobsStatusPlugins) != 0 { | ||
| return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob) |
There was a problem hiding this comment.
@tenzen-y @astefanutti What is the use-case when we can have more than one terminalConditionPlugin or jobsStatusPlugin ?
There was a problem hiding this comment.
@andreyvelich that would be if two plugins implement the same plugin interface so they would conflict.
The check is currently performed for each call, maybe we could do that during the runtime framework initialization.
There was a problem hiding this comment.
But I am trying to understand when users can define more than 1 implementation for such plugins ?
Terminal condition or job status should always be defined by one plugin, unless I am missing something ?
There was a problem hiding this comment.
Conceptually yes, but practically it is possible to have multiple plugins implementing the same interface like framework.TerminalConditionPlugin.
We could imagine this happens if we had other backend than JobSet, e.g., LWS and both would be made active by "mistake".
There was a problem hiding this comment.
both would be made active by "mistake".
If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?
There was a problem hiding this comment.
Also, make those plugins as not arrays here:
trainer/pkg/runtime/framework/core/framework.go
Lines 48 to 49 in 389023c
There was a problem hiding this comment.
If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?
I agree, that's what I meant when I said "maybe we could do that during the runtime framework initialization.".
There was a problem hiding this comment.
@andreyvelich I've updated the logic for those "singleton" plugins so it fails-fast on at init time.
@andreyvelich I've added/updated integration and e2e tests. PTAL when you can. |
| } | ||
| if len(f.jobsStatusPlugins) != 0 { | ||
| return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob) |
There was a problem hiding this comment.
Also, make those plugins as not arrays here:
trainer/pkg/runtime/framework/core/framework.go
Lines 48 to 49 in 389023c
| Name: status.Name, | ||
| Ready: status.Ready, | ||
| Succeeded: status.Succeeded, | ||
| Failed: status.Failed, | ||
| Active: status.Active, | ||
| Suspended: status.Suspended, |
There was a problem hiding this comment.
I am wondering whether we will introduce a breaking change after this issue: kubernetes-sigs/jobset#723 ?
I proposed that we don't use Succeeded as Job condition in rJob.
Any thoughts @astefanutti @tenzen-y @kannon92 @kubeflow/kubeflow-trainer-team ?
There was a problem hiding this comment.
I agree this would be better to be consistent across projects, thought I'd suggest not to hold this PR as it'll be a breaking change anyway and have it addressed separately so we can clearly "release-note" it as a breaking change. WDYT?
There was a problem hiding this comment.
As we discussed offline in Slack, we will use Succeeded condition for now.
We will change this API to Complete in a future version, once JobSet migrates its APIs.
c25b925 to
3e4fc1a
Compare
|
/retest |
|
/milestone v2.1 |
| errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||
| errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered") |
There was a problem hiding this comment.
Since we can only register one plugin for TerminalCondition and JobsStatus, do we need these errors ?
There was a problem hiding this comment.
That's actually the error that's returned at init time in case more than one plugin is being registered.
There was a problem hiding this comment.
But it won't be possible since we initialize the Framework here:
trainer/pkg/runtime/framework/core/framework.go
Lines 53 to 55 in 848a4bd
So this condition will never be executed:
trainer/pkg/runtime/framework/core/framework.go
Lines 82 to 85 in 848a4bd
Am I missing something ?
There was a problem hiding this comment.
Right and that's where the errors are returned.
There was a problem hiding this comment.
@astefanutti Can you explain the flow when f.terminalConditionPlugin != nil when we execute framework's New() function ?
There was a problem hiding this comment.
@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).
I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.
There was a problem hiding this comment.
@tenzen-y I was asking what does "Plugin" mean in the Extension Framework from your point of view ?
And how are we going to allow users to extend the framework with their own plugins ?
There was a problem hiding this comment.
Ah, got it. The trainer provides the 2 level abstraction layers which is Runtime Framework and Plugin.
But, sometimes, the plugin is only for a specific Runtime Framework like the JobSet plugin.
I guess that we might want to have mechanisms to represent the tie between the Runtime Framework and the Plugin, but it would be better to avoid adding any validation relationship to allow them to reuse the implemented Plugin wherever the Runtime Framework.
Anyway, we can consider that as an another enhancement.
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
| errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||
| errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered") |
There was a problem hiding this comment.
@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).
I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | ||
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) |
There was a problem hiding this comment.
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | |
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | |
| UnderlyingJobState(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, []trainer.JobStatus, error) |
I think that TerminalCondition and JobsStatus responsibilities are almost same which is just propagate the Job state from underlying one to TrainJob.
There was a problem hiding this comment.
Make sense for me to consolidate those two plugins together.
There was a problem hiding this comment.
I've folded the two plugins together. PTAL.
| @@ -65,3 +65,8 @@ type TerminalConditionPlugin interface { | |||
| Plugin | |||
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | |||
| } | |||
|
|
|||
| type JobsStatusPlugin interface { | |||
| Plugin | |||
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | |||
| } | |||
There was a problem hiding this comment.
ditto related to consolidation.
There was a problem hiding this comment.
Updated with the two plugins together. PTAL.
| // TODO (tenzen-y): Once we provide the Configuration API, we should validate which plugin should have terminalCondition execution points. | ||
| if len(f.terminalConditionPlugins) > 1 { |
There was a problem hiding this comment.
Any reason why we can remove the verification of the number of plugins?
There was a problem hiding this comment.
As we discussed here, users won't be able to register more than one plugin for terminal condition: #2802 (comment)
There was a problem hiding this comment.
Yes, the check is now moved at initialization, rather than at (first) execution.
There was a problem hiding this comment.
Ah, right. The new way makes more sense.
Thank you!
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
tenzen-y
left a comment
There was a problem hiding this comment.
LGTM
Thank you!
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ubeflow#2802) * feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Add integration tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update e2e tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Remove extra check Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Sort JobsStatus in e2e tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix e2e test for MPI job Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fail-fast when multiple terminal condition and JobsStatus plugins exist Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fold TerminalCondition and JobsStatus plugins Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…ubeflow#2802) * feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Add integration tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update e2e tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Remove extra check Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Sort JobsStatus in e2e tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix e2e test for MPI job Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fail-fast when multiple terminal condition and JobsStatus plugins exist Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fold TerminalCondition and JobsStatus plugins Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
What this PR does / why we need it:
This PR adds the JobsStatusPlugin runtime interface and implements it in the JobSet plugin so the TrainJob JobsStatus is synced from JobSet ReplicatedJobsStatus.
/cc @kubeflow/kubeflow-trainer-team @toVersus
Checklist: