feat(runtimes): implement clusterTrainingRuntime deprecation process#2791
Conversation
c5b0011 to
2a059ec
Compare
| LabelDeprecated string = "trainer.kubeflow.org/deprecated" | ||
|
|
||
| // DeprecatedTrueValue is the value indicating deprecation when used with LabelDeprecated. | ||
| DeprecatedTrueValue string = "true" |
There was a problem hiding this comment.
Almost a nit, but I wonder if this should be treated as a "flag" and only check for the label existence not its value. I don't think trainer.kubeflow.org/deprecated=false really makes sense in practice.
And to be sure that's future proof, are we foreseeing the need for other values in the lifecycle of runtimes like alpha, stable ...
There was a problem hiding this comment.
Sure, that make sense, but do you prefer that we use other label to identify lifecycle of a runtime (e.g. alpha, stable, deprecated) ?
There was a problem hiding this comment.
If we think it's the right way to go, yes I was thinking something like trainer.kubeflow.org/support="alpha|stable|deprecated"
There was a problem hiding this comment.
Thank you @astefanutti and @andreyvelich for the thoughtful input.
I also think that updating the label to something like trainer.kubeflow.org/support or trainer.kubeflow.org/lifecycle would be a more future-proof approach, as it better conveys the runtime’s lifecycle stage.
(Once we reach a consensus on the final label name, I’ll go ahead and update the PR accordingly and also open a separate PR to adjust the related documentation.)
There was a problem hiding this comment.
Do we see any advantages to introduce alpha? Official CTRs maintained by us are CustomResource, not CRD, which needs to maintain the API and display maturity level to users. And I think, the maturity of CRs is determined by CRD, since we always make the optimal design decision for the CR resource based on the APIs provided by CRD.
Given these facts, I think trainer.kubeflow.org/deprecated=true is enough. If CTRs does not have this label or have its value set to true, we consider it having the same maturity level with their CRD, ClusterTrainingRuntime. WDYT?
There was a problem hiding this comment.
I would distinguish maturity of CRD APIs and maturity of supported runtimes, since the CRD might be in the v1 stage, but the runtime is unstable and can introduce breaking changes. For example, we can decide to remove some packages from the runtime (e.g. transformers).
However, we can end up with never introducing the maturity level for the runtimes, if we don't see strong signal from users.
All in all, my suggestion is to keep trainer.kubeflow.org/support=deprecated label for now to make sure we can introduce other support levels in the future if that is required.
@tenzen-y Any thoughts ?
There was a problem hiding this comment.
I would also differentiate the maturity of the CRD APIs from that of the runtimes.
For the maturity levels of the runtimes, it's likely that downstream projects / companies do not use the upstream runtimes directly but productize their own runtimes and can define those levels according to their own policies.
So I'd also suggest to "standardize" the trainer.kubeflow.org/support label without enforcing values and let those projects / companies the freedom to set it freely.
For upstream we can start with trainer.kubeflow.org/support=deprecated.
There was a problem hiding this comment.
+1 on trainer.kubeflow.org/support=deprecated
There was a problem hiding this comment.
SGTM. Let's use trainer.kubeflow.org/support=deprecated
andreyvelich
left a comment
There was a problem hiding this comment.
Thank you @tdn21!
/ok-to-test
| LabelDeprecated string = "trainer.kubeflow.org/deprecated" | ||
|
|
||
| // DeprecatedTrueValue is the value indicating deprecation when used with LabelDeprecated. | ||
| DeprecatedTrueValue string = "true" |
There was a problem hiding this comment.
Sure, that make sense, but do you prefer that we use other label to identify lifecycle of a runtime (e.g. alpha, stable, deprecated) ?
| testingutil "github.com/kubeflow/trainer/v2/pkg/util/testing" | ||
| ) | ||
|
|
||
| func TestClusterTrainingRuntime_ValidateCreate_DeprecatedWarning(t *testing.T) { |
There was a problem hiding this comment.
@tdn21 @astefanutti @tenzen-y @Electronic-Waste Do we prefer to create unit or integration test to verify it ?
https://github.com/kubeflow/trainer/blob/2a059ec3c93d3a09210a44d74677a53600576787/test/integration/webhooks/clustertrainingruntime_webhook_test.go
Maybe we should have both like for TrainJob webhook ?
There was a problem hiding this comment.
Agree, having both would be better. Let me add integration test as well.
There was a problem hiding this comment.
Thanks a lot for the suggestion @andreyvelich. I have added the integration tests.
Just a quick note:
- In the integration tests, we aren’t asserting warnings since controller-runtime’s client doesn’t expose admission warnings. These tests only validate that resource creation succeeds and isn’t blocked.
- The actual warning text is validated in the unit tests.
There was a problem hiding this comment.
Agree, having both would be better. Let me add integration test as well.
I don't think so. The webhook integration test for verification webhook behavior and marker validatiion.
The only UTs should be better.
There was a problem hiding this comment.
+1 on UTs. That will be enough:)
There was a problem hiding this comment.
Sounds good, thank you! 🙇
I've removed earlier added integration tests. (62f7931)
Pull Request Test Coverage Report for Build 18546363197Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
| testingutil "github.com/kubeflow/trainer/v2/pkg/util/testing" | ||
| ) | ||
|
|
||
| func TestClusterTrainingRuntime_ValidateCreate_DeprecatedWarning(t *testing.T) { |
There was a problem hiding this comment.
| func TestClusterTrainingRuntime_ValidateCreate_DeprecatedWarning(t *testing.T) { | |
| func ValidateCreate(t *testing.T) { |
The UTs should be for the top level exposed function. We should avoid implementing UTs each for action.
There was a problem hiding this comment.
@tdn21 Splitting with _ is python coding styling. In golang, we use camel style. And it will also be better to use a tidy and short name:)
| _, ctx := ktesting.NewTestContext(t) | ||
| ctx, cancel := context.WithCancel(ctx) | ||
| t.Cleanup(cancel) | ||
|
|
||
| obj := testingutil.MakeClusterTrainingRuntimeWrapper("test-runtime"). | ||
| RuntimeSpec(trainer.TrainingRuntimeSpec{ | ||
| Template: trainer.JobSetTemplateSpec{ | ||
| Spec: func() jobsetv1alpha2.JobSetSpec { | ||
| js := testingutil.MakeJobSetWrapper("", "") | ||
| js.Replicas(1, constants.DatasetInitializer, constants.ModelInitializer, constants.Node) | ||
| return js.Obj().Spec | ||
| }(), | ||
| }, | ||
| }).Obj() | ||
|
|
||
| if obj.Labels == nil { | ||
| obj.Labels = map[string]string{} | ||
| } | ||
| obj.Labels[constants.LabelDeprecated] = constants.DeprecatedTrueValue | ||
|
|
||
| validator := &ClusterTrainingRuntimeWebhook{} | ||
| warnings, err := validator.ValidateCreate(ctx, obj) | ||
| if err != nil { | ||
| t.Fatalf("unexpected error: %v", err) | ||
| } | ||
| var want admission.Warnings | ||
| want = append(want, "ClusterTrainingRuntime \"test-runtime\" is deprecated and will be removed in a future release of Kubeflow Trainer. See runtime deprecation policy: "+constants.RuntimeDeprecationPolicyURL) | ||
| if diff := cmp.Diff(want, warnings); diff != "" { | ||
| t.Fatalf("unexpected warnings (-want, +got): %s", diff) | ||
| } |
There was a problem hiding this comment.
It would be better to follow the another test case.
| Obj(), | ||
| wantError: nil, | ||
| wantWarnings: admission.Warnings{"Referenced ClusterTrainingRuntime \"test-runtime\" is deprecated and will be removed in a future release of Kubeflow Trainer. See runtime deprecation policy: " + constants.RuntimeDeprecationPolicyURL}, | ||
| deprecate: true, |
There was a problem hiding this comment.
| deprecate: true, | |
| clusterTrainingRuntime: ... |
Handling test data by the boolean is not good way. For more descriptive approach, we shoul define the clusterTrainingRuntime object.
| var warnings admission.Warnings | ||
| if clusterTrainingRuntime.Labels != nil { | ||
| if val, ok := clusterTrainingRuntime.Labels[constants.LabelDeprecated]; ok && val == constants.DeprecatedTrueValue { | ||
| warnings = append(warnings, fmt.Sprintf( | ||
| "Referenced ClusterTrainingRuntime \"%s\" is deprecated and will be removed in a future release of Kubeflow Trainer. See runtime deprecation policy: %s", | ||
| clusterTrainingRuntime.Name, | ||
| constants.RuntimeDeprecationPolicyURL, | ||
| )) | ||
| } | ||
| } |
There was a problem hiding this comment.
I don't think that we should have these duplicate validations.
There was a problem hiding this comment.
Thank you so much @tenzen-y and @Electronic-Waste for your feedback. 🙇
Could you please help me better understand what we mean by "duplicate validations" here?
Currently, we warn in two different scenarios:
- when creating a deprecated
ClusterTrainingRuntime(via webhook), and - when a
TrainJobreferences a deprecatedClusterTrainingRuntime(runtime validation via theTrainJobwebhook).
Since these are separate user touchpoints, I feel it makes sense to keep both. That said, I completely agree we should deduplicate the code by using a helper. (added it here)
There was a problem hiding this comment.
I think the "duplicate validations" means that you implement validation in both runtime plugin and webhook. Could we only examine the whether clustertrainingruntime is deprecated in webhook? Maybe that's enough:)
There was a problem hiding this comment.
My understanding is the warnings should only be emitted for TrainJobs that reference deprecated training runtimes, not for the creation of "deprecated" training runtimes themselves (which somewhat does not make much sense).
In that case, only the logic in the runtime is necessary, and the validation in the ClusterTrainingRuntime webhook is not necessary.
Electronic-Waste
left a comment
There was a problem hiding this comment.
@tdn21 Thanks for this. Can you follow up with this PR?
| LabelDeprecated string = "trainer.kubeflow.org/deprecated" | ||
|
|
||
| // DeprecatedTrueValue is the value indicating deprecation when used with LabelDeprecated. | ||
| DeprecatedTrueValue string = "true" |
There was a problem hiding this comment.
SGTM. Let's use trainer.kubeflow.org/support=deprecated
| var warnings admission.Warnings | ||
| if clusterTrainingRuntime.Labels != nil { | ||
| if val, ok := clusterTrainingRuntime.Labels[constants.LabelDeprecated]; ok && val == constants.DeprecatedTrueValue { | ||
| warnings = append(warnings, fmt.Sprintf( | ||
| "Referenced ClusterTrainingRuntime \"%s\" is deprecated and will be removed in a future release of Kubeflow Trainer. See runtime deprecation policy: %s", | ||
| clusterTrainingRuntime.Name, | ||
| constants.RuntimeDeprecationPolicyURL, | ||
| )) | ||
| } | ||
| } |
| testingutil "github.com/kubeflow/trainer/v2/pkg/util/testing" | ||
| ) | ||
|
|
||
| func TestClusterTrainingRuntime_ValidateCreate_DeprecatedWarning(t *testing.T) { |
There was a problem hiding this comment.
+1 on UTs. That will be enough:)
| testingutil "github.com/kubeflow/trainer/v2/pkg/util/testing" | ||
| ) | ||
|
|
||
| func TestClusterTrainingRuntime_ValidateCreate_DeprecatedWarning(t *testing.T) { |
There was a problem hiding this comment.
@tdn21 Splitting with _ is python coding styling. In golang, we use camel style. And it will also be better to use a tidy and short name:)
7439fc6 to
c880b0c
Compare
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
…ce constant Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
…ainingRuntime Create ClusterTrainingRuntime with deprecated label; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
…terTrainingRuntime Mark ClusterTrainingRuntime deprecated; create TrainJob referencing it; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
…lidateObjects Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
… webhooks Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
…on check Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
c880b0c to
c74ae97
Compare
| var warnings admission.Warnings | ||
| if labelutil.IsSupportDeprecated(clTrainingRuntime.Labels) { | ||
| warnings = append(warnings, fmt.Sprintf( | ||
| "ClusterTrainingRuntime \"%s\" is deprecated and will be removed in a future release of Kubeflow Trainer. See runtime deprecation policy: %s", | ||
| clTrainingRuntime.Name, | ||
| constants.RuntimeDeprecationPolicyURL, | ||
| )) | ||
| } | ||
| return warnings, validateReplicatedJobs(clTrainingRuntime.Spec.Template.Spec.ReplicatedJobs).ToAggregate() |
There was a problem hiding this comment.
This is not necessary. Warnings should only be emitted for TrainJobs referencing deprecated runtimes, not for the creation of ClusterTrainingRuntimes themselves.
There was a problem hiding this comment.
@astefanutti According to our policy, we also show an error when user creates a deprecated Runtime: https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#runtime-deprecation-policy
Do you prefer that we don't do this ?
There was a problem hiding this comment.
@andreyvelich ah you're right, I missed this. That makes sense.
| @@ -0,0 +1,12 @@ | |||
| package labels | |||
There was a problem hiding this comment.
Maybe:
| package labels | |
| package support |
There was a problem hiding this comment.
I would prefer we call this package as runtime, similar to trainjob utils: https://github.com/kubeflow/trainer/blob/master/pkg/util/trainjob/trainjob.go
|
@tdn21 Did you get a chance to address the remaining comments ? |
|
ping @tdn21, we are planning to release |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
@astefanutti @Electronic-Waste @tdn21 @tenzen-y I've updated the package according to this comment: #2791 (comment) |
|
/lgtm Thanks @tdn21 and everyone! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ubeflow#2791) * constants: add deprecation label constants for runtimes Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * webhooks: warn on creation of deprecated ClusterTrainingRuntime Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * runtime: warn when TrainJob references deprecated ClusterTrainingRuntime Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests: add unit test coverage for deprecation warnings Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * webhooks/runtime: update deprecation warning wording and tests Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * constants/webhooks/runtime: centralize runtime policy URL and reference constant Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration/webhooks): add create case for deprecated ClusterTrainingRuntime Create ClusterTrainingRuntime with deprecated label; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration/webhooks): add TrainJob create with deprecated ClusterTrainingRuntime Mark ClusterTrainingRuntime deprecated; create TrainJob referencing it; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests/runtime: cover deprecation warning in ClusterTrainingRuntime.ValidateObjects Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration): remove deprecation-warning cases for CTR/TrainJob webhooks Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * feat(runtimes): switch to support=deprecated and centralize deprecation check Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(webhooks): focus UTs on entrypoints and clean up tests/test data Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * Rename package from labels to trainingruntime Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ubeflow#2791) * constants: add deprecation label constants for runtimes Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * webhooks: warn on creation of deprecated ClusterTrainingRuntime Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * runtime: warn when TrainJob references deprecated ClusterTrainingRuntime Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests: add unit test coverage for deprecation warnings Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * webhooks/runtime: update deprecation warning wording and tests Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * constants/webhooks/runtime: centralize runtime policy URL and reference constant Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration/webhooks): add create case for deprecated ClusterTrainingRuntime Create ClusterTrainingRuntime with deprecated label; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration/webhooks): add TrainJob create with deprecated ClusterTrainingRuntime Mark ClusterTrainingRuntime deprecated; create TrainJob referencing it; assert creation succeeds (warning is non-blocking) Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests/runtime: cover deprecation warning in ClusterTrainingRuntime.ValidateObjects Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(integration): remove deprecation-warning cases for CTR/TrainJob webhooks Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * feat(runtimes): switch to support=deprecated and centralize deprecation check Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * tests(webhooks): focus UTs on entrypoints and clean up tests/test data Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> * Rename package from labels to trainingruntime Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Tarun Duhan <itarunduhan@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
What this PR does
Implements the ClusterTrainingRuntime deprecation policy by surfacing admission warnings when:
ClusterTrainingRuntimelabeledtrainer.kubeflow.org/support=deprecated.TrainJobthat references a deprecatedClusterTrainingRuntime.Why we need it
This makes users aware of deprecated runtimes at the time of creating ClusterTrainingRuntimes or TrainJobs per the runtime deprecation policy:
https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#runtime-deprecation-policy
Which issue(s) this PR fixes:
Fixes #2789
Checklist:
Notes