feat(runtimes): add validation for reserved MPI environment variables#3491
Conversation
Signed-off-by: Vishal Painjane <painjanevishal2204@gmail.com>
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds server-side validation in the MPI runtime plugin to prevent users from setting OpenMPI environment variables that the operator manages internally, avoiding misconfiguration that can break MPI worker discovery.
Changes:
- Introduces
constants.MPIReservedEnvNames(set of OpenMPI-reserved env var names). - Extends MPI plugin
Validateto rejectTrainJob.spec.trainer.enventries that use any reserved name. - Adds unit tests covering single reserved env, multiple reserved envs, and a non-reserved env case.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
pkg/runtime/framework/plugins/mpi/mpi.go |
Rejects TrainJobs that set operator-managed OpenMPI env vars in spec.trainer.env. |
pkg/runtime/framework/plugins/mpi/mpi_test.go |
Adds validation test cases for reserved and non-reserved MPI env vars. |
pkg/constants/constants.go |
Defines the MPIReservedEnvNames set used by validation. |
andreyvelich
left a comment
There was a problem hiding this comment.
Thanks for this @adiprathapa!
/ok-to-test
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Picking up #3145 per @andreyvelich's ask. The original PR was approved in approach but went stale waiting on a rebase, so this is that rebase.
The MPI plugin's
Validatefunction already catches a few things; this adds a check that rejects anyTrainJob.spec.trainer.enventry whose name is in a newMPIReservedEnvNamesset. The set lives inpkg/constants/constants.goand covers the four OpenMPI variables the operator manages internally (OMPI_MCA_orte_default_hostfile,OMPI_MCA_plm_rsh_args,OMPI_MCA_orte_keep_fqdn_hostnames,OMPI_MCA_orte_set_default_slots). Approach mirrors the Torch plugin'sTorchRunReservedEnvNamescheck.Tests in
mpi_test.gocover one reserved var, two reserved vars, and a custom var that still passes.Rebase notes: minor conflicts in
constants.go(master addedXGBoostReservedEnvNamesat the same insertion point) andmpi.go(master dropped an unusedintstrimport).go vet,gofmt, and the package tests all pass locally.Fixes #3126
Co-authored-by: Vishal Painjane painjanevishal2204@gmail.com