-
Notifications
You must be signed in to change notification settings - Fork 903
Move help text output regarding PSM2_CUDA environment variable #4323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can one of the admins verify this patch? |
test this please (this phrase will trigger our CI full testing) |
bot:ibm:pgi:retest |
@@ -334,6 +312,10 @@ ompi_mtl_psm2_component_query(mca_base_module_t **module, int *priority) | |||
static int | |||
ompi_mtl_psm2_component_close(void) | |||
{ | |||
#if OPAL_CUDA_SUPPORT | |||
if (cuda_envvar_set) | |||
opal_unsetenv("PSM2_CUDA", &environ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: please use {}
, even for 1-line blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this
* to set it. | ||
*/ | ||
cuda_env = getenv("PSM2_CUDA"); | ||
if (!cuda_env) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll repeat what I said in the last PR: don't you want to check and see if there are CUDA devices present before you do this? Just being compiled for CUDA support is not any kind of guarantee that there are CUDA devices present.
Indeed, if someone uses an Open MPI compiled with CUDA support (e.g., via OpenHPC) in an environment without CUDA devices, it will be quite confusing to them as to why they are getting help messages about CUDA support (because it won't matter at all).
Run-time sensing what devices are present is a big theme of Open MPI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll add some CUDA device detection here. Can I simply issue lspci -nn | grep -i "10de"
and check the return value? Any other suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that will not scale (imagine N processes on a server all executing that at once).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We literally just disabled hwloc CUDA device detection in master (c341b53) per #4257 (comment). @sjeaugey is this a good reason to turn it back on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some CUDA device detection logic in the updated patch.
ompi/mca/mtl/psm2/help-mtl-psm2.txt
Outdated
without enabling CUDA support on PSM2 library. Open MPI has therefore defaulted | ||
to setting PSM2_CUDA=1. This may impact performance if NOT running CUDA aware | ||
applications. Set your environment with variable PSM2_CUDA equal to 1 to clear | ||
this message, or set it to 0 to hint PSM2 that no CUDA support is needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor tweak suggestions for this message (per below, I'm assuming you'll only emit this message when there are CUDA devices present):
Warning: Open MPI has detected that you are running in an environment with CUDA devices present and that you are using Intel(r) Ompi-Path networking. However, the environment variable PSM2_CUDA was not set, meaning that the PSM2 Omni-Path networking library was not told how to handle CUDA support.
If your application uses CUDA buffers, you should set the environment variable PSM2_CUDA to 1; otherwise, set it to 0. Setting the variable to the wrong value can have performance implications on your application, or even cause it to crash.
Since it was not set, Open MPI has defaulted to setting the PSM2_CUDA environment variable to 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, Will fix this
} else if (strcmp(cuda_env, "0") == 0) { | ||
opal_show_help("help-mtl-psm2.txt", | ||
"psm2 cuda env zero", true, | ||
ompi_process_info.nodename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're showing a help message:
- If
PSM2_CUDA
was not set - If
PSM2_CUDA
was set to 0
The 2nd message there is... a bit weird. So even if I (a user) am doing what I am supposed to be doing (i.e., telling you that I am not using CUDA buffers in my application), you're going to emit a warning message. If you really want to do that, feel free -- this is Intel's plugin. But I think that's downright weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intent was to inform user that only host buffers will work with PSM2_CUDA=0
. I'll remove it. Should be reasonable to assume user will already know that.
63e4e8d
to
fb0d32b
Compare
#if OPAL_CUDA_SUPPORT | ||
int ret; | ||
char *cuda_env; | ||
glob_t globbuf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need to #include <glob.h>
to guarantee that this will work.
Is PSM/PSM2 Linux-only? I.e., do you need to add a test for glob()
and/or <glob.h>
to this component's configure.ac
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glob.h
is already included. (There is an existing usage of glob()
in line 279). PSM2 is Linux only and glob.h
is included in glibc-headers package. So, don't think we need a check for this in configure.ac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops -- missed that. You're right.
@@ -389,6 +378,27 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads, | |||
ompi_mtl_psm2_set_shadow_env (ompi_mtl_psm2_shadow_variables + i); | |||
} | |||
|
|||
#if OPAL_CUDA_SUPPORT | |||
/* | |||
* If using CUDA enabled OpenMPI, the user likely intends to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Open MPI" -- 2 words. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry about that. I thought I fixed this.
int ret; | ||
char *cuda_env; | ||
glob_t globbuf; | ||
globbuf.gl_offs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to declare glob_t globbuf = {0}
to just guarantee that the entire instance is zeroed out? This might also make freeing the globbuf
memory easier later, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
* to set it. | ||
*/ | ||
ret = glob("/sys/module/nvidia", GLOB_DOOFFS, NULL, &globbuf); | ||
if (0 == ret || GLOB_NOMATCH == ret) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call globfree()
if NOMATCH
was returned? Doesn't NOMATCH
mean that glob.gl_pathv
is empty?
I have not used glob()
before, so I don't know exactly how it behaves - should you check glob.gl_pathv
for non-NULL (particularly if you initialize globbuf.gl_pathv
with {0}
, above) to know if you need to call globfree()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we set GLOB_NOCHECK
flag, I think we can expect globbuf.gl_pathv
to be empty if the pattern didn't match. (I verified this on a sandbox too). So, I'll modify this to check:
if (globbuf.gl_argc > 0) {
globfree(&globbuf)
}
FYI- The current check I have was basically following the existing usage in lines 259, 264. Which itself seems to have been introduced in commit 1daa80d (plug a memory leak in ompi_mtl_psm2_component_open). I don't think this commit was cherry-picked to other branches though. If this is fixing a memory leak, should this commit be ported to v2.x, v3.0.x?
While at it, shall I modify the above usage to also initialize the entire struct to {0}
and to globfree()
only if (globbuf.gl_argc > 0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the answer is "yes" to all your questions:
- Cherry pick as relevant to v2.x, v3.0.x, and v3.x (since this is fixing a [minor] bug).
- Skip v2.0.x -- that series is effectively dead. We're really only taking serious bug fixes to that series now (this minor memory leak is not serious enough).
- I like your idea of initializing the entire struct with
{0}
and onlyglobfree()
ing ifglobbuf.gl_argc > 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this 👍
@@ -45,6 +46,10 @@ static int param_priority; | |||
/* MPI_THREAD_MULTIPLE_SUPPORT */ | |||
opal_mutex_t mtl_psm2_mq_mutex = OPAL_MUTEX_STATIC_INIT; | |||
|
|||
#if OPAL_CUDA_SUPPORT | |||
static int cuda_envvar_set; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use bool
here -- Open MPI requires a C99 compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do that.
fb0d32b
to
bea4503
Compare
The messages should be printed only in the event of CUDA builds and in the presence of supporting hardware and when PSM2 MTL has actually been selected for use. To this end, move help text output to component init phase. Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment variable and sanitize the help text message. Signed-off-by: Aravind Gopalakrishnan <[email protected]>
ok to test |
Thanks @rhc54 |
The messages should be printed only in the event of CUDA builds and when
PSM2 MTL has actually been selected for use. To this end,
move help text output to component init phase.
Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.
Signed-off-by: Aravind Gopalakrishnan [email protected]