Skip to content

Move help text output regarding PSM2_CUDA environment variable #4323

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 27, 2017

Conversation

aravindksg
Copy link
Contributor

The messages should be printed only in the event of CUDA builds and when
PSM2 MTL has actually been selected for use. To this end,
move help text output to component init phase.

Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.

Signed-off-by: Aravind Gopalakrishnan [email protected]

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@jjhursey
Copy link
Member

test this please (this phrase will trigger our CI full testing)

@open-mpi open-mpi deleted a comment from ibm-ompi Oct 10, 2017
@jjhursey
Copy link
Member

bot:ibm:pgi:retest

@@ -334,6 +312,10 @@ ompi_mtl_psm2_component_query(mca_base_module_t **module, int *priority)
static int
ompi_mtl_psm2_component_close(void)
{
#if OPAL_CUDA_SUPPORT
if (cuda_envvar_set)
opal_unsetenv("PSM2_CUDA", &environ);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: please use {}, even for 1-line blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this

* to set it.
*/
cuda_env = getenv("PSM2_CUDA");
if (!cuda_env) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll repeat what I said in the last PR: don't you want to check and see if there are CUDA devices present before you do this? Just being compiled for CUDA support is not any kind of guarantee that there are CUDA devices present.

Indeed, if someone uses an Open MPI compiled with CUDA support (e.g., via OpenHPC) in an environment without CUDA devices, it will be quite confusing to them as to why they are getting help messages about CUDA support (because it won't matter at all).

Run-time sensing what devices are present is a big theme of Open MPI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll add some CUDA device detection here. Can I simply issue lspci -nn | grep -i "10de"

and check the return value? Any other suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that will not scale (imagine N processes on a server all executing that at once).

Copy link
Member

@jsquyres jsquyres Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We literally just disabled hwloc CUDA device detection in master (c341b53) per #4257 (comment). @sjeaugey is this a good reason to turn it back on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some CUDA device detection logic in the updated patch.

without enabling CUDA support on PSM2 library. Open MPI has therefore defaulted
to setting PSM2_CUDA=1. This may impact performance if NOT running CUDA aware
applications. Set your environment with variable PSM2_CUDA equal to 1 to clear
this message, or set it to 0 to hint PSM2 that no CUDA support is needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweak suggestions for this message (per below, I'm assuming you'll only emit this message when there are CUDA devices present):

Warning: Open MPI has detected that you are running in an environment with CUDA devices present and that you are using Intel(r) Ompi-Path networking. However, the environment variable PSM2_CUDA was not set, meaning that the PSM2 Omni-Path networking library was not told how to handle CUDA support.

If your application uses CUDA buffers, you should set the environment variable PSM2_CUDA to 1; otherwise, set it to 0. Setting the variable to the wrong value can have performance implications on your application, or even cause it to crash.

Since it was not set, Open MPI has defaulted to setting the PSM2_CUDA environment variable to 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Will fix this

} else if (strcmp(cuda_env, "0") == 0) {
opal_show_help("help-mtl-psm2.txt",
"psm2 cuda env zero", true,
ompi_process_info.nodename);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're showing a help message:

  • If PSM2_CUDA was not set
  • If PSM2_CUDA was set to 0

The 2nd message there is... a bit weird. So even if I (a user) am doing what I am supposed to be doing (i.e., telling you that I am not using CUDA buffers in my application), you're going to emit a warning message. If you really want to do that, feel free -- this is Intel's plugin. But I think that's downright weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intent was to inform user that only host buffers will work with PSM2_CUDA=0. I'll remove it. Should be reasonable to assume user will already know that.

#if OPAL_CUDA_SUPPORT
int ret;
char *cuda_env;
glob_t globbuf;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to #include <glob.h> to guarantee that this will work.

Is PSM/PSM2 Linux-only? I.e., do you need to add a test for glob() and/or <glob.h> to this component's configure.ac?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glob.h is already included. (There is an existing usage of glob() in line 279). PSM2 is Linux only and glob.h is included in glibc-headers package. So, don't think we need a check for this in configure.ac

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops -- missed that. You're right.

@@ -389,6 +378,27 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
ompi_mtl_psm2_set_shadow_env (ompi_mtl_psm2_shadow_variables + i);
}

#if OPAL_CUDA_SUPPORT
/*
* If using CUDA enabled OpenMPI, the user likely intends to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Open MPI" -- 2 words. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about that. I thought I fixed this.

int ret;
char *cuda_env;
glob_t globbuf;
globbuf.gl_offs = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to declare glob_t globbuf = {0} to just guarantee that the entire instance is zeroed out? This might also make freeing the globbuf memory easier later, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

* to set it.
*/
ret = glob("/sys/module/nvidia", GLOB_DOOFFS, NULL, &globbuf);
if (0 == ret || GLOB_NOMATCH == ret) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call globfree() if NOMATCH was returned? Doesn't NOMATCH mean that glob.gl_pathv is empty?

I have not used glob() before, so I don't know exactly how it behaves - should you check glob.gl_pathv for non-NULL (particularly if you initialize globbuf.gl_pathv with {0}, above) to know if you need to call globfree()?

Copy link
Contributor Author

@aravindksg aravindksg Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we set GLOB_NOCHECK flag, I think we can expect globbuf.gl_pathv to be empty if the pattern didn't match. (I verified this on a sandbox too). So, I'll modify this to check:

if (globbuf.gl_argc > 0) {
  globfree(&globbuf)
}

FYI- The current check I have was basically following the existing usage in lines 259, 264. Which itself seems to have been introduced in commit 1daa80d (plug a memory leak in ompi_mtl_psm2_component_open). I don't think this commit was cherry-picked to other branches though. If this is fixing a memory leak, should this commit be ported to v2.x, v3.0.x?

While at it, shall I modify the above usage to also initialize the entire struct to {0} and to globfree() only if (globbuf.gl_argc > 0)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the answer is "yes" to all your questions:

  • Cherry pick as relevant to v2.x, v3.0.x, and v3.x (since this is fixing a [minor] bug).
    • Skip v2.0.x -- that series is effectively dead. We're really only taking serious bug fixes to that series now (this minor memory leak is not serious enough).
  • I like your idea of initializing the entire struct with {0} and only globfree()ing if globbuf.gl_argc > 0.

Copy link
Contributor Author

@aravindksg aravindksg Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this 👍

@@ -45,6 +46,10 @@ static int param_priority;
/* MPI_THREAD_MULTIPLE_SUPPORT */
opal_mutex_t mtl_psm2_mq_mutex = OPAL_MUTEX_STATIC_INIT;

#if OPAL_CUDA_SUPPORT
static int cuda_envvar_set;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use bool here -- Open MPI requires a C99 compiler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do that.

The messages should be printed only in the event of CUDA builds and in the
presence of supporting hardware and when PSM2 MTL has actually been selected
for use. To this end, move help text output to component init phase.

Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
@jsquyres
Copy link
Member

ok to test

@rhc54 rhc54 merged commit df48ddd into open-mpi:master Oct 27, 2017
@aravindksg
Copy link
Contributor Author

Thanks @rhc54

@aravindksg aravindksg deleted the fix_help_text branch November 6, 2017 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants