Move help text output regarding PSM2_CUDA environment variable #4323

aravindksg · 2017-10-10T19:19:38Z

The messages should be printed only in the event of CUDA builds and when
PSM2 MTL has actually been selected for use. To this end,
move help text output to component init phase.

Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.

Signed-off-by: Aravind Gopalakrishnan [email protected]

ompiteam-bot · 2017-10-10T19:19:41Z

Can one of the admins verify this patch?

jjhursey · 2017-10-10T19:21:04Z

test this please (this phrase will trigger our CI full testing)

jjhursey · 2017-10-10T20:25:17Z

bot:ibm:pgi:retest

jsquyres · 2017-10-10T20:43:35Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

@@ -334,6 +312,10 @@ ompi_mtl_psm2_component_query(mca_base_module_t **module, int *priority)
 static int
 ompi_mtl_psm2_component_close(void)
 {
+#if OPAL_CUDA_SUPPORT
+    if (cuda_envvar_set)
+        opal_unsetenv("PSM2_CUDA", &environ);


Minor nit: please use {}, even for 1-line blocks.

jsquyres · 2017-10-10T20:45:11Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

+     * to set it.
+     */
+    cuda_env = getenv("PSM2_CUDA");
+    if (!cuda_env) {


I'll repeat what I said in the last PR: don't you want to check and see if there are CUDA devices present before you do this? Just being compiled for CUDA support is not any kind of guarantee that there are CUDA devices present.

Indeed, if someone uses an Open MPI compiled with CUDA support (e.g., via OpenHPC) in an environment without CUDA devices, it will be quite confusing to them as to why they are getting help messages about CUDA support (because it won't matter at all).

Run-time sensing what devices are present is a big theme of Open MPI.

Ok, I'll add some CUDA device detection here. Can I simply issue lspci -nn | grep -i "10de" and check the return value? Any other suggestions?

No, that will not scale (imagine N processes on a server all executing that at once).

We literally just disabled hwloc CUDA device detection in master (c341b53) per #4257 (comment). @sjeaugey is this a good reason to turn it back on?

Added some CUDA device detection logic in the updated patch.

jsquyres · 2017-10-10T20:48:57Z

ompi/mca/mtl/psm2/help-mtl-psm2.txt

+without enabling CUDA support on PSM2 library. Open MPI has therefore defaulted
+to setting PSM2_CUDA=1. This may impact performance if NOT running CUDA aware
+applications. Set your environment with variable PSM2_CUDA equal to 1 to clear
+this message, or set it to 0 to hint PSM2 that no CUDA support is needed.


Minor tweak suggestions for this message (per below, I'm assuming you'll only emit this message when there are CUDA devices present):

Warning: Open MPI has detected that you are running in an environment with CUDA devices present and that you are using Intel(r) Omni-Path networking. However, the environment variable PSM2_CUDA was not set, meaning that the PSM2 Omni-Path networking library was not told how to handle CUDA support.

If your application uses CUDA buffers, you should set the environment variable PSM2_CUDA to 1; otherwise, set it to 0. Setting the variable to the wrong value can have performance implications on your application, or even cause it to crash.

Since it was not set, Open MPI has defaulted to setting the PSM2_CUDA environment variable to 1.

Ok, will fix this.

jsquyres · 2017-10-10T20:52:46Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

+    } else if (strcmp(cuda_env, "0") == 0) {
+        opal_show_help("help-mtl-psm2.txt",
+                       "psm2 cuda env zero", true,
+                       ompi_process_info.nodename);


So you're showing a help message:

If PSM2_CUDA was not set

If PSM2_CUDA was set to 0

The 2nd message there is... a bit weird. So even if I (a user) am doing what I am supposed to be doing (i.e., telling you that I am not using CUDA buffers in my application), you're going to emit a warning message. If you really want to do that, feel free -- this is Intel's plugin. But I think that's downright weird.

The intent was to inform the user that only host buffers will work with PSM2_CUDA=0. I'll remove it; it should be reasonable to assume the user already knows that.
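For reference, the behavior this thread converges on can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum and function names are hypothetical and do not appear in the patch, and the device-presence flag is assumed to come from whatever run-time detection the component performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes; these names are illustrative, not from the patch. */
typedef enum {
    PSM2_CUDA_LEAVE_ALONE,  /* user chose a value, or no CUDA device: stay quiet */
    PSM2_CUDA_DEFAULT_TO_1  /* unset on a CUDA node: warn and default to 1 */
} psm2_cuda_action_t;

/*
 * cuda_env is the result of getenv("PSM2_CUDA") (NULL if unset).
 * cuda_device_present is assumed to come from run-time device detection.
 */
static psm2_cuda_action_t
decide_psm2_cuda_action(const char *cuda_env, bool cuda_device_present)
{
    if (!cuda_device_present) {
        /* A CUDA-enabled build on a node without CUDA devices: no message. */
        return PSM2_CUDA_LEAVE_ALONE;
    }
    if (NULL == cuda_env) {
        /* The user gave no hint, so the component warns and sets the default. */
        return PSM2_CUDA_DEFAULT_TO_1;
    }
    /* Any explicit value, including "0", is respected without a warning. */
    return PSM2_CUDA_LEAVE_ALONE;
}
```

With this shape, a user who explicitly sets PSM2_CUDA=0 never sees a help message, which matches the outcome agreed above.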

jsquyres · 2017-10-26T14:47:55Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

+#if OPAL_CUDA_SUPPORT
+    int ret;
+    char *cuda_env;
+    glob_t globbuf;


You probably need to #include <glob.h> to guarantee that this will work.

Is PSM/PSM2 Linux-only? I.e., do you need to add a test for glob() and/or <glob.h> to this component's configure.ac?

glob.h is already included (there is an existing usage of glob() in line 279). PSM2 is Linux-only, and glob.h is included in the glibc-headers package. So I don't think we need a check for this in configure.ac.

Oops -- missed that. You're right.

jsquyres · 2017-10-26T14:48:19Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

@@ -389,6 +378,27 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
        ompi_mtl_psm2_set_shadow_env (ompi_mtl_psm2_shadow_variables + i);
    }

+#if OPAL_CUDA_SUPPORT
+    /*
+     * If using CUDA enabled OpenMPI, the user likely intends to


"Open MPI" -- 2 words. 😄

Sorry about that. I thought I had fixed this.

jsquyres · 2017-10-26T14:52:30Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

+    int ret;
+    char *cuda_env;
+    glob_t globbuf;
+    globbuf.gl_offs = 0;


Do you want to declare glob_t globbuf = {0} to just guarantee that the entire instance is zeroed out? This might also make freeing the globbuf memory easier later, too.

Sure, will do.

jsquyres · 2017-10-26T14:53:46Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

+     * to set it.
+     */
+    ret = glob("/sys/module/nvidia", GLOB_DOOFFS, NULL, &globbuf);
+    if (0 == ret ||  GLOB_NOMATCH == ret) {


Why call globfree() if NOMATCH was returned? Doesn't NOMATCH mean that globbuf.gl_pathv is empty?

I have not used glob() before, so I don't know exactly how it behaves. Should you check globbuf.gl_pathv for non-NULL (particularly if you initialize globbuf with {0}, above) to know whether you need to call globfree()?

Unless we set the GLOB_NOCHECK flag, I think we can expect globbuf.gl_pathv to be empty if the pattern didn't match. (I verified this in a sandbox too.) So, I'll modify this to check:

if (globbuf.gl_pathc > 0) { globfree(&globbuf); }

FYI: the current check basically follows the existing usage in lines 259 and 264, which itself seems to have been introduced in commit 1daa80d (plug a memory leak in ompi_mtl_psm2_component_open). I don't think this commit was cherry-picked to other branches, though. If it fixes a memory leak, should it be ported to v2.x and v3.0.x?

While at it, shall I modify the above usage to also initialize the entire struct to {0} and to globfree() only if (globbuf.gl_pathc > 0)?

I think the answer is "yes" to all your questions:

Cherry pick as relevant to v2.x, v3.0.x, and v3.x (since this is fixing a [minor] bug).

Skip v2.0.x -- that series is effectively dead. We're really only taking serious bug fixes to that series now (this minor memory leak is not serious enough).

I like your idea of initializing the entire struct with {0} and only globfree()ing if globbuf.gl_pathc > 0.

Fixed this 👍

jsquyres · 2017-10-26T14:54:24Z

ompi/mca/mtl/psm2/mtl_psm2_component.c

@@ -45,6 +46,10 @@ static int param_priority;
 /* MPI_THREAD_MULTIPLE_SUPPORT */
 opal_mutex_t mtl_psm2_mq_mutex = OPAL_MUTEX_STATIC_INIT;

+#if OPAL_CUDA_SUPPORT
+static int cuda_envvar_set;


You can use bool here -- Open MPI requires a C99 compiler.

Ok, will do that.
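The role of the flag can be sketched as follows: remember at init time whether the component itself set PSM2_CUDA, and undo that only in the matching close. This sketch substitutes the POSIX setenv()/unsetenv() calls for the opal_setenv()/opal_unsetenv() wrappers used in the patch so that it compiles on its own, and the two function names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* True only if this code, not the user, set PSM2_CUDA. */
static bool cuda_envvar_set = false;

/* Hypothetical stand-in for the component init path. */
static int
component_init_sketch(void)
{
    if (NULL == getenv("PSM2_CUDA")) {
        /* The user gave no value: default to 1 and remember that we did. */
        if (0 != setenv("PSM2_CUDA", "1", 0)) {
            return -1;
        }
        cuda_envvar_set = true;
    }
    return 0;
}

/* Hypothetical stand-in for the component close path. */
static int
component_close_sketch(void)
{
    if (cuda_envvar_set) {
        /* Remove only what init added; a user-provided value is left alone. */
        unsetenv("PSM2_CUDA");
        cuda_envvar_set = false;
    }
    return 0;
}
```

A user-provided PSM2_CUDA=0 therefore survives both calls unchanged, while a default set by the init path is removed again at close.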

The messages should be printed only in the event of CUDA builds and in the presence of supporting hardware and when PSM2 MTL has actually been selected for use. To this end, move help text output to component init phase.

Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment variable and sanitize the help text message.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>

jsquyres · 2017-10-26T23:01:24Z

ok to test

aravindksg · 2017-10-27T15:41:26Z

Thanks @rhc54

open-mpi deleted a comment from ibm-ompi Oct 10, 2017

jsquyres requested changes Oct 10, 2017


aravindksg force-pushed the fix_help_text branch from 63e4e8d to fb0d32b Compare October 25, 2017 18:54

jsquyres requested changes Oct 26, 2017


aravindksg force-pushed the fix_help_text branch from fb0d32b to bea4503 Compare October 26, 2017 22:55

jsquyres approved these changes Oct 26, 2017


rhc54 merged commit df48ddd into open-mpi:master Oct 27, 2017

aravindksg mentioned this pull request Oct 30, 2017

Add support for GPU buffers for PSM2 MTL #4172

Merged

aravindksg deleted the fix_help_text branch November 6, 2017 19:50
