Add support for GPU buffers for PSM2 MTL #4172
Conversation
Can one of the admins verify this patch?
The PR message mentions a conflict on opal/datatype/opal_convertor.h. I don't see any ...
@bosilca , I had resolved the conflicts before committing the changes. There were two complaints from git:
I left out this macro definition to resolve the conflict resulting from the cherry-pick.
ok to test
@jsquyres unfortunately for now we don't know what causes depcomp to be missing. We'll try to keep an eye on this failure.
But I saw this
Which may be a clue. Will try to investigate.
@artpol84 Cool. I think we saw this same symptom quite a while ago -- the conjecture at the time was that somehow there were directory trees being re-used by different Jenkins builds. Or somehow the autotools themselves were being re-installed behind the scenes (i.e., outside of Jenkins), which caused this race condition. Good luck.
@jsquyres this should be fixed now; please let me know if it is observed again.
@artpol84 Great, thank you! Just curious (since this went on for so long) -- what was the issue?
Re-run on the Cray to fix the opal_path_nfs test issue: bot:ompi:retest
@jsquyres it was infrastructure-related.
Looks like the Cray went offline during the test... bot:ompi:retest
Ping @bosilca : could you please Ack this PR as well? The code has already been reviewed and merged to master and the v3.0.x branch (PRs 4143 and 4171).
The RMs for the 2.1.x release are reluctant to take this feature unless it's fixing a correctness bug.
Hi @hppritcha, this is not a bug fix, but a new feature. Without this patch OMPI does not support the GPU Direct functionality in the PSM2 library.
Ok, I'm correcting the label to a performance bug and retracting my previous comment about new feature. Here is the rationale: CUDA support is already part of OMPI. Without this fix, OMPI built with CUDA support runs significantly slower on a CUDA-aware PSM2 library.
ompi/mca/mtl/psm2/mtl_psm2.c (outdated)

```c
cuda_env = getenv("PSM2_CUDA");
if (!cuda_env || ( strcmp(cuda_env, "0") == 0) )
    opal_output(0, "Warning: If running with device buffers, there is a"
                   " chance the application might fail. Try setting PSM2_CUDA=1.\n");
```
Please use opal_show_help() for these kinds of messages, because opal_show_help() deduplicates such messages when outputting to mpirun. There's also very little context given in this message to inform the user that it is being emitted from the Open MPI PSM2 code.
OK, will fix this.
ompi/mca/mtl/psm2/mtl_psm2.c (outdated)

```c
ompi_mtl_psm2.super.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;

cuda_env = getenv("PSM2_CUDA");
if (!cuda_env || ( strcmp(cuda_env, "0") == 0) )
```
Is PSM2_CUDA an environment variable that is exported by the PSM2 library? Or is it an env variable that is solely being used by Open MPI? If it is solely used by the PSM2 support in Open MPI, it should be an MCA variable.
It is an envvar exported by the PSM2 library.
It is actually not "exported" but "read" by PSM2.
ompi/mca/pml/cm/pml_cm_recvreq.h (outdated)

```c
opal_convertor_copy_and_prepare_for_recv( \
    ompi_mpi_local_convertor,             \
    &(datatype->super),                   \
    count,                                \
    addr,                                 \
    0,                                    \
    flags,                                \
```
Nit: since the rest of the code here is scrupulously pretty, you might want to re-indent the trailing \ at the end of this line.
Will fix.
ompi/mca/pml/cm/pml_cm_recvreq.h (outdated)

```diff
@@ -153,6 +158,7 @@ do {        \
     datatype, \
     addr,     \
     count,    \
+    flags,    \
```
Ditto with above -- you might want to re-indent flags to match the others (beware of tabs vs. spaces -- Open MPI style requires spaces and bans tabs). This happens in a few other places in this PR, too.
Will fix the indentation. Thought I converted tabs to spaces, but I'll double check.
@jsquyres Since master and the v3.0.x branch have already pulled these changes in, shall I generate separate PRs to fix the warning message and indentation issues there?
Do a separate PR for master to make the changes. Once that PR is merged in, you can cherry-pick the commit(s) from that master PR to this PR (i.e., just add the commits here -- no need to make a 2nd PR). Make sense?
@jsquyres , a patch to address your concerns has been merged to master and cherry-picked here. Could you also please review?
ompi/mca/mtl/psm2/help-mtl-psm2.txt (outdated)

```diff
@@ -45,3 +45,7 @@ Unknown path record query mechanism %s. Supported mechanisms are %s.
 #
 [message too big]
 Message size %llu bigger than supported by PSM2 API. Max = %llu
+#
+[no psm2 cuda env]
+Using CUDA enabled OpenMPI but PSM2_CUDA environment variable is %s.
```
Open MPI is two words, not one.
Also, please use full sentences and good grammar in these messages. These are meant to be formal help messages displayed to the user, not "good enough" messages that are only intended for developers. Also, the way you are adding an additional lengthy string via the call to opal_show_help() is... unconventional.
More below, where you actually call opal_show_help().
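One possible way to split the entry into two self-contained topics in the help file, as suggested (the topic names and wording here are illustrative, not the text that was ultimately committed):

```
#
[cuda env unset]
Open MPI has detected that the PSM2_CUDA environment variable was not
set, and has defaulted to setting PSM2_CUDA=1.

  Local hostname: %s
#
[cuda env set to zero]
Open MPI has detected that PSM2_CUDA=0 while CUDA support is enabled.
If your application uses GPU buffers, it may fail; set PSM2_CUDA=1.

  Local hostname: %s
```

Each [topic] block is then passed by name to opal_show_help(), so the full message lives in the help file rather than being assembled from string fragments at the call site.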
```c
                   "Host buffers,\nthere will be a performance penalty"
                   " due to OMPI force setting this variable now.\n"
                   "Set environment variable to 0 if using Host buffers" );
    setenv("PSM2_CUDA", "1", 0);
```
Can't this lead to a segv if you put the "1" string into the environment, but then the PSM2 MTL is dlclose()d (because the PSM2 MTL was not selected)?
It seems to be a harmless setting. I tested this by increasing the OFI MTL priority to be higher than PSM2, and I did see the environment variable being set, but the application itself worked fine. Also, the PSM2_DEVICES envvar setting is now done in the component registration phase (PR #3834).
I'm not 100% sure it's harmless. I think you may be leaving now-unallocated memory in the environ array when the PSM MTL is unloaded. It probably won't cause problems, but if anyone goes to try to read it -- especially if that virtual address is no longer valid -- it may cause problems. You might be able to test that a little more rigorously (this is one reason that passing values around in the environment is kinda sucky).
If I'm right, it may be best to strdup() the "1", even though that technically makes a minor memory leak.
And/or you might defer this setenv until you know that PSM2 is going to be used. Then you will have a much smaller window for the bogus memory pointer in environ to cause a problem (i.e., because it won't become bogus until the PSM MTL is closed during MPI_FINALIZE, but that's usually on the way out of the process, anyway).
```c
                   "not set",
                   "Host buffers,\nthere will be a performance penalty"
                   " due to OMPI force setting this variable now.\n"
                   "Set environment variable to 0 if using Host buffers" );
```
Do not use "OMPI" in a show_help string -- the product name is "Open MPI" (two words).
I would suggest that you should have 2 different show_help messages here, rather than feeding in a lengthy string to opal_show_help().
Additionally -- this message is being displayed during the component registration phase. Isn't that far too early? I.e., won't it display on systems that don't even have PSM2 and/or CUDA hardware available?
I would think that you should only display these messages if a) the PSM2 MTL is actually selected and b) there are CUDA devices in the system. Otherwise, if this message is emitted when the user doesn't have PSM2 hardware or doesn't have CUDA devices, the user will (rightfully) be quite confused. For example, you might replace this message with:

> Warning: Open MPI has detected both PSM2-based hardware and CUDA hardware, but the PSM2_CUDA environment variable was not set. Open MPI has therefore defaulted to setting PSM2_CUDA=1.
> This can lead to a performance penalty, however, if your application uses host buffers. You should set the PSM2_CUDA environment variable to 0 before invoking mpirun to both silence this warning and provide the hint to Open MPI / PSM2 that your application will mainly be using host buffers.
> Local hostname: %s

(and include the local hostname in the message, just to let the user know which machine Open MPI is talking about)
You can have a similar message for the 2nd opal_show_help(), below.
Also, keep in mind that the component registration function is really only intended to be used to register MCA vars -- it is not actually intended to do anything in terms of initialization or setup.
If you really, really, really have to set the PSM2_CUDA environment variable all the way up here in the component registration function, you should include a lengthy comment about why you have to do so. And then you'll need to set some kind of state variable (on the component?) indicating that you did so, so that you can emit an appropriate show_help message later (i.e., if/when the PSM2 MTL is selected to be used).
Make sense?
@jsquyres , will fix the help text output in a separate commit against master and then cherry-pick here.
FWIW: our normal procedure is to use opal_setenv to set it, mark in a flag that we did so, and then use opal_unsetenv to unset it in the close function. This protects the user from what @jsquyres describes.
Oh, yes, unsetenv'ing it in the component close function is a much better idea. 😄
PSM2 enables support for GPU buffers and CUDA managed memory and it can directly recognize GPU buffers, handle copies between HFIs and GPUs. Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases. In this patch, we allow the PSM2 MTL to specify when it does not require CUDA convertor support. This allows us to skip CUDA convertor init phases and lets PSM2 handle the memory transfers. This translates to improvements in latency. The patch enables blocking collectives and workloads with GPU contiguous, GPU non-contiguous memory. (cherry picked from commit 2e83cf1) Signed-off-by: Aravind Gopalakrishnan <[email protected]> Conflicts: opal/datatype/opal_convertor.h
If Open MPI is configured with CUDA, then user also should be using a CUDA build of PSM2 and therefore be setting PSM2_CUDA environment variable to 1 while using CUDA buffers for transfers. If we detect this setting to be missing, force set it. If user wants to use this build for regular (Host buffer) transfers, we allow the option of setting PSM2_CUDA=0, but print a warning message to user that it is not a recommended usage scenario. (cherry picked from commit f8a2b7f) Signed-off-by: Aravind Gopalakrishnan <[email protected]> Conflicts: ompi/mca/mtl/psm2/mtl_psm2_component.c
Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit 1daa80d) Conflicts: ompi/mca/mtl/psm2/mtl_psm2_component.c
The messages should be printed only in the event of CUDA builds and in the presence of supporting hardware and when PSM2 MTL has actually been selected for use. To this end, move help text output to component init phase. Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment variable and sanitize the help text message. Signed-off-by: Aravind Gopalakrishnan <[email protected]> (cherry picked from commit bea4503) Conflicts: ompi/mca/mtl/psm2/mtl_psm2_component.c
bot:ompi:retest
I have only read the code; it isn't possible for me to compile or test it.
@jsquyres : To allay any fears regarding testing: I have compile-tested this and also ran combinations of both CUDA and non-CUDA workloads against the resulting libmpi. They work fine 👍
@aravindksg I figured. I just wanted to qualify my review. 😄
@hppritcha I'm good with this PR.