Add support for GPU buffers for PSM2 MTL #4172


Merged: 4 commits merged into open-mpi:v2.x on Oct 31, 2017

Conversation

aravindksg
Contributor

PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.

This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.

(cherry picked from commit 2e83cf1)
Signed-off-by: Aravind Gopalakrishnan [email protected]

Conflicts:
opal/datatype/opal_convertor.h

@ompiteam-bot

Can one of the admins verify this patch?

@matcabral matcabral requested review from bosilca and hjelmn September 5, 2017 16:58
@matcabral matcabral added this to the v2.0.4 milestone Sep 5, 2017
@matcabral matcabral modified the milestones: v2.1.3, v2.0.4 Sep 5, 2017
@jsquyres jsquyres removed their assignment Sep 5, 2017
@bosilca
Member

bosilca commented Sep 5, 2017

The PR message mentions a conflict on opal/datatype/opal_convertor.h. I don't see any ...

@jsquyres
Member

jsquyres commented Sep 5, 2017

@artpol84 @Di0gen This looks like the same periodic failure we see from Mellanox Jenkins: it claims that depcomp does not exist. Have you ever been able to track this down?

bot:mellanox:retest

@aravindksg
Contributor Author

@bosilca , I had resolved the conflicts before committing the changes. There were two complaints from git:

  1. Regarding one of the copyright lines at the top of the file. (Resolution was to only introduce the relevant copyright line to the patch)
  2. Regarding the macro definition:
    #define CONVERTOR_HAS_REMOTE_SIZE 0x20000000

I left out this macro definition to resolve the conflict resulting from the cherry-pick.

@bwbarrett
Member

ok to test

@artpol84
Contributor

artpol84 commented Sep 6, 2017

@jsquyres unfortunately for now we don't know what causes depcomp to be missing. We'll try to keep an eye on this failure.

@artpol84
Contributor

artpol84 commented Sep 6, 2017

But I saw this

20:08:57 ompi/Makefile.am: installing 'config/depcomp'
20:08:58 parallel-tests: installing 'config/test-driver'
20:09:17 configure.ac: installing 'config/ylwrap'
20:09:45 cp: skipping file ‘/hpc/local/share/automake-1.15/depcomp’, as it was replaced while being copied
20:09:56 autoreconf: Leaving directory `.'

Which may be a clue. Will try to investigate.

@jsquyres
Member

jsquyres commented Sep 6, 2017

@artpol84 Cool. I think we saw this same symptom quite a while ago -- the conjecture at the time was that somehow there were directory trees being re-used by different Jenkins builds. Or somehow the autotools themselves were being re-installed behind the scenes (i.e., outside of Jenkins), which caused this race condition. Good luck.

@artpol84
Contributor

artpol84 commented Sep 7, 2017

@jsquyres this should be fixed now; please let me know if you observe it again.

@jsquyres
Member

jsquyres commented Sep 7, 2017

@artpol84 Great, thank you! Just curious (since this went on for so long) -- what was the issue?

Re-run on the Cray to fix the opal_path_nfs test issue:

bot:ompi:retest

@artpol84
Contributor

artpol84 commented Sep 7, 2017

@jsquyres it was infrastructure-related. depcomp was being "updated" quite frequently without need; cp was confused by a depcomp file that was "modified" during the copy operation, so the copy was aborted, and the missing depcomp later caused the failure.

@bwbarrett
Member

Looks like the Cray went offline during the test...

bot:ompi:retest

@aravindksg
Contributor Author

Ping @bosilca: could you please Ack this PR as well? The code has already been reviewed and merged to master and the v3.0.x branch (PRs #4143 and #4171).

@hppritcha
Member

The RMs for the 2.1.x release are reluctant to take this feature unless it's fixing a correctness bug.

@matcabral
Contributor

Hi @hppritcha, this is not a bug fix, but a new feature. Without this patch, OMPI does not support the GPU Direct functionality in the PSM2 library.
Does this mean that new features will ONLY be accepted in 3.x?
thanks,

@matcabral
Contributor

Ok, I'm correcting the label to a performance bug and retracting my previous comment about this being a new feature. Here is the rationale: CUDA support is already part of OMPI, and without this fix, OMPI built with CUDA support runs significantly slower on a CUDA-aware PSM2 library.

cuda_env = getenv("PSM2_CUDA");
if (!cuda_env || ( strcmp(cuda_env, "0") == 0) )
    opal_output(0, "Warning: If running with device buffers, there is a"
                   " chance the application might fail. Try setting PSM2_CUDA=1.\n");
Member

Please use opal_show_help() for these kinds of messages (because opal_show_help() deduplicates such messages when outputting to mpirun). There's also very little context given in this message to inform the user that the message is being emitted from the Open MPI PSM2 code.
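For context, opal_show_help(const char *filename, const char *topic, int want_error_header, ...) reads its text from per-component help files keyed by a [topic] header, which is what allows the runtime to deduplicate repeated messages when aggregating output through mpirun. A sketch of such an entry (the file name, topic, and wording here are illustrative, not the exact ones from the patch):

```ini
# help-mtl-psm2.txt -- illustrative opal_show_help() entry; the real
# file name, topic name, and wording in the patch may differ.
[no psm2 cuda env]
Open MPI was built with CUDA support, but the PSM2_CUDA environment
variable is not set.  Performance with GPU buffers may suffer.

  Local host: %s
```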

Contributor Author

ok, will fix this

ompi_mtl_psm2.super.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;

cuda_env = getenv("PSM2_CUDA");
if (!cuda_env || ( strcmp(cuda_env, "0") == 0) )
Member

Is PSM2_CUDA an environment variable that is exported by the PSM2 library? Or is that an env variable that is solely being used by Open MPI? If it is solely used by the PSM2 support in Open MPI, it should be an MCA variable.

Contributor Author

It is an envvar exported by PSM2 library.

Contributor

it is actually not "exported", but "read" by PSM2.

opal_convertor_copy_and_prepare_for_recv( \
ompi_mpi_local_convertor, \
&(datatype->super), \
count, \
addr, \
0, \
flags, \
Member

Nit: since the rest of the code here is scrupulously pretty, you might want to re-indent the \ at the end of this line.

Contributor Author

will fix

@@ -153,6 +158,7 @@ do { \
datatype, \
addr, \
count, \
flags, \
Member

Ditto with above -- you might want to re-indent flags to match the others (beware of tabs vs. spaces -- Open MPI style requires spaces and bans tabs). This happens a few other places in this PR, too.

Contributor Author

Will fix the indentation. Thought I converted tabs to spaces, but I'll double check.

@aravindksg
Contributor Author

@jsquyres Since master and the v3.0.x branch already pulled these changes in, shall I generate separate PRs to fix the warning message and indentation issues there?

@jsquyres
Member

Do a separate PR for master to make the changes. Once that PR is merged in, you can cherry pick the commit(s) from that master PR to this PR (i.e., just add the commits here -- no need to make a 2nd PR).

Make sense?

@aravindksg
Contributor Author

@jsquyres , a patch to address your concerns has been merged to master and cherry-picked here. Could you also please review?

@@ -45,3 +45,7 @@ Unknown path record query mechanism %s. Supported mechanisms are %s.
#
[message too big]
Message size %llu bigger than supported by PSM2 API. Max = %llu
#
[no psm2 cuda env]
Using CUDA enabled OpenMPI but PSM2_CUDA environment variable is %s.
Member

Open MPI is two words, not one.

Also, please use full sentences and good grammar in these messages. These are meant to be formal help messages displayed to the user, not "good enough" messages that are only intended for developers. Also, the way you are adding an additional lengthy string via the call to opal_show_help() is... unconventional.

More below, where you actually call opal_show_help().

"Host buffers,\nthere will be a performance penalty"
" due to OMPI force setting this variable now.\n"
"Set environment variable to 0 if using Host buffers" );
setenv("PSM2_CUDA", "1", 0);
Member

Can't this lead to a segv if you put the "1" string into the environment, but then the PSM2 MTL is dlclose()d (because the PSM2 MTL was not selected)?

Contributor Author

It seems to be a harmless setting. I tested this by increasing the OFI MTL priority to be higher than PSM2, and I did see the environment variable being set, but the application itself worked fine. Also, the PSM2_DEVICES envvar setting is now done in the component registration phase (PR #3834).

Member

I'm not 100% sure it's harmless. I think you may be leaving now-unallocated memory in the environ array when the PSM MTL is unloaded. It probably won't cause problems, but if anyone goes to try to read it -- especially if that virtual address is no longer valid -- it may cause problems. You might be able to test that a little more rigorously (this is one reason that passing values around in the environment is kinda sucky).

If I'm right, it may be best to strdup() the "1", even though that technically makes a minor memory leak.

And/or you might defer this setenv until you know that PSM2 is going to be used. Then you will have a much smaller window for the bogus memory pointer in environ to cause a problem (i.e., because it won't become bogus until the PSM MTL is closed during MPI_FINALIZE, but that's usually on the way out of the process, anyway).

"not set",
"Host buffers,\nthere will be a performance penalty"
" due to OMPI force setting this variable now.\n"
"Set environment variable to 0 if using Host buffers" );
Member

Do not use "OMPI" in a show help string -- the product name is "Open MPI" (two words).

I would suggest that you should have 2 different show_help messages here, rather than feeding in a lengthy string to opal_show_help().

Additionally -- this message is being displayed during the component registration phase. Isn't that far too early? I.e., won't it display on systems that don't even have PSM2 and/or CUDA hardware available?

I would think that you should only display these messages if a) the PSM2 MTL is actually selected and b) there are CUDA devices in the system. Otherwise, if this message is emitted if a) the user doesn't have PSM2 hardware or b) doesn't have CUDA devices, the user will (rightfully) be quite confused. For example, you might replace this message with:

Warning: Open MPI has detected both PSM2-based hardware and CUDA hardware, but the PSM2_CUDA environment variable was not set. Open MPI has therefore defaulted to setting PSM2_CUDA=1.

This can lead to a performance penalty, however, if your application uses host buffers. You should set the PSM2_CUDA environment variable to 0 before invoking mpirun to both silence this warning and provide the hint to Open MPI / PSM2 that your application will mainly be using host buffers.

Local hostname: %s

(and include the local hostname in the message, just to let the user know which machine Open MPI is talking about)

You can have a similar message for the 2nd opal_show_help(), below.

Also, keep in mind that the component registration function is really only intended to be used to register MCA vars -- it is not actually intended to do anything in terms of initialization or setup.

If you really really really have to set the PSM2_CUDA environment variable all the way up here in the component registration function, you should include a lengthy comment about why you have to do so. And then you'll need to set some kind of state variable (on the component?) indicating that you did so, so that you can emit an appropriate show_help message later (i.e., if/when the PSM2 MTL is selected to be used).

Make sense?

Contributor Author

@jsquyres , will fix the help text output in separate commit against master and then cherry pick here.

@open-mpi open-mpi deleted a comment from bosilca Oct 5, 2017
@rhc54
Contributor

rhc54 commented Oct 6, 2017

FWIW: our normal procedure is to use opal_setenv to set it, mark in a flag that we did so, and then use opal_unsetenv to unset it in the close function. This protects the user from what @jsquyres describes.

@jsquyres
Member

jsquyres commented Oct 6, 2017

Oh, yes, unsetenving it in the component close function is a much better idea. 😄

aravindksg and others added 3 commits October 27, 2017 09:16
PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.

This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.

(cherry picked from commit 2e83cf1)
Signed-off-by: Aravind Gopalakrishnan <[email protected]>

Conflicts:
	opal/datatype/opal_convertor.h
If Open MPI is configured with CUDA, then user also should be using a CUDA build of
PSM2 and therefore be setting PSM2_CUDA environment variable to 1 while using
CUDA buffers for transfers. If we detect this setting to be missing, force set
it. If user wants to use this build for regular (Host buffer) transfers, we
allow the option of setting PSM2_CUDA=0, but print a warning
message to user that it is not a recommended usage scenario.

(cherry picked from commit f8a2b7f)
Signed-off-by: Aravind Gopalakrishnan <[email protected]>

Conflicts:
	ompi/mca/mtl/psm2/mtl_psm2_component.c
Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 1daa80d)

Conflicts:
	ompi/mca/mtl/psm2/mtl_psm2_component.c
The messages should be printed only in the event of CUDA builds and in the
presence of supporting hardware and when PSM2 MTL has actually been selected
for use. To this end, move help text output to component init phase.

Also use opal_setenv/unsetenv() for safer setting, unsetting of the environment
variable and sanitize the help text message.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
(cherry picked from commit bea4503)

Conflicts:
	ompi/mca/mtl/psm2/mtl_psm2_component.c
@aravindksg
Contributor Author

@jsquyres : The concerns have been addressed in PR #4323 and cherry-picked here. Please review.

@jsquyres
Member

bot:ompi:retest

Member

@jsquyres jsquyres left a comment

I have only read the code; it isn't possible for me to compile or test it.

@aravindksg
Contributor Author

@jsquyres : To allay any fears regarding testing: I have compile-tested this and also ran combinations of both CUDA and non-CUDA workloads against the resulting libmpi. They work fine 👍

@jsquyres
Member

@aravindksg I figured. I just wanted to qualify my review. 😄

@jsquyres
Member

@hppritcha I'm good with this PR.

@hppritcha hppritcha merged commit b566ef8 into open-mpi:v2.x Oct 31, 2017
@aravindksg aravindksg deleted the v2.x branch October 31, 2017 23:40