
Conversation

@hoopoepg
Contributor

  • If UCX memory hooks cannot be used, fall back to OPAL memory hooks

fixes #9859

@devreal
Contributor

devreal commented Jan 31, 2022

@hoopoepg I can confirm that the warning is gone. However, performance is as bad as with the warning:

With the warning on v5.0.x:

$ mpirun -n 2 -N 1 --mca pml ucx --mca btl uct  ./mpi/pt2pt/osu_latency
[r41c1t7n4:540459] ../../../../../opal/mca/common/ucx/common_ucx.c:390  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[r41c1t7n3:861732] ../../../../../opal/mca/common/ucx/common_ucx.c:390  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.15
1                       1.14
2                       1.14
4                       1.15
8                       1.15
16                      1.15
32                      1.25
64                      1.33
128                     1.38
256                     1.75
512                     1.94
1024                    2.34
2048                    2.30
4096                    2.85
8192                    3.41
16384                   4.75
32768                   6.30
65536                   8.98
131072                 14.69
262144                 95.92
524288                119.52
1048576               163.61
2097152               249.98
4194304               440.16

With suggested --mca opal_common_ucx_opal_mem_hooks 1:

$ mpirun -n 2 -N 1 --mca pml ucx --mca btl uct --mca opal_common_ucx_opal_mem_hooks 1  ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.15
1                       1.14
2                       1.14
4                       1.15
8                       1.15
16                      1.15
32                      1.27
64                      1.34
128                     1.38
256                     1.83
512                     1.96
1024                    2.37
2048                    2.31
4096                    2.87
8192                    3.45
16384                   4.90
32768                   6.50
65536                   9.40
131072                 15.36
262144                 16.16
524288                 27.31
1048576                49.70
2097152                94.23
4194304               183.20

With this PR:

$ mpirun -n 2 -N 1 --mca pml ucx ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.15
1                       1.15
2                       1.15
4                       1.16
8                       1.16
16                      1.16
32                      1.26
64                      1.35
128                     1.39
256                     1.81
512                     1.95
1024                    2.34
2048                    2.32
4096                    2.87
8192                    3.45
16384                   4.75
32768                   6.45
65536                   9.33
131072                 15.31
262144                 96.10
524288                119.81
1048576               165.42
2097152               252.12
4194304               440.81

This PR and btl/uct disabled:

$ mpirun -n 2 -N 1 --mca pml ucx --mca btl ^uct  ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       1.19
1                       1.18
2                       1.18
4                       1.18
8                       1.20
16                      1.18
32                      1.25
64                      1.35
128                     1.39
256                     1.81
512                     1.94
1024                    2.33
2048                    2.39
4096                    2.85
8192                    3.52
16384                   4.77
32768                   6.33
65536                   9.06
131072                 14.71
262144                 16.39
524288                 27.60
1048576                49.87
2097152                94.69
4194304               184.31

Notice the difference at 4 MB: roughly 183 us vs. 440 us. Is there no way we can fix the performance degradation?

@hoopoepg
Contributor Author

@yosefe it seems UCX failed to initialize the rcache. There is no graceful way to fall back to OPAL hooks. Maybe enable OPAL hooks by default? WDYT?

@yosefe
Contributor

yosefe commented Jan 31, 2022

@yosefe it seems UCX failed to initialize the rcache. There is no graceful way to fall back to OPAL hooks. Maybe enable OPAL hooks by default? WDYT?

Maybe we should start by testing for external (OPAL) events, and if they already exist, not install UCX events.
Also, register the external events handler anyway, in case a module loaded after UCX overwrites UCX's events.
To clarify: under this suggestion we would never actually install OPAL events in pml_ucx; we would only test for them and register a callback.

@hoopoepg
Contributor Author

Maybe we should start by testing for external (OPAL) events, and if they already exist, not install UCX events. Also, register the external events handler anyway, in case a module loaded after UCX overwrites UCX's events. To clarify: under this suggestion we would never actually install OPAL events in pml_ucx; we would only test for them and register a callback.

That will not work: OPAL events are installed after UCX installs its own hooks.

@yosefe
Contributor

yosefe commented Jan 31, 2022

That will not work: OPAL events are installed after UCX installs its own hooks.

So that means the UCX events test should work, right?
Can we always register the OPAL memory callback in pml/ucx and assume that, if OPAL events are installed after UCX, they will call this callback?

@hoopoepg
Contributor Author

OPAL sets its hooks when the framework is opened. If we subscribe to OPAL hooks and UCX then sets its own hooks, we may break the OPAL hooks completely, and btl/uct may fail.

@hoopoepg hoopoepg closed this Feb 1, 2022
@hoopoepg hoopoepg force-pushed the topic/ucx-fallback-to-opal-hooks branch from 5daf180 to de4fe77 Compare February 1, 2022 11:02
@hoopoepg hoopoepg reopened this Feb 1, 2022
@hoopoepg
Contributor Author

hoopoepg commented Feb 1, 2022

@devreal it seems the only solution for this issue is to use OPAL hooks by default.
Could you run your tests on the updated PR? I reopened it with the updated fix.
Thank you

- enable OPAL memory hooks by default to provide compatibility
  with other transports

Signed-off-by: Sergey Oblomov <[email protected]>
@hoopoepg hoopoepg force-pushed the topic/ucx-fallback-to-opal-hooks branch from eec2955 to 62f6be4 Compare February 3, 2022 12:02
@hoopoepg hoopoepg requested a review from yosefe February 3, 2022 12:02
@devreal
Contributor

devreal commented Feb 3, 2022

I'm still trying to test this PR but had issues with the machine. Will try again today.

@janjust
Contributor

janjust commented Feb 7, 2022

@devreal Did you get a chance to test it? I'd like to merge this into v5.x


Development

Successfully merging this pull request may close these issues.

Warning: UCX is unable to handle VM_UNMAP