Topic/large msg #1177

bosilca · 2015-12-03T15:51:25Z

MPICH large_type_sendrec hangs with TCP BTL (#1174)

lanl-ompi · 2015-12-03T16:03:53Z

Test FAILed.

jsquyres · 2015-12-03T18:23:27Z

bot:retest

ggouaillardet · 2015-12-04T12:11:54Z

It now works for me with --mca btl tcp,self, but partly because the message is split over 2 or 3 BTLs.
If I explicitly use one interface (--mca btl_tcp_if_include eth0), it hangs.
The reason is that writev with a 4GB message size always returns 0, regardless of whether I use lo, eth0 or ib0.
This is a RHEL 6 kernel (I suspect we hit an internal limit that might be different on another kernel).

bosilca · 2015-12-04T22:02:55Z

We either need a way to detect the limits supported by writev/readv, or we need to force fragments to be smaller than 2^31 bytes on all OSes.

jsquyres · 2015-12-06T19:16:58Z

@bosilca Are you talking about a configure test to figure out the member type of struct iovec.iov_len? I.e., are you concerned about older OS's that don't use a size_t?

bosilca · 2015-12-06T19:19:48Z

No, I am concerned about OS support for the total amount of data that can be sent with a single writev operation. We have already discovered 2 cases where the outcome is not exactly what we expected (@ggouaillardet's Red Hat version and OS X).

jsquyres · 2015-12-06T21:01:27Z

@bosilca Ah. Is there a configure-test way to figure this out, perchance? I dislike hard-coding for "in OS ABC, set the limit to X, in OS DEF, set the limit to Y, ...etc."

bosilca · 2015-12-06T21:11:54Z

Not that I know of. That was what my previous comment was about: either we come up with a test, or we prevent the TCP BTL from handling more than the currently known minimum (2^31-1 bytes) per writev.
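The second option amounts to trimming each iovec array so that its lengths never sum to more than a fixed cap before writev is called. A minimal sketch of that idea follows. The helper `example_clamp_iovec` and the macro `EXAMPLE_MAX_WRITEV_BYTES` are hypothetical names used for illustration, not part of the TCP BTL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Assumed cap: the "currently known minimum" quoted in the discussion. */
#define EXAMPLE_MAX_WRITEV_BYTES ((size_t)INT32_MAX)

/*
 * Copy entries from src into dst until max_bytes is reached, shortening
 * the last copied entry if needed. dst must have room for cnt entries.
 * Returns the number of entries stored in dst; *covered receives the
 * number of bytes those entries describe.
 */
static int example_clamp_iovec(const struct iovec *src, int cnt,
                               struct iovec *dst, size_t max_bytes,
                               size_t *covered)
{
    size_t left = max_bytes;
    int n = 0;

    assert(NULL != src && NULL != dst && NULL != covered);
    for (int i = 0; i < cnt && left > 0; i++) {
        if (0 == src[i].iov_len) {
            continue;                 /* nothing to send for this entry */
        }
        dst[n] = src[i];
        if (dst[n].iov_len > left) {
            dst[n].iov_len = left;    /* truncate the entry that crosses the cap */
        }
        left -= dst[n].iov_len;
        n++;
    }
    *covered = max_bytes - left;
    return n;
}
```

The caller would pass dst to writev and keep the bytes beyond `*covered` for a later call.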

jsquyres · 2015-12-06T21:25:12Z

Can we open /dev/null and try to write 2^31+1 bytes?

bosilca · 2015-12-07T02:44:33Z

The following code works on OS X. However, I confirmed before that writev with a sum of iov_len larger than 2^32 returns EINVAL.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main( void )
{
    int fd, err, iovcnt;
    ssize_t rc;
    char filename[] = "/dev/null";
    struct iovec iov[2];

    fd = open(filename, O_WRONLY);
    if( fd < 0 ) {
        printf("Could not open file %s\n", filename);
        exit(EXIT_FAILURE);
    }

    iovcnt = 2;

    /* First entry: a few bytes, so that the total exceeds 2^32. */
    iov[0].iov_len = strlen(filename);
    iov[0].iov_base = filename;

    /* Second entry: exactly 2^32 bytes (assumes a 64-bit size_t). */
    iov[1].iov_len = (size_t)1 << 32;
    iov[1].iov_base = malloc(iov[1].iov_len);
    if( NULL == iov[1].iov_base ) {
        printf("Could not allocate %zu bytes\n", iov[1].iov_len);
        close(fd);
        exit(EXIT_FAILURE);
    }

    rc = writev(fd, iov, iovcnt);
    if( rc < 0 ) {
        err = errno;
        switch(err) {
          case EINVAL:
              printf("Got error EINVAL %s\n", strerror(err));
              break;
          default:
              printf("Got error %d %s\n", err, strerror(err));
        }
    } else {
        /* Report how much the OS accepted in a single call. */
        printf("writev wrote %zd bytes\n", rc);
    }
    free(iov[1].iov_base);
    close(fd);
    return 0;
}

jsquyres · 2016-01-26T16:26:04Z

Per the discussion on 26 Jan 2016:

OS X and BSD man pages state that the type is int32_t, not size_t (i.e., it's not an OS X-specific problem)
This PR changes our internal type from size_t to int32_t, but it doesn't check for overflow / loop around writev(2) to segment writing the full message
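For reference, the missing "loop around writev(2)" could look roughly like the sketch below. It assumes a blocking descriptor and a single contiguous buffer, and `example_writev_segmented` is a hypothetical name rather than anything in the PR. A return of 0 from writev is treated as an error here, so a case like the one reported above on RHEL 6 would fail rather than spin forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write len bytes from buf to fd, never handing more than cap bytes to a
 * single writev call and resuming after partial writes.
 * Returns 0 on success, -1 on error (errno is set).
 */
static int example_writev_segmented(int fd, const void *buf, size_t len,
                                    size_t cap)
{
    const char *p = buf;
    size_t left = len;

    assert(cap > 0);
    while (left > 0) {
        struct iovec iov;
        iov.iov_base = (void *)p;
        iov.iov_len = left < cap ? left : cap;   /* one segment per call */

        ssize_t rc = writev(fd, &iov, 1);
        if (rc < 0) {
            if (EINTR == errno) {
                continue;                        /* interrupted, retry */
            }
            return -1;
        }
        if (0 == rc) {
            errno = EIO;                         /* no progress: bail out */
            return -1;
        }
        p += rc;                                 /* skip what was accepted */
        left -= (size_t)rc;
    }
    return 0;
}
```

A non-blocking socket would additionally need to handle EAGAIN by returning to the event loop instead of failing.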

ibm-ompi · 2016-05-21T00:35:07Z

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

jsquyres

Please add a Signed-off-by line to this PR's commit.

bosilca · 2017-04-28T05:22:25Z

@ggouaillardet, @jsquyres and @hjelmn please review and push. This fixes the long-pending issue with large data transfers over TCP. It also fixes an issue that disabled the RDMA protocol for BTLs without a registration function.

bosilca · 2017-08-31T18:08:42Z

@ggouaillardet, @hjelmn and @bwbarrett please comment or merge.

bwbarrett · 2017-08-31T18:11:00Z

@jsquyres, mind updating your review (you wanted signed-off-by, George added them)?

bwbarrett

Couple of minor comments, mainly confused by history...

bwbarrett · 2017-08-31T18:25:00Z

ompi/mca/pml/ob1/pml_ob1_recvreq.c

-         * != NULL. This new check is equivalent. Note: I feel this protocol
-         * needs work to better improve resource usage when running with a
-         * leave pinned protocol. */
-        if (btl->btl_register_mem && (btl->btl_rdma_pipeline_frag_size != 0) &&


Does anyone know why there used to be a check about register mem here? I honestly can't figure it out. @bosilca / @hjelmn?

I assumed that someone wanted to prevent pipelining if we have support for RMA. For the problem at hand, if we don't enforce this limit we will need to alter the internals of the btl_tcp_send to cope with very large buffers.

bwbarrett · 2017-08-31T18:26:31Z

opal/mca/btl/tcp/btl_tcp_component.c

-    mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = INT_MAX;
+    /* Some OSes have hard coded limits on how many bytes can be manipulated by each writev operation.
+     * Force a reasonable limit, to prevent overflowing a 32-bit integer (limit comes from BSD and OS X) */
+    mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = ((1UL<<31) - 1024);


This really needs to be a #define and also, I'm not sure I can figure out exactly why the - 1024 from the comment; a bit more context as to why that instead of INT_MAX would be super useful.

(1UL<<31) was found by empirically testing on different systems and then taking the lowest value (OS X). The 1024 was to have some slack before the strict limit imposed by some OSes. This number can certainly be lowered to the largest PML header, but in this particular instance I don't think such accuracy is necessary.

Can you add that to the comment? Otherwise, the whole series looks good to me.

Comment has been updated to reflect the randomness of choice.
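To illustrate the request for a #define, the limit and its slack could be named as in the sketch below. All macro and function names are hypothetical; only the values (2^31 and 1024) come from the diff above. The fragment-count helper is an assumption about how a sender might use the limit, not the ob1 code.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest per-call writev limit found empirically (OS X), per the thread. */
#define EXAMPLE_WRITEV_LIMIT            (UINT64_C(1) << 31)
/* Slack kept below the hard limit; only needs to cover the largest header. */
#define EXAMPLE_WRITEV_SLACK            UINT64_C(1024)
#define EXAMPLE_TCP_PIPELINE_FRAG_SIZE  (EXAMPLE_WRITEV_LIMIT - EXAMPLE_WRITEV_SLACK)

/* The fragment size must stay representable in a signed 32-bit integer. */
static_assert(EXAMPLE_TCP_PIPELINE_FRAG_SIZE < INT32_MAX,
              "fragment size must fit in an int32_t");

/* Number of pipeline fragments needed for a message of msg_len bytes. */
static uint64_t example_fragment_count(uint64_t msg_len)
{
    return msg_len / EXAMPLE_TCP_PIPELINE_FRAG_SIZE
         + (0 != msg_len % EXAMPLE_TCP_PIPELINE_FRAG_SIZE);
}
```

With these values the fragment size is 2147482624 bytes, so a 2^32-byte message would be sent as three fragments.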

bwbarrett · 2017-08-31T18:28:26Z

ompi/mca/pml/ob1/pml_ob1_sendreq.h

+    int32_t  req_state;
+    int32_t  req_lock;
+    bool     req_throttle_sends;
+    int32_t  req_pipeline_depth;


Why is req_pipeline_depth different from the other variables that require ADD_SIZE_T atomics? It seems like either we have a more general problem (which I could believe) or we don't have a problem that needs fixing in this commit...

There is no need to have it as a size_t as it counts the number of descriptors inflight (4 by default).
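As a rough illustration of why a small signed counter is enough here, the sketch below tracks in-flight descriptors with a 32-bit atomic. It uses C11 <stdatomic.h> rather than Open MPI's own atomic wrappers, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the send request; only the counter is shown. */
typedef struct {
    _Atomic int32_t pipeline_depth;   /* descriptors currently in flight */
} example_sendreq_t;

/* Reserve one pipeline slot unless max_depth slots are already in use. */
static bool example_try_reserve(example_sendreq_t *req, int32_t max_depth)
{
    int32_t cur = atomic_load(&req->pipeline_depth);
    while (cur < max_depth) {
        /* On failure, cur is refreshed with the current value and retried. */
        if (atomic_compare_exchange_weak(&req->pipeline_depth, &cur, cur + 1)) {
            return true;
        }
    }
    return false;
}

/* Release one slot and return the new depth. */
static int32_t example_release(example_sendreq_t *req)
{
    int32_t prev = atomic_fetch_sub(&req->pipeline_depth, 1);
    assert(prev > 0);                 /* a signed type makes underflow visible */
    return prev - 1;
}
```

With a default depth of 4, the counter never approaches the range of an int32_t, and a value below zero would immediately flag an accounting bug.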

@ggouaillardet

as the writev and readv support a sum larger than a uint32_t this version will work. For the other OSes a different patch is required. This patch is a slight modification of the one proposed by @ggouaillardet. Signed-off-by: George Bosilca <[email protected]>

Some OSes have hardcoded limits to prevent overflowing over an int32_t. We can either detect this at configure (which might be a nicer but incomplete solution), or always force the pipelined protocol over TCP. As it only covers data larger than 1GB, no performance penalty is to be expected. Signed-off-by: George Bosilca <[email protected]>

bosilca · 2017-09-05T15:38:23Z

A long-term solution would be to check whether the OS supports large writes and adapt the code accordingly. A similar issue is discussed in the context of MPI IO in #2399.

bwbarrett · 2017-09-05T17:27:50Z

@bosilca, were you going to merge this version into master?

bosilca · 2017-09-05T17:29:48Z

Yes, this provides a partial fix until we have the configure check.

bosilca force-pushed the topic/large_msg branch from 104f6df to aecba4f Compare December 3, 2015 15:54

bosilca mentioned this pull request Dec 3, 2015

MPICH large_type_sendrec hangs with TCP BTL #1174

Closed

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016

Merge pull request open-mpi#1177 from ggouaillardet/topic/v1.10/solar…

e83f9bd

…is_studio_dev autogen: patch configure in order to correctly detect Solaris Studio …

jsquyres requested changes Oct 25, 2016

bosilca force-pushed the topic/large_msg branch from cedf99d to 8176806 Compare April 27, 2017 21:53

bosilca force-pushed the topic/large_msg branch from cde0e6f to e79f5b1 Compare August 31, 2017 14:45

jsquyres approved these changes Aug 31, 2017

bwbarrett reviewed Aug 31, 2017

bosilca force-pushed the topic/large_msg branch from e79f5b1 to b5917d7 Compare September 1, 2017 22:52

bosilca added 5 commits September 1, 2017 18:52

Be consistent for atomic operations and add an entity

4db3730

of the same type. Signed-off-by: George Bosilca <[email protected]>

Make the pipeline depth an int instead of a size_t. While

050bd3b

they are supposed to be unsigned, casting them to a signed value for all atomic operations is as errorprone as handling them as signed entities. Signed-off-by: George Bosilca <[email protected]>

Always abide to the RDMA pipeline limit.

866899e

Signed-off-by: George Bosilca <[email protected]>

bosilca force-pushed the topic/large_msg branch from b5917d7 to d10522a Compare September 1, 2017 22:53

bwbarrett approved these changes Sep 2, 2017

bosilca merged commit dc538e9 into open-mpi:master Sep 5, 2017

bosilca deleted the topic/large_msg branch October 4, 2017 22:18
