Skip to content

Topic/large msg #1177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 5, 2017
Merged

Topic/large msg #1177

merged 5 commits into from
Sep 5, 2017

Conversation

bosilca
Copy link
Member

@bosilca bosilca commented Dec 3, 2015

MPICH large_type_sendrec hangs with TCP BTL (#1174)

@lanl-ompi
Copy link
Contributor

Test FAILed.

@jsquyres
Copy link
Member

jsquyres commented Dec 3, 2015

bot:retest

@ggouaillardet
Copy link
Contributor

now it works for me with --mca btl top,self but partially because the message is split over 2 or 3 btls
If I explicitly uses one interface --mca btl_tcp_if_include eth0 it hangs.
The reason is writev with a 4GB message size always returns 0, regardless I use lo, eth0 or ib0
This is a rhel6 kernel (I suspect we hit an internal limit that might be different on an other kernel)

@bosilca
Copy link
Member Author

bosilca commented Dec 4, 2015

We either need a way to detect what are the limits supported by writev/readv or we need to force on all OSes the fragments to be less than 2^31 bytes.

@jsquyres
Copy link
Member

jsquyres commented Dec 6, 2015

@bosilca Are you talking about a configure test to figure out the member type of struct iovec.iov_len? I.e., are you concerned about older OS's that don't use a size_t?

@bosilca
Copy link
Member Author

bosilca commented Dec 6, 2015

No, I am concerned about the OS support with regard to the total amount of data that can be sent with a single writev operation. We already have discovered 2 cases, where the outcome is not exactly what we expected (@ggouaillardet RedHat version and OS X).

@jsquyres
Copy link
Member

jsquyres commented Dec 6, 2015

@bosilca Ah. Is there a configure-test way to figure this out, perchance? I dislike hard-coding for "in OS ABC, set the limit to X, in OS DEF, set the limit to Y, ...etc."

@bosilca
Copy link
Member Author

bosilca commented Dec 6, 2015

Not that I know of. That was what my previous comment was about, either we come up with a test, or we prevent the TCP BTL of handling more that the currently known minimum (2^31-1 bytes) per writev.

@jsquyres
Copy link
Member

jsquyres commented Dec 6, 2015

Can we open /dev/null and try to write 2^31+1 bytes?

@bosilca
Copy link
Member Author

bosilca commented Dec 7, 2015

The following code works on OS X. However, I confirmed before that writev with a sum of iov_len larger than 2^32 return EINVAL.

#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    int fd, err, iovcnt;
    ssize_t rc;
    char filename[] = "/dev/null";
    struct iovec iov[2];

    fd = open(filename, O_WRONLY);
    if( fd < 0 ) {
        printf("Could not open file %s\n", filename);
        exit(-1);
    }

    iovcnt = 2;

    iov[0].iov_len = strlen(filename);
    iov[0].iov_base = filename;

    iov[1].iov_len = (size_t)pow(2, 32);
    iov[1].iov_base = malloc(iov[1].iov_len * sizeof(char));

    rc = writev(fd, iov, iovcnt);
    if( rc < 0 ) {
        err = errno;
        switch(err) {
          case EINVAL:
              printf("Got error EINVAL %s\n", strerror(err));
              break;
          default:
              printf("Got error %d %s\n", err, strerror(err));
        }
    }
    close(fd);
    return 0;
}

@jsquyres
Copy link
Member

Per discussion on the 26 Jan 2016:

  • OS X and BSD man pages state that the type is int32_t, not size_t (i.e., it's not an OS X-specific problem)
  • This PR changes our internal type from size_t to int32_t, but it doesn't check for overflow / loop around writev(2) to segment writing the full message

@ibm-ompi
Copy link

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
…is_studio_dev

autogen: patch configure in order to correctly detect Solaris Studio …
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Signed-off-by line to this PR's commit.

@bosilca
Copy link
Member Author

bosilca commented Apr 28, 2017

@ggouaillardet, @jsquyres and @hjelmn please review and push. This fixed the long pending issue with large data transfers over TCP. It also fixes an issue that disabled the RDMA protocol for BTL without a registration function.

@bosilca
Copy link
Member Author

bosilca commented Aug 31, 2017

@ggouaillardet, @hjelmn and @bwbarrett please comment or merge.

@bwbarrett
Copy link
Member

@jsquyres, mind updating your review (you wanted signed-off-by, George added them)?

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor comments, mainly confused by history...

* != NULL. This new check is equivalent. Note: I feel this protocol
* needs work to better improve resource usage when running with a
* leave pinned protocol. */
if (btl->btl_register_mem && (btl->btl_rdma_pipeline_frag_size != 0) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone know why there used to be a check about register mem here? I honestly can't figure it out. @bosilca / @hjelmn?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that someone wanted to prevent pipelining if we have support for RMA. For the problem at hand, if we don't enforce this limit we will need to alter the internals of the btl_tcp_send to cope with very large buffers.

mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = INT_MAX;
/* Some OSes have hard coded limits on how many bytes can be manipulated by each writev operation.
* Force a reasonable limit, to prevent overflowing a 32-bit integer (limit comes from BSD and OS X) */
mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = ((1UL<<31) - 1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really needs to be a #define and also, I'm not sure I can figure out exactly why the - 1024 from the comment; a bit more context as to why that instead of INT_MAX would be super useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1UL<<31) was found by empirically testing on different systems and then taking the lowest value (OS X). The 1024 was to have some slack before the strict limit imposed by some OSes. This number can certainly be lowered to the largest PML header, but in this particular instance I don't think such accuracy is necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that to the comment? Otherwise, the whole series looks good to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment has been updated to reflect the randomness of choice.

int32_t req_state;
int32_t req_lock;
bool req_throttle_sends;
int32_t req_pipeline_depth;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is req_pipeline depth different than the other variables that require ADD_SIZE_T atomics? It seems like either we have a more general problem (which I could believe) or we don't have a problem that needs fixing in this commit...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to have it as a size_t as it counts the number of descriptors inflight (4 by default).

of the same type.

Signed-off-by: George Bosilca <[email protected]>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.

Signed-off-by: George Bosilca <[email protected]>
they are supposed to be unsigned, casting them to a signed
value for all atomic operations is as errorprone as handling
them as signed entities.

Signed-off-by: George Bosilca <[email protected]>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.

Signed-off-by: George Bosilca <[email protected]>
@bosilca
Copy link
Member Author

bosilca commented Sep 5, 2017

A long-term solution would be to check if the OS support large writes and adapt the code accordingly. A similar issue is discussed in the context of MPI IO in #2399.

@bwbarrett
Copy link
Member

@bosilca, were you going to merge this version into master?

@bosilca
Copy link
Member Author

bosilca commented Sep 5, 2017

Yes, this provides a partial fix until we have the configure check.

@bosilca bosilca merged commit dc538e9 into open-mpi:master Sep 5, 2017
@bosilca bosilca deleted the topic/large_msg branch October 4, 2017 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants