Topic/large msg #1177
104f6df
to
aecba4f
Compare
Test FAILed. |
bot:retest |
now it works for me with |
We need either a way to detect the limits supported by writev/readv, or to force fragments on all OSes to be smaller than 2^31 bytes.
@bosilca Are you talking about a configure test to figure out the member type of |
No, I am concerned about the OS support with regard to the total amount of data that can be sent with a single writev operation. We already have discovered 2 cases, where the outcome is not exactly what we expected (@ggouaillardet RedHat version and OS X). |
@bosilca Ah. Is there a |
Not that I know of. That was what my previous comment was about: either we come up with a test, or we prevent the TCP BTL from handling more than the currently known minimum (2^31-1 bytes) per writev.
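For illustration, the second option (never handing writev more than the known minimum) could look roughly like the sketch below. This is not the code from this PR: `cap_iov`, `capped_writev`, `CAPPED_WRITEV_MAX_BYTES`, and the 16-entry scratch array are hypothetical names and sizes chosen for the example, and the cap is the 2^31-1 value mentioned above rather than anything queried from the OS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumed cap (hypothetical macro): the 2^31-1 minimum discussed in this
 * thread, not a limit reported by the operating system. */
#define CAPPED_WRITEV_MAX_BYTES ((size_t)0x7fffffff)
#define CAPPED_WRITEV_MAX_IOV   16   /* arbitrary scratch size for the sketch */

/* Copy entries of iov[] into out[] until max_bytes is used up, shortening
 * the last copied entry if necessary. Returns the number of entries in out[]. */
static int cap_iov(const struct iovec *iov, int iovcnt,
                   struct iovec *out, size_t max_bytes)
{
    size_t budget = max_bytes;
    int n = 0;

    assert(iov != NULL && out != NULL && iovcnt >= 0);
    for (int i = 0; i < iovcnt && budget > 0; i++) {
        if (iov[i].iov_len == 0) {
            continue;               /* nothing to send from this entry */
        }
        out[n] = iov[i];
        if (out[n].iov_len > budget) {
            out[n].iov_len = budget; /* truncate to stay under the cap */
        }
        budget -= out[n].iov_len;
        n++;
    }
    return n;
}

/* Submit at most CAPPED_WRITEV_MAX_BYTES in one writev call. The caller
 * treats the result like any short write and resubmits the remainder. */
static ssize_t capped_writev(int fd, const struct iovec *iov, int iovcnt)
{
    struct iovec tmp[CAPPED_WRITEV_MAX_IOV];
    int n;

    if (iovcnt < 0 || iovcnt > CAPPED_WRITEV_MAX_IOV) {
        errno = EINVAL;
        return -1;
    }
    n = cap_iov(iov, iovcnt, tmp, CAPPED_WRITEV_MAX_BYTES);
    if (n == 0) {
        return 0;                   /* nothing to write */
    }
    return writev(fd, tmp, n);
}
```

Since a sender already has to cope with short writes on a socket, truncating the request this way only changes how much is submitted per call, not the surrounding progress logic.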
Can we open /dev/null and try to write 2^31+1 bytes? |
The following code works on OS X. However, I confirmed before that writev with a sum of iov_len larger than 2^32 returns EINVAL.

    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdlib.h>
    #include <errno.h>

    int main( int argc, char* argv[] )
    {
        int fd, err, iovcnt;
        ssize_t rc;
        char filename[] = "/dev/null";
        struct iovec iov[2];

        fd = open(filename, O_WRONLY);
        if( fd < 0 ) {
            printf("Could not open file %s\n", filename);
            exit(EXIT_FAILURE);
        }
        iovcnt = 2;
        iov[0].iov_len = strlen(filename);
        iov[0].iov_base = filename;
        /* 2^32 bytes; assumes a 64-bit size_t */
        iov[1].iov_len = (size_t)1 << 32;
        iov[1].iov_base = malloc(iov[1].iov_len);
        if( NULL == iov[1].iov_base ) {
            printf("Could not allocate %zu bytes\n", iov[1].iov_len);
            close(fd);
            exit(EXIT_FAILURE);
        }
        rc = writev(fd, iov, iovcnt);
        if( rc < 0 ) {
            err = errno;
            switch(err) {
            case EINVAL:
                printf("Got error EINVAL %s\n", strerror(err));
                break;
            default:
                printf("Got error %d %s\n", err, strerror(err));
            }
        } else {
            /* Report how much a single writev call accepted */
            printf("writev wrote %zd bytes\n", rc);
        }
        free(iov[1].iov_base);
        close(fd);
        return 0;
    }
Per discussion on the 26 Jan 2016:
|
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. |
…is_studio_dev autogen: patch configure in order to correctly detect Solaris Studio …
Please add a Signed-off-by line to this PR's commit.
cedf99d
to
8176806
Compare
@ggouaillardet, @jsquyres and @hjelmn please review and push. This fixes the long-pending issue with large data transfers over TCP. It also fixes an issue that disabled the RDMA protocol for BTLs without a registration function.
cde0e6f
to
e79f5b1
Compare
@ggouaillardet, @hjelmn and @bwbarrett please comment or merge. |
@jsquyres, mind updating your review (you wanted signed-off-by, George added them)? |
Couple of minor comments, mainly confused by history...
     * != NULL. This new check is equivalent. Note: I feel this protocol
     * needs work to better improve resource usage when running with a
     * leave pinned protocol. */
    if (btl->btl_register_mem && (btl->btl_rdma_pipeline_frag_size != 0) &&
I assumed that someone wanted to prevent pipelining if we have support for RMA. For the problem at hand, if we don't enforce this limit we will need to alter the internals of the btl_tcp_send to cope with very large buffers.
opal/mca/btl/tcp/btl_tcp_component.c
Outdated
    mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = INT_MAX;
    /* Some OSes have hard coded limits on how many bytes can be manipulated by each writev operation.
     * Force a reasonable limit, to prevent overflowing a 32-bit integer (limit comes from BSD and OS X) */
    mca_btl_tcp_module.super.btl_rdma_pipeline_frag_size = ((1UL<<31) - 1024);
This really needs to be a #define and also, I'm not sure I can figure out exactly why the - 1024 from the comment; a bit more context as to why that instead of INT_MAX would be super useful.
(1UL<<31) was found by empirically testing on different systems and then taking the lowest value (OS X). The 1024 was to have some slack before the strict limit imposed by some OSes. This number can certainly be lowered to the largest PML header, but in this particular instance I don't think such accuracy is necessary.
Can you add that to the comment? Otherwise, the whole series looks good to me.
Comment has been updated to reflect the randomness of choice.
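As a rough sketch of the `#define` requested above, with the arithmetic spelled out: (1UL<<31) is 2147483648, so subtracting 1024 bytes of slack gives 2147482624, which stays below INT_MAX (2147483647 where int is 32 bits). The macro names and the `tcp_frag_count` helper are hypothetical and only illustrate the idea; they are not taken from the Open MPI tree.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical names for this sketch, not identifiers from Open MPI. */
#define TCP_WRITEV_EMPIRICAL_LIMIT (1UL << 31)  /* lowest limit found by testing (OS X) */
#define TCP_WRITEV_SLACK           1024UL       /* slack below the strict OS limit */
#define TCP_MAX_FRAG_SIZE          (TCP_WRITEV_EMPIRICAL_LIMIT - TCP_WRITEV_SLACK)

/* The capped fragment size must fit in a signed int on this platform. */
static_assert(TCP_MAX_FRAG_SIZE < INT_MAX, "fragment cap must stay below INT_MAX");

/* Number of pipelined fragments needed to move len bytes under the cap.
 * Written as divide-then-round-up so len near SIZE_MAX cannot overflow. */
static size_t tcp_frag_count(size_t len)
{
    return len / TCP_MAX_FRAG_SIZE + (len % TCP_MAX_FRAG_SIZE != 0);
}
```

With this cap, any message up to 2147482624 bytes still goes out as a single fragment, and only larger transfers are split.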
    int32_t req_state;
    int32_t req_lock;
    bool req_throttle_sends;
    int32_t req_pipeline_depth;
Why is req_pipeline_depth different from the other variables that require ADD_SIZE_T atomics? It seems like either we have a more general problem (which I could believe) or we don't have a problem that needs fixing in this commit...
There is no need to have it as a size_t, as it counts the number of descriptors in flight (4 by default).
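To make the point concrete, here is a minimal sketch of a small in-flight counter kept as a 32-bit atomic. It uses C11 `<stdatomic.h>` as a stand-in for the project's own atomic wrappers, and `send_req_sketch_t`, `sketch_try_acquire`, `sketch_release`, and `SKETCH_MAX_DEPTH` are hypothetical names; only the default depth of 4 comes from the comment above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_DEPTH 4   /* default number of descriptors in flight */

/* Hypothetical request type: only the counter relevant to this discussion. */
typedef struct {
    _Atomic int32_t pipeline_depth;   /* descriptors currently in flight */
} send_req_sketch_t;

/* Reserve one pipeline slot; returns false if the pipeline is already full. */
static bool sketch_try_acquire(send_req_sketch_t *req)
{
    int32_t cur = atomic_load(&req->pipeline_depth);
    while (cur < SKETCH_MAX_DEPTH) {
        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&req->pipeline_depth, &cur, cur + 1)) {
            return true;
        }
    }
    return false;
}

/* Release one slot when a descriptor completes; returns the new depth. */
static int32_t sketch_release(send_req_sketch_t *req)
{
    int32_t prev = atomic_fetch_sub(&req->pipeline_depth, 1);
    assert(prev > 0);
    return prev - 1;
}
```

A counter bounded by a handful of descriptors never approaches the int32_t range, which is why the byte counts need size_t atomics while this field does not.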
e79f5b1
to
b5917d7
Compare
of the same type. Signed-off-by: George Bosilca <[email protected]>
as the writev and readv support a sum larger than a uint32_t this version will work. For the other OSes a different patch is required. This patch is a slight modification of the one proposed by @ggouaillardet. Signed-off-by: George Bosilca <[email protected]>
they are supposed to be unsigned, casting them to a signed value for all atomic operations is as error-prone as handling them as signed entities. Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Some OSes have hardcoded limits to prevent overflowing over an int32_t. We can either detect this at configure (which might be a nicer but incomplete solution), or always force the pipelined protocol over TCP. As it only covers data larger than 1GB, no performance penalty is to be expected. Signed-off-by: George Bosilca <[email protected]>
b5917d7
to
d10522a
Compare
A long-term solution would be to check whether the OS supports large writes and adapt the code accordingly. A similar issue is discussed in the context of MPI IO in #2399.
@bosilca, were you going to merge this version into master? |
Yes, this provides a partial fix until we have the configure check. |
MPICH large_type_sendrec hangs with TCP BTL (#1174)