Skip to content

Conversation

@shijin-aws
Copy link

No description provided.

@shijin-aws
Copy link
Author

comment

@shijin-aws
Copy link
Author

bot:aws:retest

4 similar comments
@shijin-aws
Copy link
Author

bot:aws:retest

@shijin-aws
Copy link
Author

bot:aws:retest

@shijin-aws
Copy link
Author

bot:aws:retest

@shijin-aws
Copy link
Author

bot:aws:retest

Sean Hefty and others added 24 commits December 2, 2020 04:13
The inject size usable by the application must be reduced
by the sizeof(rxm_pkt), in order for the entire data to
fit in the inject packet.

Signed-off-by: Sean Hefty <[email protected]>
Prior to direct send support patches merging, hmem
support added an rxm_mr structure.  That structure
now contains the mem_desc used by the core provider.
And the mem_desc passed to the user of rxm now points
to the rxm_mr structure.

Signed-off-by: Sean Hefty <[email protected]>
This macro is just a wrapper over another macro.  Call
the underlying macro directly to simplify the code.

Signed-off-by: Sean Hefty <[email protected]>
This macro simply calls through to another macro.  Call the
base macro directly to simplify the code.

Signed-off-by: Sean Hefty <[email protected]>
These 2 macros are identical except for an assert.  Merge
them into a single call to simplify the code.

Signed-off-by: Sean Hefty <[email protected]>
This is a long function called from several locations.  Move
it from the header file into a .c file to avoid duplicating
the function 10 times in the compiled source.

Drop the _flags suffix, which isn't needed or useful.

This patch does not change the actual function itself.

Signed-off-by: Sean Hefty <[email protected]>
Shorten the name and change it to better reflect the
behavior being performed, which is a stack allocated
duplication of the iov.

Swap the parameters, so that they align with the OFI
API order.

Signed-off-by: Sean Hefty <[email protected]>
If we don't have a memory descriptor, then inline must be
true regardless of other conditions.

Signed-off-by: Sean Hefty <[email protected]>
Some verbs devices do not support SEND_INLINE if the
iov count is > 1.  E.g. i40iw and irdma both fail to
transfer data beyond iov[0] when the SEND_INLINE flag
is present.

To fix this, we re-work vrb_send_iov so that we always
copy inline data onto the stack prior to posting the WR.
The copy support is there for device buffers, but is
skipped for host memory.  Perform the copy for host
memory as well.  Note that we don't check that the iov
count is > 1 in this call as an optimization, since
there are optimized code paths already for that case.
E.g. vrb_send_buf().

This fixes an issue where RxM uses direct sends to pass
application buffers directly through to the verbs
provider, resulting in only a portion of the data being
received.

Signed-off-by: Sean Hefty <[email protected]>
The multi-recv test handles its own memory registration,
but does not set mr_desc.  This is passed to fi_post_tx
for send operations.  The result is that mr_desc is NULL,
and not valid for providers expecting a usable descriptor.

This problem shows up with the new direct send mode added
to rxm over verbs.

Signed-off-by: Sean Hefty <[email protected]>
at this point a bunch of fixes everywhere
Currently, in function rxr_pkt_handle_receipt_recv(),
tx_entry->bytes_acked > 0 is used to determine whether
long cts protocols are used.

This is wrong because for long cts protocols, receipt can
arrive before the completion of the LONG_RTW/LONG_RTM
and DATA packet.

If that happens, tx_entry->bytes_acked would be 0 even for
long cts protocols.

Given the above situation, this patch does the following
changes:

1. Introduced a flag called "RXR_LONGCTS_PROTOCOL", which
shows whether the sender side is using long cts protocols

2. In rxr_pkt_handle_receipt_recv(), we use the
RXR_LONGCTS_PROTOCOL flag to determine whether long cts
protocols are used instead of checking tx_entry->bytes_acked.

Signed-off-by: Ao Li <[email protected]>
    
The travis OS X build hangs, printing a warning that the
default xcode is no longer supported by the app store.  Force
update to the latest OS X version as of this patch.

Signed-off-by: Sean Hefty <[email protected]>
prov/efa: Fix a Bug in rxr_pkt_handle_receipt_recv
travis: set osx_image to a later (supported) version
travis: Fixup travis.yml file syntax
Reverts commit cd459ce.

The bug seems to be somewhere in the L0 caching,
not the protocol.

Signed-off-by: aingerson <[email protected]>
I've recently got the case where libfabric tried to execute
ibv_query_device() with a NULL context (struct ibv_context *),
which caused the process to segv with the following call stack:

ibv_query_device  [/lib/x86_64-linux-gnu/libibverbs.so.1, 0x7ffff7f79000]
vrb_init_info  [xxx, 0x555555554000]
fi_verbs_ini  [xxx, 0x555555554000]
fi_ini  [xxx, 0x555555554000]
fi_getinfo  [xxx, 0x555555554000]

That crash happened because one network port was somehow malfunctioning
and its verbs context was NULL. The root cause of that malfunctioning
port is unknown but rebooting the node seems to have fixed the problem.

Below is an ibstat output for the faulty port:
ibpanic: [38248] main: stat of IB device 'mlx5_3' failed: Resource temporarily unavailable

As a fix, we should just skip the interfaces that report a NULL context
instead of crashing. That's especially true as the faulty network device
is not necessarily the one the end-user wants to use with libfabric.

Signed-off-by: Sylvain Didelot <[email protected]>
prov/verbs: skip devices with NULL contexts
We can not assume CMA always works when sending to self.
In the container context, CMA to self is not permitted because
the syscalls are already blocked by dropping CAP_SYS_PTRACE
https://docs.docker.com/engine/security/seccomp/.

Fix the issue by enabling CMA check for endpoint itself.
The CMA capability is splitted into cma_cap_self and cma_cap_peer.
Since cma_cap_self and cma_cap_peer might be different, we can
not use field 'cma_cap' as struct iovec in the test CMA call
any more, so use field 'pid' instead.

Signed-off-by: Jie Zhang <[email protected]>
Currently rxr_ep_post_ctrl_or_queue() is used to posting REQ
packets, which will push the tx_entry in tx_pending_list if
-FI_EAGAIN was encountered.

This is wrong if the REQ is RTM and shm provider is used as
lower provider, because queuing will mess up message ordering.

It is unecessary for other cases, because application can
handle -FI_EAGAIN.

To address this issue, this patch uses rxr_ep_post_ctrl() to
post REQ packets.

Signed-off-by: Wei Zhang <[email protected]>
Currently, in function rxr_msg_generic_send(). If rxr_msg_post_rtm()
returned error, the tx_entry will be released. In this case, because
rxr_release_tx_entry() does not release memory registration in
tx_entry->mr, the memory registration will be leaked.

This patch address this issue by adding a call to
rxr_tx_entry_mr_dereg() to before calling
rxr_rlease_tx_entry() in the error handling path.

Signed-off-by: Wei Zhang <[email protected]>
Sean Hefty and others added 27 commits February 18, 2021 18:13
Introduce macros that can be used to easily define variables
or fields that should only exist in debug builds.  These
macros replace ifdef's in the code and are expected to be
paired with assert checks.

Update a couple of places where debug fields are defined to
use the macros.

Signed-off-by: Sean Hefty <[email protected]>
There are some differences with non-blocking sockets between Linux and
Windows:
1) connect() on Windows may return EWOULDBLOCK, not EINPROGRESS
2) recv() with MSG_WAITALL flag returns EOPNOTSUPP

Signed-off-by: Andrey Lobanov <[email protected]>
Destroy rxm_domain->amo_bufpool_lock in rxm_domain_close and in
error handling code to avoid resource leak.

Signed-off-by: Andrey Lobanov <[email protected]>
prov/efa: exclude EFA when FI_HMEM and FI_LOCAL_COMM are selected
ci/travis: Add psm3 to Travis CI
prov/tcp: fixed hang on windows
Before building the provider library, 3 steps are inserted into the make
dependency tree:
 1) link all objects/archives in provider into single libpsm3_full.lo
 2) using objcopy tool, remove all but fi_psm3_ini from global symbols
 3) repack new .o file into libpsm3.a (and make libpsm3.la wrapper)

This will effectively isolate the psm3 provider code to not allow name
collisions with other providers with similar function names when linked
into single libfabric library. As well as preserve function names for
debugging.

Signed-off-by: Goldman, Adam <[email protected]>
hmem: Fix rocr_is_addr_valid() check for AMD GPU memory
Add magic values to verify that users do not access buffers
outside of the exposed areas.

Signed-off-by: Sean Hefty <[email protected]>
prov/rxm: cleanup rxm_domain->amo_bufpool_lock
prov/psm3: Only export fi_psm3_ini when building into libfabric
remove the mlx and mxm sections

Signed-off-by: Sean Hefty <[email protected]>
bufpool: check for buffer overruns, idm: add callback to handle cleanup
readme: Remove deprecated providers from readme.md
Revert "prov/psm3: Disable psm3 provider if with-psm2-src is used"
make distcheck fails trying to use a relative path, rather
top_srcdir.

Signed-off-by: Sean Hefty <[email protected]>
Signed-off-by: Sean Hefty <[email protected]>
news: Update news for v1.12 release
Signed-off-by: Goldman, Adam <[email protected]>
Signed-off-by: Robert Wespetal <[email protected]>
prov/psm3: Add missing ifdefs for rv module
news: add EFA updates to 1.12 news
shijin-aws referenced this pull request in shijin-aws/libfabric Feb 25, 2021
Problem reported by Address Sanitizer:

=================================================================
    ==25220==ERROR: AddressSanitizer: heap-use-after-free on address 0x6270000072e0 at pc 0x00010b926a3c bp 0x700001bd1c30 sp 0x700001bd1c28
    READ of size 4 at 0x6270000072e0 thread T4
        #0 0x10b926a3b in sock_conn_listener_thread (libfabric.1.dylib:x86_64+0xdca3b)
        #1 0x7fff7e2d5660 in _pthread_body (libsystem_pthread.dylib:x86_64+0x3660)
        #2 0x7fff7e2d550c in _pthread_start (libsystem_pthread.dylib:x86_64+0x350c)
        #3 0x7fff7e2d4bf8 in thread_start (libsystem_pthread.dylib:x86_64+0x2bf8)

    0x6270000072e0 is located 480 bytes inside of 12944-byte region [0x627000007100,0x62700000a390)
    freed by thread T0 here:
        #0 0x10baf1a9d in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x56a9d)
        #1 0x10b9016bf in sock_ep_close (libfabric.1.dylib:x86_64+0xb76bf)
        #2 0x10b7f4a8f in fi_close fabric.h:593
        #3 0x10b7f4209 in main shared_ctx.c:649
        #4 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

    previously allocated by thread T0 here:
        #0 0x10baf1e27 in wrap_calloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x56e27)
        #1 0x10b906df4 in sock_alloc_endpoint (libfabric.1.dylib:x86_64+0xbcdf4)
        #2 0x10b8f7fdb in sock_msg_ep (libfabric.1.dylib:x86_64+0xadfdb)
        #3 0x10b7f7c93 in fi_endpoint fi_endpoint.h:164
        #4 0x10b7f5e40 in server_connect shared_ctx.c:471
        #5 0x10b7f49ba in run shared_ctx.c:573
        #6 0x10b7f411b in main shared_ctx.c:647
        #7 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

    Thread T4 created by T0 here:
        #0 0x10bae999d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x4e99d)
        #1 0x10b925f9b in sock_conn_start_listener_thread (libfabric.1.dylib:x86_64+0xdbf9b)
        #2 0x10b8e7eb2 in sock_domain (libfabric.1.dylib:x86_64+0x9deb2)
        #3 0x10b7f87d3 in fi_domain fi_domain.h:306
        #4 0x10b7f5c9f in server_connect shared_ctx.c:460
        #5 0x10b7f49ba in run shared_ctx.c:573
        #6 0x10b7f411b in main shared_ctx.c:647
        #7 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

The issue shows up more frequently on OS X, which emulates epoll.  However, I believe the
problem could occur on any platform.

In sock_ep_close, we remove the socket from the epoll fd, then free the endpoint.
However, if the listener thread has received an event on the socket, but has not
yet started processing it, then a race can occur.  The listener thread could have
returned from ofi_epoll_wait, but suspended trying to acquire the signal_lock.
The signal_lock is acquired from sock_ep_close, where ofi_epoll_del is called, then
released.  The endpoint is then freed.  The listener thread can now acquire the
signal_lock, where it will attempt to access the freed endpoint data.

To avoid the race, we add a change boolean to the listener.  That boolean is
only changed while holding the signal_lock.  When a socket is removed from the
epollfd, we mark the listener state as 'changed'.  The listener thread checks the
changed state prior to processing any events.  If set, it clears the state, and
calls ofi_epoll_wait again to get a new set of events to process.

Note that this works for epoll set to level-triggered (poll semantics).
Sockets that reported events will report those same events when wait is called
a second time.  Sockets which were removed from the epoll set would have their
events removed, as they are no longer being monitored.

This fix is applied both to the listener thread and cm thread.

Signed-off-by: Sean Hefty <[email protected]>
shijin-aws referenced this pull request in shijin-aws/libfabric Sep 1, 2021
I'm not entirely sure if it is fixes the issue our QA is seeing
(as they get err_entry.err=-104 - a wrong negative value), but
with error injection I could easily trigger a use-after-free
with the root from this function (with err_entry.err=104, though,
so I still don't know where the wrong error sign came from).

In my error injection reproducer ofi_send_socket() fails sometimes,
which then triggers free of cm_ctx without removing the fd and
cm_ctx from polling. Next poll round will then access cm_ctx and
trigger a use-after-free.

client_send_connreq
    tx_cm_data
        ofi_send_socket -> fails
    goto err
    ...
err:
    free(cm_ctx)

ASAN reports

READ of size 4 at 0x6120000106c8 thread T4 (rpc_poll-0)
    #0 0x7f77005e0f21 in process_cm_ctx prov/tcp/src/tcpx_conn_mgr.c:482
    #1 0x7f77005e15ef in tcpx_conn_mgr_run prov/tcp/src/tcpx_conn_mgr.c:535
    #2 0x7f77005fc429 in tcpx_eq_read prov/tcp/src/tcpx_eq.c:48
    #3 0x4926dd in fi_eq_read /home/bschubert/local/rhel7/libfabric/include/rdma/fi_eq.h:352

0x6120000106c8 is located 8 bytes inside of 280-byte region [0x6120000106c0,0x6120000107d8)
freed by thread T4 (rpc_poll-0) here:
    #0 0x7f77015915e7 in __interceptor_free
    #1 0x7f77005e083b in client_send_connreq prov/tcp/src/tcpx_conn_mgr.c:422
    #2 0x7f77005e0f7e in process_cm_ctx prov/tcp/src/tcpx_conn_mgr.c:487
    #3 0x7f77005e15ef in tcpx_conn_mgr_run prov/tcp/src/tcpx_conn_mgr.c:535
    #4 0x7f77005fc429 in tcpx_eq_read prov/tcp/src/tcpx_eq.c:48

previously allocated by thread T5 (rpc_conn_mgr) here:
    #0 0x7f7701591b7e in __interceptor_calloc
    #1 0x7f77005edb5c in tcpx_ep_connect prov/tcp/src/tcpx_ep.c:103
    #2 0x478b2f in fi_connect /home/bschubert/local/rhel7/libfabric/include/rdma/fi_cm.h:98

Signed-off-by: Bernd Schubert <[email protected]>
shijin-aws referenced this pull request in shijin-aws/libfabric Feb 17, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.