forked from libfabric-test1/libfabric
prov/efa: Changes for efa + shm peer provider integration #1
Merged
Signed-off-by: aws-ceenugal <[email protected]>
Signed-off-by: aws-ceenugal <[email protected]>
It is unnecessary to call the progress engine immediately after triggering a handshake. rxr_msg_generic_send already does this when an error is returned. Signed-off-by: Wenduo Wang <[email protected]>
Use the newly introduced FI_OPT_CUDA_API_PERMITTED option to replace the environment variable FI_HMEM_CUDA_ENABLE_XFER. Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Wei Zhang <[email protected]>
This patch makes two changes to the option FI_OPT_CUDA_API_PERMITTED. First, it clarifies that setting this option may return -FI_EINVAL if either the CUDA library or a CUDA device is not available. Second, it promises that all providers that support the FI_HMEM capability implement this option. Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: OFIWG Bot <[email protected]>
…n-pages-main Update nroff-generated man pages
shm should call the start_msg function for the peer that was matched to and queued the unexpected message in the first place (saved in the rx_entry), not the owner srx. Signed-off-by: Alexia Ingerson <[email protected]>
Separate the msg and tag generic receive paths. This removes redundant checks in both paths and also fixes the peer srx start call which was incorrectly always calling the start_msg function. In the tagged case, we should be calling the start_tag function instead. Signed-off-by: Alexia Ingerson <[email protected]>
The previous code was incorrectly calling discard_msg for both queues. The queued tagged messages should be calling the discard_tag call instead of the discard_msg call. Signed-off-by: Alexia Ingerson <[email protected]>
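The queue-specific discard described above can be sketched as follows. This is a minimal model, not the shm provider's code: `struct rx_entry`, `struct peer_ops`, `flush_queue`, `flush_unexpected`, and the counting callbacks are hypothetical stand-ins; only the callback names `discard_msg` and `discard_tag` come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real libfabric peer srx types. */
struct rx_entry {
    struct rx_entry *next;
    int id;
};

struct peer_ops {
    int (*discard_msg)(struct rx_entry *entry);
    int (*discard_tag)(struct rx_entry *entry);
};

/* Demo callbacks that only count how often each one is invoked. */
static int msg_discards, tag_discards;
static int count_discard_msg(struct rx_entry *e) { (void)e; msg_discards++; return 0; }
static int count_discard_tag(struct rx_entry *e) { (void)e; tag_discards++; return 0; }

/* Drain one unexpected queue with the discard callback chosen by the caller. */
static void flush_queue(struct rx_entry *head, int (*discard)(struct rx_entry *))
{
    while (head) {
        struct rx_entry *next = head->next; /* read before the callback runs */
        discard(head);
        head = next;
    }
}

/* Each queue gets the callback that matches its type. */
static void flush_unexpected(const struct peer_ops *ops,
                             struct rx_entry *msg_queue,
                             struct rx_entry *tag_queue)
{
    flush_queue(msg_queue, ops->discard_msg);
    flush_queue(tag_queue, ops->discard_tag); /* the bug used discard_msg here */
}
```

Passing the callback into a shared helper keeps the traversal in one place, so the only thing that differs between the two queues is the function pointer.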
172c488 to 585ee4a
When xnet_op_read_rsp() is executed, ep->cur_rx.entry is set to the head element of the list 'rma_read_queue'. The RX entry is effectively removed from the list only after the operation completes. This pattern becomes problematic if the EP is disabled before the RX completes (i.e., EP disabled because of a failing TX entry or an explicit shutdown). In this case, xnet_ep_flush_all_queues would complete the same RX entry twice: once when rma_read_queue is flushed (the RX entry is still listed here) and another time in the block for the condition "if (ep->cur_rx.entry)". The patch removes the RX entry from rma_read_queue before it gets assigned to ep->cur_rx.entry. It guarantees that the RX entry will be completed only once if the EP is disabled. Signed-off-by: Sylvain Didelot <[email protected]>
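The ownership rule behind this fix can be illustrated with a small model. This is a sketch under assumptions, not the xnet code: the structs and the functions `start_read_rsp` and `flush_all` are hypothetical simplifications, and a `completions` counter stands in for writing a completion. The point is that an entry is unlinked from the queue before it becomes the current RX, so a flush can reach it through only one path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of an endpoint with a read queue. */
struct rx_entry {
    struct rx_entry *next;
    int completions;             /* how many times this entry was completed */
};

struct ep {
    struct rx_entry *rma_read_queue; /* singly linked list, head first */
    struct rx_entry *cur_rx;         /* entry currently being processed */
};

/* Unlink the head BEFORE making it current, so exactly one place owns it. */
static void start_read_rsp(struct ep *ep)
{
    struct rx_entry *entry = ep->rma_read_queue;

    if (!entry)
        return;
    ep->rma_read_queue = entry->next;
    entry->next = NULL;
    ep->cur_rx = entry;
}

/* Flush on disable: complete everything still queued, then the current RX. */
static void flush_all(struct ep *ep)
{
    while (ep->rma_read_queue) {
        struct rx_entry *entry = ep->rma_read_queue;

        ep->rma_read_queue = entry->next;
        entry->next = NULL;
        entry->completions++;
    }
    if (ep->cur_rx) {
        ep->cur_rx->completions++;
        ep->cur_rx = NULL;
    }
}
```

If `start_read_rsp` only peeked at the head instead of unlinking it, `flush_all` would complete that entry twice, which is the failure the commit message describes.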
Signed-off-by: Kyle Gerheiser <[email protected]>
Include ofi_hmem.h to fix compilation issues on ROCR enabled systems. Particularly: implicit declaration of function ofi_hsa_amd_dereg_... Signed-off-by: Amir Shehata <[email protected]>
Update the SHM provider to use the ROCR HMEM asynchronous memory
operations.
. Unify the ipc and sar freestack, since they use the same structure.
. When progressing an IPC operation, check if the device is ROCR and trigger
an asynchronous operation.
. When an asynchronous operation is queued, create an IPC entry and
add it to the queue of pending operations.
. During the top level progress loop, check the queue of pending asynchronous
operations and query the state of each one. Generate a completion event
for finished operations. Since completions happen outside the context
of libfabric, we can't rely on the ep->region->signal flag being set.
Always check the pending queue. This shouldn't introduce much
overhead if the queue is empty.
. Use the ep->util_ep lock to protect the free stack and the ipc list
of pending operations.
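The pending-queue polling described in the list above might look roughly like this. It is a single-threaded sketch with hypothetical names (`struct pending_op`, `queue_async`, `progress_pending`, `demo_is_done`); the real provider queries the ROCR runtime for the operation state and holds the ep->util_ep lock around the list, both of which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued asynchronous copy. */
struct pending_op {
    struct pending_op *next;
    bool (*is_done)(const struct pending_op *op); /* stand-in status query */
    int ticks_left;                               /* demo state only */
};

struct ep {
    struct pending_op *pending; /* list of in-flight async operations */
    int completions;            /* completion events generated so far */
};

/* Demo status query: an operation is finished once ticks_left reaches 0. */
static bool demo_is_done(const struct pending_op *op)
{
    return op->ticks_left == 0;
}

/* Record a newly started asynchronous operation. */
static void queue_async(struct ep *ep, struct pending_op *op)
{
    op->next = ep->pending;
    ep->pending = op;
}

/*
 * Called from every pass of the progress loop, regardless of any signal
 * flag: the operations finish outside the library, so nothing else tells
 * us to look. With an empty list this is a single pointer check.
 */
static void progress_pending(struct ep *ep)
{
    struct pending_op **link = &ep->pending;

    while (*link) {
        struct pending_op *op = *link;

        if (op->is_done(op)) {
            *link = op->next;   /* unlink the finished operation */
            op->next = NULL;
            ep->completions++;  /* stands in for writing a completion */
        } else {
            link = &op->next;   /* still running; keep it queued */
        }
    }
}
```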
Signed-off-by: Amir Shehata <[email protected]>
Prior to this patch, smr_generic_rma() wrote an error completion for any error returned by smr_fast_rma(). This patch makes smr_generic_rma() return -FI_EAGAIN if smr_fast_rma() returns -FI_EAGAIN, because -FI_EAGAIN means the operation has not yet completed. Signed-off-by: Wei Zhang <[email protected]>
Prior to this patch, smr_generic_rma() called smr_write_error_comp() with a negative errno, causing the output cq entry to have a negative errno. This patch addresses the issue by using a positive errno. Signed-off-by: Wei Zhang <[email protected]>
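The error-handling rules from the two smr_generic_rma() fixes can be sketched together. Everything here is a hypothetical model: `struct cq_err`, `write_error_comp`, and `finish_rma` are illustrative names, and the standard `EAGAIN` and `EIO` codes stand in for the libfabric error codes. What the real function returns after writing an error completion is not stated in the commit messages; this sketch simply passes the original return value through.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a cq that records error completions. */
struct cq_err {
    int count;    /* number of error completions written */
    int last_err; /* errno stored in the most recent one */
};

/* Error completions carry a positive errno value. */
static void write_error_comp(struct cq_err *cq, int err)
{
    assert(err > 0);
    cq->count++;
    cq->last_err = err;
}

/*
 * ret is the negative-errno result of a (hypothetical) fast-path attempt.
 * -EAGAIN means "not completed yet": return it so the caller can retry,
 * and write no completion. Any other failure is reported once, with the
 * sign flipped so the cq entry holds a positive errno.
 */
static int finish_rma(struct cq_err *cq, int ret)
{
    if (ret == -EAGAIN)
        return ret;
    if (ret < 0)
        write_error_comp(cq, -ret);
    return ret; /* assumption: pass the result through unchanged */
}
```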
For a peer cq, fi_cq_read is not expected to access the cirq and read cqes. It should only progress the cq. Signed-off-by: Shi Jin <[email protected]>
2d20ad1 to eb31b62
Implement all the functions in efa_rdm_srx_owner_ops. Also refactor the code in rxr_pkt_proc_msgrtm() and rxr_pkt_proc_tagrtm() so they can be reused by efa_rdm_srx_owner_ops. Signed-off-by: Shi Jin <[email protected]>
Signed-off-by: Shi Jin <[email protected]>
Signed-off-by: Shi Jin <[email protected]>
After using shm provider as a peer, efa provider can post fi_sendmsg/fi_tsendmsg directly to shm ep, without involving RDM protocols. Also there is no need to pre-post fi_recv for shm provider as it will share the rx posted to efa ep. Signed-off-by: Shi Jin <[email protected]>
To share the efa cq with the shm provider, we need to move shm's fi_cq_open call into efa_rdm_cq_open, because an application could create multiple efa cqs and each efa cq must be shared with its corresponding shm cq. In the meantime, there is no need to poll the shm cq explicitly in rxr_ep_progress; instead, we just need to call fi_cq_read(shm_cq, NULL, 0) inside efa's fi_cq_read() to progress the shm cq manually. Signed-off-by: Shi Jin <[email protected]>
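The control flow of a shared cq can be modelled without libfabric. In this sketch all types and functions (`struct mock_cq`, `struct mock_shm_cq`, `mock_shm_progress`, `owner_cq_read`) are hypothetical; `mock_shm_progress` plays the role of the zero-count fi_cq_read(shm_cq, NULL, 0) call, and integer counters replace real completion entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owner cq: completions that are ready to be read. */
struct mock_cq {
    int ready;
};

/* Hypothetical peer cq: it has no storage of its own for completions. */
struct mock_shm_cq {
    struct mock_cq *owner; /* the cq this peer writes completions into */
    int inflight;          /* operations that finish on the next progress */
    int progress_calls;
};

/* Progress only: finished operations are reported into the owner cq. */
static void mock_shm_progress(struct mock_shm_cq *shm)
{
    shm->progress_calls++;
    shm->owner->ready += shm->inflight;
    shm->inflight = 0;
}

/*
 * Owner-side read: first drive the peer (nothing is read from it), then
 * drain up to count completions from the owner cq, which now holds the
 * completions of both providers.
 */
static size_t owner_cq_read(struct mock_cq *cq, struct mock_shm_cq *shm,
                            int *buf, size_t count)
{
    size_t n = 0;

    if (shm)
        mock_shm_progress(shm);
    while (n < count && cq->ready > 0)
        buf[n++] = cq->ready--;
    return n;
}
```

Because each owner cq drives its own peer, an application with several cqs gets one peer per cq, which matches the reason given for opening the shm cq inside the efa cq open path.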
In smr_srx_context, shm should not modify the peer_ops of the imported srx. It should allocate a new srx and set it there. Signed-off-by: Shi Jin <[email protected]>
410cc9b to 3d77c02
shijin-aws pushed a commit that referenced this pull request Sep 12, 2023
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0 0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1 0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
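The shape of this fix can be shown with a reduced model. The types and functions below (`struct xfer_entry`, `complete_saved`, `recv_saved`, `inc_rx_cntr`) are hypothetical simplifications of the frames in the backtrace; only the field name `cntr_inc` is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced transfer entry: only the counter callback matters. */
struct xfer_entry {
    void (*cntr_inc)(int *cntr); /* how to bump the counter on completion */
};

/* Demo callback for posted receives. */
static void inc_rx_cntr(int *cntr)
{
    (*cntr)++;
}

/* Completing an entry requires a valid callback, as in the failing assert. */
static void complete_saved(struct xfer_entry *saved, int *rx_cntr)
{
    assert(saved->cntr_inc != NULL);
    saved->cntr_inc(rx_cntr);
}

/*
 * A posted receive matched a saved (unexpected) one. The saved entry was
 * created before any receive was posted, so it has no callback yet; copy
 * the posted receive's callback before completing it.
 */
static void recv_saved(struct xfer_entry *saved,
                       const struct xfer_entry *posted, int *rx_cntr)
{
    saved->cntr_inc = posted->cntr_inc;
    complete_saved(saved, rx_cntr);
}
```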
No description provided.