NCCL v2.29.7 Release
Device API & GIN Enhancements
- Adds multi-context support for GIN, with the option to request exclusive GIN contexts.
- Adds VA-based GIN signals plus strict window ordering.
- Adds advanced queue control for GIN, including queue depth, manual credit management and aggregation.
- Adds GIN support for platforms with no cross-rail connectivity.
- Adds nLsaTeams to ncclCommQueryProperties.
- Decouples GIN from NET plugin and topology.
New device APIs for convenience
- Adds new device APIs for various device-side operations.
- Introduces Copy, ReduceCopy, ReduceSum with various data types and ops.
Dynamic Memory Offload
- Adds ncclCommSuspend() / ncclCommResume() for releasing/restoring communicator memory.
- Adds basic memory overhead tracking infrastructure.
Built-in hybrid (LSA+GIN) symmetric kernel for ReduceScatter
- Adds new hierarchical kernels to improve performance and scalability of ReduceScatter.
- Requires symmetric memory registration and GIN support.
- Symmetric GIN kernels can be disabled with NCCL_SYM_GIN_KERNELS_ENABLE=0.
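When comparing the new hierarchical ReduceScatter kernels against the previous behavior, it can help to prepare two environments that differ only in this switch. The following is a minimal sketch, not part of NCCL: the helper name `sym_gin_env` is hypothetical, and it assumes the kernels are on by default whenever the variable is unset (the notes only state that `0` disables them).

```python
import os


def sym_gin_env(enabled, base=None):
    """Return a copy of an environment with the symmetric GIN kernel switch applied.

    Hypothetical helper. Assumes the kernels are active when the variable is
    unset; the release notes only document that the value 0 disables them.
    """
    env = dict(os.environ if base is None else base)
    if enabled:
        # Drop any override so NCCL falls back to its default behavior.
        env.pop("NCCL_SYM_GIN_KERNELS_ENABLE", None)
    else:
        env["NCCL_SYM_GIN_KERNELS_ENABLE"] = "0"
    return env


# Two environments that differ only in the kernel switch, for an A/B run.
baseline = sym_gin_env(enabled=True, base={"PATH": "/usr/bin"})
disabled = sym_gin_env(enabled=False, base={"PATH": "/usr/bin"})
```

Either dictionary can be passed as `env=` to `subprocess.run` when launching a benchmark.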
Add support for Port Failover
- Allows internal IB/RoCE plugin to continue working transparently when network errors occur.
- Adds automatic port failover for GPUs having multiple local IB/RoCE ports/devices.
- Can be enabled by setting NCCL_IB_RESILIENCY_PORT_FAILOVER=1.
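Because port failover is opt-in, the variable has to be present in the environment of every process that initializes NCCL. A minimal sketch of a launcher wrapper follows; `run_with_port_failover` is a hypothetical name, and the child command here is a stand-in that only echoes the variable rather than a real NCCL job.

```python
import os
import subprocess
import sys


def run_with_port_failover(cmd):
    """Run cmd with NCCL_IB_RESILIENCY_PORT_FAILOVER=1 added to the inherited environment."""
    env = dict(os.environ)
    env["NCCL_IB_RESILIENCY_PORT_FAILOVER"] = "1"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in for a real NCCL application: print the variable as the child sees it.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['NCCL_IB_RESILIENCY_PORT_FAILOVER'])",
]
result = run_with_port_failover(child)
```

With a multi-node launcher, the variable typically also needs to be forwarded to the remote ranks by whatever mechanism that launcher provides.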
Symmetric memory improvements
- Adds support for abort in symmetric kernels.
- Adds NCCL_CHECK_MODE=DEBUG to validate symmetric buffer registration.
Project layout reorganization
- The ext-* directories are moved to plugins (e.g. ext-net → plugins/net). ir and nccl4py are now under bindings. examples is now docs/examples.
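Build scripts and CI jobs that reference the old locations need their paths updated. The sketch below shows one way to rewrite repository-relative paths; `migrate_path` and `RENAMED` are hypothetical names, and the exact targets `bindings/ir` and `bindings/nccl4py` are assumptions inferred from "under bindings", so check them against the actual tree.

```python
from pathlib import PurePosixPath

# Assumed new locations for the renamed top-level directories.
RENAMED = {
    "ir": "bindings/ir",
    "nccl4py": "bindings/nccl4py",
    "examples": "docs/examples",
}


def migrate_path(old):
    """Map a repository-relative path from the old layout to the new one."""
    parts = PurePosixPath(old).parts
    if not parts:
        return old
    top, rest = parts[0], parts[1:]
    if top.startswith("ext-"):
        # ext-<name> becomes plugins/<name>, following the ext-net example.
        new_top = PurePosixPath("plugins") / top[len("ext-"):]
    elif top in RENAMED:
        new_top = PurePosixPath(RENAMED[top])
    else:
        # Paths outside the reorganized directories are left unchanged.
        return old
    return str(new_top.joinpath(*rest))


moved = migrate_path("ext-net/README.md")
```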
Other Improvements
- Uses different signals for different peers in the GIN barrier.
- Adds NCCL_NO_CACHE to force NCCL to always re-read selected env vars.
- Adds CMake install and find_package support.
- Adds CMake for NCCL4Py build and updates Cybind integration.
- Adds preliminary backwards compatibility support to enable running LSA kernels compiled with NCCL 2.29.2/3 on NCCL 2.29.7. This is not supported for GIN yet.
Bug fixes
- Fix problems related to the introduction of git_version.h. (GitHub Issue #1960)
- Fix oneRankReduce when the number of elements is not a multiple of block number. (GitHub Issue #1950)
- Improve GIN handling in ncclCommGetAsyncError. (GitHub Issue #2019)
- Fix memory initialization in P2P transport. (GitHub Issue #1962)
- Fix hang issue in send/receive scheduling of repeated sparse patterns.
- Fall back to cudaMemcpyAsync API when null/default stream is used for CE-based collective operations.
- Free symmetric window objects automatically during commFree.
- Fix a 16-bit overflow of signal and counter ids with GIN proxy.
- Reset GIN counters and signals upon ncclDevCommDestroy.
- Fix local data calculation during ncclGinIbP2PBarrier.
Other
- Update license to Apache 2.0.
Known Limitations
- Applications that use GIN APIs need to be recompiled with 2.29.7 to work with the 2.29.7 runtime.
- The Profiler Inspector example does not currently compile under CMake. This will be fixed soon.
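A deployment script can guard against the GIN recompilation requirement by comparing the version an application was built against with 2.29.7. The sketch below assumes the integer encoding used by recent `nccl.h` headers (major * 10000 + minor * 100 + patch, so 2.29.7 is 22907); verify this against the header you build with. Both function names are hypothetical.

```python
def decode_nccl_version(code):
    """Split an NCCL version code such as 22907 into (major, minor, patch).

    Assumes the major * 10000 + minor * 100 + patch encoding.
    """
    return code // 10000, (code % 10000) // 100, code % 100


def gin_build_matches_2_29_7(built_code):
    """Return True if a GIN application was compiled against NCCL 2.29.7."""
    return decode_nccl_version(built_code) == (2, 29, 7)


needs_rebuild = not gin_build_matches_2_29_7(22903)
```

The build-time code would come from the `NCCL_VERSION_CODE` macro recorded at compile time; the runtime value can be obtained with `ncclGetVersion`.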
Acknowledgments
We thank the following contributors for their work on this release:
- @sphish, @LyricZhao for their contribution on improving the NCCL device API.
- @ruizhang1230, @Zhaojp-Frank, @guoyuhong, @argentea and the Amem project (https://github.com/inclusionAI/asystem-amem) for their contribution on dynamic memory offload.
We also thank the community for issue reports, testing, and feedback.