NCCL v2.29.7 Release

@marksantesson released this 27 Feb 20:38
· 134 commits to master since this release

Device API & GIN Enhancements

  • Adds multi-context support for GIN, with the option to request exclusive GIN contexts.
  • Adds VA-based GIN signals plus strict window ordering.
  • Adds advanced queue control for GIN, including queue depth, manual credit management and aggregation.
  • Adds GIN support for platforms with no cross-rail connectivity.
  • Adds nLsaTeams to ncclCommQueryProperties.
  • Decouples GIN from NET plugin and topology.

New device APIs for convenience

  • Adds new device APIs for various device-side operations.
  • Introduces Copy, ReduceCopy, ReduceSum with various data types and ops.

Dynamic Memory Offload

  • Adds ncclCommSuspend() / ncclCommResume() for releasing/restoring communicator memory.
  • Adds basic memory overhead tracking infrastructure.

Built-in hybrid (LSA+GIN) symmetric kernel for ReduceScatter

  • Adds new hierarchical kernels to improve performance and scalability of ReduceScatter.
  • Requires symmetric memory registration and GIN support.
  • Symmetric GIN kernels can be disabled with NCCL_SYM_GIN_KERNELS_ENABLE=0.

Add support for Port Failover

  • Allows internal IB/RoCE plugin to continue working transparently when network errors occur.
  • Adds automatic port failover for GPUs having multiple local IB/RoCE ports/devices.
  • Can be enabled by setting NCCL_IB_RESILIENCY_PORT_FAILOVER=1.
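
The variable is normally exported in the job environment (for example `export NCCL_IB_RESILIENCY_PORT_FAILOVER=1`). For applications that prefer to set it programmatically, the following is a minimal sketch in C. It assumes NCCL reads the variable when a communicator is initialized, so the call has to happen before any communicator is created. The helper names `nccl_env_default` and `enable_ib_port_failover` are hypothetical and not part of NCCL.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: set an environment variable only if it is not
 * already set. Returns 0 on success, -1 on failure (as setenv does). */
static int nccl_env_default(const char *name, const char *value) {
    /* overwrite = 0 keeps any value exported by the launcher or job script. */
    return setenv(name, value, 0);
}

/* Hypothetical helper: request port failover for the internal IB/RoCE
 * plugin. Assumption: must run before the first NCCL communicator is
 * created, since NCCL is assumed to read the variable at init time. */
static int enable_ib_port_failover(void) {
    if (nccl_env_default("NCCL_IB_RESILIENCY_PORT_FAILOVER", "1") != 0) {
        perror("setenv");
        return -1;
    }
    return 0;
}
```

Passing `overwrite = 0` is a deliberate choice here: a value set explicitly by the user or the job script takes precedence over the application's default.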

Symmetric memory improvements

  • Adds support for abort in symmetric kernels.
  • Adds NCCL_CHECK_MODE=DEBUG to validate symmetric buffer registration.

Project layout reorganization

  • The ext-* directories are moved to plugins (e.g. ext-net → plugins/net).
  • ir and nccl4py are now under bindings.
  • examples is now docs/examples.

Other Improvements

  • Uses different signals for different peers in the GIN barrier.
  • Adds NCCL_NO_CACHE to force NCCL to always re-read selected env vars.
  • Adds CMake install and find_package support.
  • Adds CMake for NCCL4Py build and updates Cybind integration.
  • Adds preliminary backwards compatibility support to enable running LSA kernels compiled with NCCL 2.29.2/3 on NCCL 2.29.7. This is not supported for GIN yet.

Bug fixes

  • Fix problems related to the introduction of git_version.h. (GitHub Issue #1960)
  • Fix oneRankReduce when the number of elements is not a multiple of block number. (GitHub Issue #1950)
  • Improve GIN handling in ncclCommGetAsyncError. (GitHub Issue #2019)
  • Fix memory initialization in P2P transport. (GitHub Issue #1962)
  • Fix a hang in send/receive scheduling of repeated sparse patterns.
  • Fall back to the cudaMemcpyAsync API when the null/default stream is used for CE-based collective operations.
  • Free symmetric window objects automatically during commFree.
  • Fix a 16-bit overflow of signal and counter ids with GIN proxy.
  • Reset GIN counters and signals upon ncclDevCommDestroy.
  • Fix local data calculation during ncclGinIbP2PBarrier.
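
As a generic illustration of the class of bug behind the 16-bit overflow fix above, the sketch below shows how narrowing an id to 16 bits makes two distinct ids collide. This is not NCCL's code: the function names `pack_id_16` and `id_fits_16` are hypothetical, and the actual id encoding used by the GIN proxy is not described in these notes.

```c
#include <stdint.h>

/* Hypothetical: narrow a 32-bit id into a 16-bit field, the way an
 * encoder with too small a field would. Conversion to an unsigned
 * 16-bit type keeps only the low 16 bits (value modulo 65536). */
static uint16_t pack_id_16(uint32_t id) {
    return (uint16_t)id;
}

/* Returns 1 if the id survives the narrowing unchanged, 0 otherwise.
 * A check like this catches the overflow before ids start to alias. */
static int id_fits_16(uint32_t id) {
    return id <= UINT16_MAX;
}
```

With this encoding, ids 5 and 65541 (65536 + 5) both pack to 5, so an operation waiting on one would observe updates intended for the other.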

Other

  • Update license to Apache 2.0.

Known Limitations

  • Applications that use GIN APIs must be recompiled with 2.29.7 to work with the 2.29.7 runtime.
  • The Profiler Inspector example does not currently compile under CMake. This will be fixed soon.

Acknowledgments

We thank the following contributors for their work on this release:

We also thank the community for issue reports, testing, and feedback.