@Scusemua Scusemua commented Jan 9, 2026

Summary:
**TL;DR:** Adds a convenience macro `FB_CUDACHECKTHROW_EX_NOCOMM` that wraps `FB_CUDACHECKTHROW_EX` with `std::nullopt` values for use in contexts where communicator information (rank, commHash, commDesc) is not available.

---

# Detailed Overview

## Context & Motivation

As part of MCCL's fault tolerance efforts (T247416168), we are migrating CUDA error handling in CTRAN from `std::runtime_error` to `ctran::utils::Exception`. The previous diff (D90191648) introduced `FB_CUDACHECKTHROW_EX`, which requires rank, comm hash, and description parameters for enhanced error reporting.

However, there are contexts within the codebase where these communicator-specific values are not available (e.g., utility functions, initialization code, or standalone CUDA operations). Requiring callers to manually pass `std::nullopt` for all three parameters in these cases is verbose and error-prone.
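
For illustration only, the wrapper pattern described above might look roughly like the following. This is a minimal sketch, not the code in this diff: the `nocomm_sketch` namespace, the `FakeCudaError` enum (a stand-in for `cudaError_t` so the example builds without the CUDA toolkit), the `Exception` struct, the optional field types, and the `DEMO_*` macro names are all hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

namespace nocomm_sketch {

// Hypothetical stand-in for cudaError_t; not part of CUDA or CTRAN.
enum class FakeCudaError { kSuccess = 0, kInvalidValue = 1 };

// Hypothetical structured exception carrying optional communicator context.
// The real ctran::utils::Exception may look quite different.
struct Exception : std::runtime_error {
  Exception(const std::string& what, std::optional<int> r,
            std::optional<std::uint64_t> h, std::optional<std::string> d)
      : std::runtime_error(what), rank(r), commHash(h), commDesc(std::move(d)) {}
  std::optional<int> rank;
  std::optional<std::uint64_t> commHash;
  std::optional<std::string> commDesc;
};

// Pretend CUDA call that either succeeds or fails on request.
inline FakeCudaError fakeCudaCall(bool ok) {
  return ok ? FakeCudaError::kSuccess : FakeCudaError::kInvalidValue;
}

}  // namespace nocomm_sketch

// Assumed shape of the four-argument macro: evaluate cmd once, throw on failure.
#define DEMO_CUDACHECKTHROW_EX(cmd, rank, commHash, commDesc)            \
  do {                                                                   \
    const auto demoErr_ = (cmd);                                         \
    if (demoErr_ != nocomm_sketch::FakeCudaError::kSuccess) {            \
      throw nocomm_sketch::Exception("CUDA failure in " #cmd, (rank),    \
                                     (commHash), (commDesc));            \
    }                                                                    \
  } while (0)

// Convenience wrapper: forwards to the macro above with no communicator info.
#define DEMO_CUDACHECKTHROW_EX_NOCOMM(cmd) \
  DEMO_CUDACHECKTHROW_EX(cmd, std::nullopt, std::nullopt, std::nullopt)

namespace nocomm_sketch {

// Returns true if a failing call throws an Exception with all fields empty.
inline bool throwsWithoutCommInfo() {
  try {
    DEMO_CUDACHECKTHROW_EX_NOCOMM(nocomm_sketch::fakeCudaCall(false));
  } catch (const Exception& e) {
    return !e.rank.has_value() && !e.commHash.has_value() &&
           !e.commDesc.has_value();
  }
  return false;
}

// Returns true if a successful call passes through without throwing.
inline bool succeedsQuietly() {
  try {
    DEMO_CUDACHECKTHROW_EX_NOCOMM(nocomm_sketch::fakeCudaCall(true));
  } catch (...) {
    return false;
  }
  return true;
}

}  // namespace nocomm_sketch
```

In this sketch the one-argument form keeps call sites short and removes the chance of passing the three `std::nullopt` values in the wrong position or forgetting one.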

Reviewed By: arttianezhu

Differential Revision: D90192768

Ben Carver added 2 commits January 9, 2026 08:06
Summary:
**TL;DR:** Adds a new `FB_CUDACHECKTHROW_EX` macro that throws `ctran::utils::Exception` instead of `std::runtime_error` on CUDA failures, enabling richer error context (rank, comm hash, description) for improved fault tolerance debugging.

---

# Detailed Overview

## Context & Motivation

As part of MCCL's fault tolerance efforts (T247416168), we are migrating error handling in CTRAN from `std::runtime_error` to `ctran::utils::Exception`. This migration enables structured error reporting with additional metadata (rank, comm hash, operation description) that is critical for diagnosing failures in distributed communication workloads.

The existing `FB_CUDACHECKTHROW` macro throws a generic `std::runtime_error` when CUDA calls fail, which lacks the context needed for effective fault tolerance debugging. This diff introduces an extended version, `FB_CUDACHECKTHROW_EX`, that preserves backward compatibility while enabling enhanced error reporting for callers that have access to communicator context.
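
As a rough illustration of what a context-carrying check macro can look like, here is a minimal sketch under stated assumptions. It is not the macro added by this diff: the `sketch` namespace, the `FakeCudaError` enum (used in place of `cudaError_t` so the example compiles without CUDA), the `Exception` class, its accessor names, the optional field types, and `SKETCH_CUDACHECKTHROW_EX` are all hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical stand-in for cudaError_t; not part of CUDA or CTRAN.
enum class FakeCudaError { kSuccess = 0, kInvalidValue = 1 };

// Hypothetical exception type. It derives from std::runtime_error so that
// existing catch sites for std::runtime_error would still match.
class Exception : public std::runtime_error {
 public:
  Exception(const std::string& what, std::optional<int> rank,
            std::optional<std::uint64_t> commHash,
            std::optional<std::string> commDesc)
      : std::runtime_error(what),
        rank_(rank),
        commHash_(commHash),
        commDesc_(std::move(commDesc)) {}

  const std::optional<int>& rank() const { return rank_; }
  const std::optional<std::uint64_t>& commHash() const { return commHash_; }
  const std::optional<std::string>& commDesc() const { return commDesc_; }

 private:
  std::optional<int> rank_;
  std::optional<std::uint64_t> commHash_;
  std::optional<std::string> commDesc_;
};

}  // namespace sketch

// Evaluates cmd once; on failure, throws with file/line plus communicator context.
#define SKETCH_CUDACHECKTHROW_EX(cmd, rank, commHash, commDesc)             \
  do {                                                                      \
    const sketch::FakeCudaError sketchErr_ = (cmd);                         \
    if (sketchErr_ != sketch::FakeCudaError::kSuccess) {                    \
      std::ostringstream sketchMsg_;                                        \
      sketchMsg_ << __FILE__ << ":" << __LINE__ << " CUDA failure, code "   \
                 << static_cast<int>(sketchErr_);                           \
      throw sketch::Exception(sketchMsg_.str(), (rank), (commHash),         \
                              (commDesc));                                  \
    }                                                                       \
  } while (0)

namespace sketch {

// Triggers a failure with rank 3 and returns the rank recorded on the exception.
inline std::optional<int> rankFromFailure() {
  try {
    SKETCH_CUDACHECKTHROW_EX(FakeCudaError::kInvalidValue, 3, 0xabcULL, "demo");
  } catch (const Exception& e) {
    return e.rank();
  }
  return std::nullopt;
}

// Returns true if a successful status does not throw.
inline bool successDoesNotThrow() {
  try {
    SKETCH_CUDACHECKTHROW_EX(FakeCudaError::kSuccess, 0, std::nullopt,
                             std::nullopt);
  } catch (...) {
    return false;
  }
  return true;
}

}  // namespace sketch
```

A caller that catches the structured type can read the rank, hash, and description directly instead of parsing them out of a message string.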

Differential Revision: D90191648
meta-codesync bot commented Jan 9, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90192768.

meta-cla bot added the CLA Signed label Jan 9, 2026
Labels

CLA Signed, fb-exported, meta-exported