Error Handling: refactor `ExecuteComputation` and `ExecuteReplicated` to propagate status. #9445

ysiraichi · 2025-07-03T20:25:54Z

This PR refactors `ComputationClient::ExecuteComputation` and `ComputationClient::ExecuteReplicated` so that they propagate error statuses.

Key Changes:

- Refactors `ExecuteComputation` and `ExecuteReplicated`: both now explicitly propagate `absl::Status`.
- Replaces raw `.value()` calls: uses `XLA_ASSIGN_OR_RETURN_WITH_LOCATION` for safer error handling.
- Updates call sites: uses `GetValueOrThrow` to turn a failed status into an exception.
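The pattern behind these changes can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `Status`, `StatusOr`, `SKETCH_ASSIGN_OR_RETURN_WITH_LOCATION`, `GetValueOrThrowSketch`, `Allocate`, and `ExecuteSketch` are all hypothetical stand-ins written so the example compiles without Abseil or torch_xla. The real `absl::StatusOr`, `XLA_ASSIGN_OR_RETURN_WITH_LOCATION`, and `GetValueOrThrow` may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, stripped-down stand-in for absl::Status.
class Status {
 public:
  Status() = default;  // Default-constructed status is OK.
  explicit Status(std::string message)
      : message_(std::move(message)), ok_(false) {}
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  std::string message_;
  bool ok_ = true;
};

// Hypothetical, stripped-down stand-in for absl::StatusOr<T>.
template <typename T>
class StatusOr {
 public:
  StatusOr(T value) : value_(std::move(value)) {}
  StatusOr(Status status) : status_(std::move(status)) {
    assert(!status_.ok());  // An error StatusOr must carry a non-OK status.
  }
  bool ok() const { return status_.ok(); }
  const Status& status() const { return status_; }
  T& value() {
    assert(ok());  // Calling value() on an error is the bug being avoided.
    return *value_;
  }

 private:
  Status status_;
  std::optional<T> value_;
};

// Appends the source location of the failing call to the error message.
inline Status WithLocation(const Status& s, const char* file, int line) {
  return Status(s.message() + " (at " + file + ":" + std::to_string(line) +
                ")");
}

#define SKETCH_CONCAT_IMPL(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_IMPL(a, b)

// Assumed shape of an "assign or return with location" macro: evaluate expr,
// return early with an annotated status on failure, otherwise bind the value.
#define SKETCH_ASSIGN_OR_RETURN_WITH_LOCATION(lhs, expr)                \
  auto SKETCH_CONCAT(statusor_, __LINE__) = (expr);                     \
  if (!SKETCH_CONCAT(statusor_, __LINE__).ok()) {                       \
    return WithLocation(SKETCH_CONCAT(statusor_, __LINE__).status(),    \
                        __FILE__, __LINE__);                            \
  }                                                                     \
  lhs = std::move(SKETCH_CONCAT(statusor_, __LINE__).value())

// Made-up allocator that fails above an arbitrary 1 MiB budget.
StatusOr<std::vector<float>> Allocate(std::uint64_t bytes) {
  constexpr std::uint64_t kLimit = 1ull << 20;
  if (bytes > kLimit) {
    return Status("Out of memory allocating " + std::to_string(bytes) +
                  " bytes.");
  }
  return std::vector<float>(bytes / sizeof(float));
}

// Propagates the allocator's error instead of calling .value() on it.
StatusOr<std::size_t> ExecuteSketch(std::uint64_t bytes) {
  SKETCH_ASSIGN_OR_RETURN_WITH_LOCATION(std::vector<float> buffer,
                                        Allocate(bytes));
  return buffer.size();
}

// Call-site helper: converts a failed status into an exception.
template <typename T>
T GetValueOrThrowSketch(StatusOr<T> result) {
  if (!result.ok()) throw std::runtime_error(result.status().message());
  return std::move(result.value());
}
```

In this sketch, `GetValueOrThrowSketch(ExecuteSketch(n))` returns the element count for small `n` and throws a `std::runtime_error` whose message ends with the file and line of the failing call for large `n`. Keeping the status as a return value inside the client and throwing only at the outermost call site is what allows the location to be attached on the way up.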

Example:

# Run this in eager mode + with C++ error context set.
a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)

Before this PR: no source code location

Traceback (most recent call last):
  File "/home/ysiraichi/ext/examples/mem.py", line 12, in <module>
    a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)
RuntimeError: Out of memory allocating 4503599627370496 bytes.

After this PR: source code location of the function call that raised this error

Traceback (most recent call last):
  File "/home/ysiraichi/ext/examples/mem.py", line 12, in <module>
    a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)
RuntimeError: Out of memory allocating 4503599627370496 bytes. (at torch_xla/csrc/runtime/pjrt_computation_client.cpp:761)

ysiraichi · 2025-07-03T20:32:04Z

Blocked until #9429 is merged.

Key changes:

- Updated base `ComputationClient` interface to return `absl::StatusOr<std::vector<DataPtr>>`
- Modified IFRT and PjRt implementations to use proper error propagation
- Replaced raw `.value()` calls with `XLA_ASSIGN_OR_RETURN_WITH_LOCATION` macros
- Updated all call sites to use `GetValueOrThrow` for exception-based error handling

ysiraichi force-pushed the ysiraichi/status-for-oom-eager-mode branch from 38b0ebf to bb72d7f · July 3, 2025 20:33

ysiraichi force-pushed the ysiraichi/status-for-oom-eager-mode branch from bb72d7f to 8f1ba5e · July 4, 2025 16:24
