Error Handling: refactor ExecuteComputation and ExecuteReplicated to propagate status. #9445

Draft: wants to merge 1 commit into base: ysiraichi/status-for-oom-errors

ysiraichi (Collaborator) commented Jul 3, 2025

This PR refactors both ComputationClient::ExecuteComputation and ComputationClient::ExecuteReplicated so that they propagate error statuses to their callers.

Key Changes:

  • Refactors ExecuteComputation and ExecuteReplicated: Now explicitly propagates absl::Status.
  • Replaces Raw .value() Calls: Uses XLA_ASSIGN_OR_RETURN_WITH_LOCATION for safer error handling.
  • Updates Call Sites: Uses GetValueOrThrow so that an error status surfaces as an exception at the call site.
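
The pattern in the list above can be sketched without any dependencies. This is only an illustration under stated assumptions, not the code in this PR: `ToyStatusOr`, `ToyError`, `TOY_ASSIGN_OR_RETURN`, `ValueOrThrowSketch`, `AllocateSketch`, and `ExecuteSketch` are hypothetical stand-ins for `absl::StatusOr`, `XLA_ASSIGN_OR_RETURN_WITH_LOCATION`, `GetValueOrThrow`, and the client methods, whose real signatures may differ.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error carrier (stand-in for a non-OK absl::Status).
struct ToyError {
  std::string message;
};

// Hypothetical stand-in for absl::StatusOr<T>: holds a value or an error.
template <typename T>
class ToyStatusOr {
 public:
  ToyStatusOr(T value) : value_(std::move(value)) {}
  ToyStatusOr(ToyError error) : error_(std::move(error.message)) {}

  bool ok() const { return value_.has_value(); }
  const std::string& error() const { return error_; }
  T& value() { return *value_; }  // Only valid when ok() is true.

 private:
  std::optional<T> value_;
  std::string error_;
};

// Hypothetical assign-or-return macro: on error, hand the error back to the
// caller instead of calling `.value()` on a failed result.
#define TOY_ASSIGN_OR_RETURN(lhs, expr)      \
  auto lhs##_or = (expr);                    \
  if (!lhs##_or.ok()) {                      \
    return ToyError{lhs##_or.error()};       \
  }                                          \
  auto& lhs = lhs##_or.value()

// Fails for requests above an arbitrary limit, mimicking an OOM error.
ToyStatusOr<std::vector<char>> AllocateSketch(std::size_t bytes) {
  constexpr std::size_t kLimit = 1024;
  if (bytes > kLimit) {
    return ToyError{"Out of memory allocating " + std::to_string(bytes) +
                    " bytes."};
  }
  return std::vector<char>(bytes);
}

// Returns a status-or-value instead of crashing when allocation fails.
ToyStatusOr<std::size_t> ExecuteSketch(std::size_t bytes) {
  TOY_ASSIGN_OR_RETURN(buffer, AllocateSketch(bytes));
  return buffer.size();
}

// Call-site helper: unwraps the value or throws, so the error can surface
// as an exception.
template <typename T>
T ValueOrThrowSketch(ToyStatusOr<T> result) {
  if (!result.ok()) {
    throw std::runtime_error(result.error());
  }
  return std::move(result.value());
}
```

A caller would write `ValueOrThrowSketch(ExecuteSketch(bytes))`. Intermediate layers only forward the error, and the decision to throw is made once, at the boundary where an exception is wanted.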

Example:

# Run this in eager mode + with C++ error context set.
a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)

Before this PR: the error message has no source code location.

Traceback (most recent call last):
  File "/home/ysiraichi/ext/examples/mem.py", line 12, in <module>
    a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)
RuntimeError: Out of memory allocating 4503599627370496 bytes. 

After this PR: the error message ends with the source code location of the function call that raised the error.

Traceback (most recent call last):
  File "/home/ysiraichi/ext/examples/mem.py", line 12, in <module>
    a = torch.rand(1024, 1024, 1024, 1024, 1024, device=device)
RuntimeError: Out of memory allocating 4503599627370496 bytes. (at torch_xla/csrc/runtime/pjrt_computation_client.cpp:761)
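
One way a suffix like `(at file:line)` could be produced is with the standard `__FILE__` and `__LINE__` macros. The sketch below is an assumption about the general technique, not the implementation of `XLA_ASSIGN_OR_RETURN_WITH_LOCATION`; `AnnotateWithLocation` and `ANNOTATE_HERE` are hypothetical names.

```cpp
#include <string>

// Hypothetical helper: appends "(at <file>:<line>)" to an error message.
inline std::string AnnotateWithLocation(const std::string& message,
                                        const char* file, int line) {
  return message + " (at " + file + ":" + std::to_string(line) + ")";
}

// Hypothetical macro: captures the location of the line where it is expanded,
// which is why this has to be a macro rather than a plain function call.
#define ANNOTATE_HERE(message) \
  AnnotateWithLocation((message), __FILE__, __LINE__)
```

Because the macro expands at the call site, the recorded location points at the statement that observed the failure rather than at the helper itself.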

ysiraichi (Collaborator, Author) commented:

Blocked until #9429 is merged.

ysiraichi force-pushed the ysiraichi/status-for-oom-eager-mode branch from 38b0ebf to bb72d7f on July 3, 2025 20:33
Key changes:
- Updated base `ComputationClient` interface to return `absl::StatusOr<std::vector<DataPtr>>`
- Modified IFRT and PjRt implementations to use proper error propagation
- Replaced raw `.value()` calls with `XLA_ASSIGN_OR_RETURN_WITH_LOCATION` macros
- Updated all call sites to use `GetValueOrThrow` for exception-based error handling
ysiraichi force-pushed the ysiraichi/status-for-oom-eager-mode branch from bb72d7f to 8f1ba5e on July 4, 2025 16:24