
Conversation

@reyoung (Collaborator) commented Jul 25, 2017

  • Use EnforceNotMet to unify all exception types.

@reyoung reyoung requested a review from gangliao July 25, 2017 09:00
@reyoung reyoung force-pushed the feature/unify_enforce_error_to_make_it_catchable branch 2 times, most recently from 29464d8 to 273f3ea on July 25, 2017 09:02
* Use EnforceNotMet to unify all exception types.
@reyoung reyoung force-pushed the feature/unify_enforce_error_to_make_it_catchable branch from 273f3ea to e3f5fdc on July 25, 2017 09:05
@gangliao (Contributor) left a comment

LGTM

@gangliao gangliao merged commit 2f6e7a5 into PaddlePaddle:develop Jul 26, 2017
@reyoung reyoung deleted the feature/unify_enforce_error_to_make_it_catchable branch August 2, 2017 11:35
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021
youge325 added a commit to youge325/Paddle that referenced this pull request Nov 9, 2025
…hen device_id is None

Analysis of the log `Coverage test/4_Test.txt` revealed the following key problems:

**Error message:**
```
UnimplementedError: Place Place(gpu:0) is not supported.
Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_IPU option
or check that your train process set the correct device id if you use Executor.
(at /paddle/paddle/phi/backends/context_pool.cc:77)
```

**Failing test:**
- `test_save_load_state_dict` (Test PaddlePaddle#3055)
- Specific test case: `test_save_safetensors_load_fc`

**Error stack trace:**
```python
File "save_safetensors_load_fc.py", line 120, in test_save_safetensors_load_fc
    load_state_dict(sharded_state_dict, ckpt_path, safetensors=True)
  ↓
File "load_state_dict.py", line 1354, in process_local_copy_tasks
    src_chunk_tensor.cuda()  # ← the failure originates here
  ↓
File "tensor_patch_methods.py", line 1174, in cuda
    res = self._copy_to(res_place, blocking)
```

In the `cuda()` method of tensor_patch_methods.py (around line 1167), when `device_id=None` and the current device is not a CUDA device, the code hard-codes `core.CUDAPlace(0)`:

```python
if not isinstance(res_place, core.CUDAPlace):
    res_place = core.CUDAPlace(0)  # ← always uses GPU 0
```

In a distributed training scenario (2 GPUs, `--devices 0,1`):
- The **Rank 0** process should use GPU 0 ✅
- The **Rank 1** process should use GPU 1 ✅
- But the code forces Rank 1 to use GPU 0 as well ❌

Because the Rank 1 process only initializes a device context for GPU 1, `context_pool.cc:77` throws the exception above as soon as GPU 0 is accessed.
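
A minimal sketch of the failing pattern (not the actual test; the tensor shape and the use of `paddle.set_device('cpu')` are illustrative assumptions to force a non-CUDA expected place on a rank-1 worker):

```python
import paddle

paddle.set_device('cpu')          # _current_expected_place() now returns a CPUPlace
t = paddle.randn([4, 4])          # e.g. a state-dict chunk that was loaded on the CPU
t_gpu = t.cuda()                  # device_id=None: the old code falls back to core.CUDAPlace(0);
                                  # per the analysis above, a rank-1 worker whose context pool
                                  # only holds GPU 1 then raises the UnimplementedError
```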

Modify lines 1159-1169 of tensor_patch_methods.py:

**Before:**
```python
def cuda(
    self: Tensor, device_id: int | None = None, blocking: bool = True
) -> Tensor:
    if device_id is None:
        res_place = framework._current_expected_place()
        if not isinstance(res_place, core.CUDAPlace):
            res_place = core.CUDAPlace(0)  # hard-coded GPU 0
```

**After:**
```python
def cuda(
    self: Tensor, device_id: int | None = None, blocking: bool = True
) -> Tensor:
    if device_id is None:
        res_place = framework._current_expected_place()
        if not isinstance(res_place, core.CUDAPlace):
            # In distributed training, use the current device from environment
            # or default to device 0 for single GPU scenarios
            import os
            local_rank = int(os.getenv('PADDLE_RANK_IN_NODE', '0'))
            res_place = core.CUDAPlace(local_rank)  # use the current process's GPU
```

- Reads the environment variable `PADDLE_RANK_IN_NODE` to obtain the current process's rank within the node
- This environment variable is set automatically by `paddle.distributed.launch`
- If it is not set (single-GPU scenario), it defaults to `'0'`
- Ensures each distributed process uses the correct GPU device (see the sketch below)
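
A small sanity-check sketch of the new fallback, assuming the process was started by `paddle.distributed.launch` so that `PADDLE_RANK_IN_NODE` is set (printed values are illustrative):

```python
import os

import paddle

# PADDLE_RANK_IN_NODE is exported by paddle.distributed.launch; unset means single GPU.
local_rank = int(os.getenv('PADDLE_RANK_IN_NODE', '0'))
print("rank in node:", local_rank)   # 1 on the second worker of a --devices 0,1 launch

paddle.set_device('cpu')             # force a non-CUDA expected place, as in the bug
t = paddle.randn([2, 2])
t_gpu = t.cuda()                     # with the patched fallback this targets CUDAPlace(local_rank)
print(t_gpu.place)                   # e.g. Place(gpu:1) on rank 1, Place(gpu:0) otherwise
```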

- **Fixed test**: `test_save_load_state_dict::test_save_safetensors_load_fc`
- **Applicable scenarios**: any distributed-training code that calls `tensor.cuda()` with `device_id=None`
- **Compatibility**: no impact on single-GPU training (the default remains 0)