
Conversation

@reyoung (Collaborator) commented Jul 25, 2017

  • Use EnforceNotMet to unify all exception types.

@reyoung reyoung requested a review from gangliao July 25, 2017 09:00
@reyoung reyoung force-pushed the feature/unify_enforce_error_to_make_it_catchable branch 2 times, most recently from 29464d8 to 273f3ea on July 25, 2017 09:02
* Use EnforceNotMet to unify all exception types.
@reyoung reyoung force-pushed the feature/unify_enforce_error_to_make_it_catchable branch from 273f3ea to e3f5fdc on July 25, 2017 09:05
@gangliao (Contributor) left a comment

LGTM

@gangliao gangliao merged commit 2f6e7a5 into PaddlePaddle:develop Jul 26, 2017
@reyoung reyoung deleted the feature/unify_enforce_error_to_make_it_catchable branch August 2, 2017 11:35
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021
youge325 added a commit to youge325/Paddle that referenced this pull request Nov 9, 2025
…hen device_id is None

Analysis of the log `Coverage test/4_Test.txt` revealed the following key problems:

**Error message:**
```
UnimplementedError: Place Place(gpu:0) is not supported.
Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_IPU option
or check that your train process set the correct device id if you use Executor.
(at /paddle/paddle/phi/backends/context_pool.cc:77)
```

**Failing test:**
- `test_save_load_state_dict` (Test PaddlePaddle#3055)
- Specific test case: `test_save_safetensors_load_fc`

**Error stack trace:**
```python
File "save_safetensors_load_fc.py", line 120, in test_save_safetensors_load_fc
    load_state_dict(sharded_state_dict, ckpt_path, safetensors=True)
  ↓
File "load_state_dict.py", line 1354, in process_local_copy_tasks
    src_chunk_tensor.cuda()  # ← the failure originates here
  ↓
File "tensor_patch_methods.py", line 1174, in cuda
    res = self._copy_to(res_place, blocking)
```

In the `cuda()` method of tensor_patch_methods.py (around line 1167), when `device_id=None` and the current device is not a CUDA device, the code hard-codes `core.CUDAPlace(0)`:

```python
if not isinstance(res_place, core.CUDAPlace):
    res_place = core.CUDAPlace(0)  # ← always uses GPU 0
```

In a distributed training scenario (2 GPUs, `--devices 0,1`):
- The **Rank 0** process should use GPU 0 ✅
- The **Rank 1** process should use GPU 1 ✅
- But the code forces Rank 1 to use GPU 0 as well ❌

Because the Rank 1 process only initializes a device context for GPU 1, `context_pool.cc:77` throws the exception above as soon as GPU 0 is accessed.
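
A minimal sketch of the failing pattern (not the actual test; the tensor shape and the use of `paddle.set_device('cpu')` are illustrative assumptions to force a non-CUDA expected place on a rank-1 worker):

```python
import paddle

paddle.set_device('cpu')          # _current_expected_place() now returns a CPUPlace
t = paddle.randn([4, 4])          # e.g. a state-dict chunk that was loaded on the CPU
t_gpu = t.cuda()                  # device_id=None: the old code falls back to core.CUDAPlace(0);
                                  # per the analysis above, a rank-1 worker whose context pool
                                  # only holds GPU 1 then raises the UnimplementedError
```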

Modify lines 1159-1169 of tensor_patch_methods.py:

**Before:**
```python
def cuda(
    self: Tensor, device_id: int | None = None, blocking: bool = True
) -> Tensor:
    if device_id is None:
        res_place = framework._current_expected_place()
        if not isinstance(res_place, core.CUDAPlace):
            res_place = core.CUDAPlace(0)  # hard-coded GPU 0
```

**After:**
```python
def cuda(
    self: Tensor, device_id: int | None = None, blocking: bool = True
) -> Tensor:
    if device_id is None:
        res_place = framework._current_expected_place()
        if not isinstance(res_place, core.CUDAPlace):
            # In distributed training, use the current device from environment
            # or default to device 0 for single GPU scenarios
            import os
            local_rank = int(os.getenv('PADDLE_RANK_IN_NODE', '0'))
            res_place = core.CUDAPlace(local_rank)  # use the current process's GPU
```

- Reads the environment variable `PADDLE_RANK_IN_NODE` to obtain the current process's rank within the node
- This environment variable is set automatically by `paddle.distributed.launch`
- If it is not set (single-GPU scenario), it defaults to `'0'`
- Ensures each distributed process uses the correct GPU device (see the sketch below)
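
A small sanity-check sketch of the new fallback, assuming the process was started by `paddle.distributed.launch` so that `PADDLE_RANK_IN_NODE` is set (printed values are illustrative):

```python
import os

import paddle

# PADDLE_RANK_IN_NODE is exported by paddle.distributed.launch; unset means single GPU.
local_rank = int(os.getenv('PADDLE_RANK_IN_NODE', '0'))
print("rank in node:", local_rank)   # 1 on the second worker of a --devices 0,1 launch

paddle.set_device('cpu')             # force a non-CUDA expected place, as in the bug
t = paddle.randn([2, 2])
t_gpu = t.cuda()                     # with the patched fallback this targets CUDAPlace(local_rank)
print(t_gpu.place)                   # e.g. Place(gpu:1) on rank 1, Place(gpu:0) otherwise
```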

- **Fixed test**: `test_save_load_state_dict::test_save_safetensors_load_fc`
- **Applicable scenarios**: any distributed-training code that calls `tensor.cuda()` with `device_id=None`
- **Compatibility**: no impact on single-GPU training (the default remains 0)