I got the following error. It happened both when I installed modded-nanogpt directly on the machine and when I ran it via Docker (the Docker run was to rule out a host configuration error).
The machine is an Ubuntu 24 server with 4x RTX 4090 (plain nanoGPT ran fine on it). I don't understand why it reports an invalid device: every device-enumeration test I ran came back normal (four devices, numbered 0..3), and other programs use the GPUs in exactly that way.
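For reference, the device check I mean is roughly the following (a minimal sketch using the standard torch.cuda API, not the project's own code); on this machine it reports four devices with ordinals 0..3:

```python
import torch

# Enumerate the CUDA devices PyTorch can see; on this machine this
# prints device_count: 4 and the four RTX 4090s as cuda:0 .. cuda:3.
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```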
==>>> sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh
W1226 16:51:59.401000 7 site-packages/torch/distributed/run.py:792]
W1226 16:51:59.401000 7 site-packages/torch/distributed/run.py:792] *****************************************
W1226 16:51:59.401000 7 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1226 16:51:59.401000 7 site-packages/torch/distributed/run.py:792] *****************************************
Traceback (most recent call last):
File "/modded-nanogpt/train_gpt2.py", line 431, in
torch.cuda.set_device(device)
File "/usr/local/lib/python3.12/site-packages/torch/cuda/init.py", line 476, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
using device: cuda:0
Traceback (most recent call last):
File "/modded-nanogpt/train_gpt2.py", line 431, in
torch.cuda.set_device(device)
File "/usr/local/lib/python3.12/site-packages/torch/cuda/init.py", line 476, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/modded-nanogpt/train_gpt2.py", line 431, in
torch.cuda.set_device(device)
File "/usr/local/lib/python3.12/site-packages/torch/cuda/init.py", line 476, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/modded-nanogpt/train_gpt2.py", line 431, in
torch.cuda.set_device(device)
File "/usr/local/lib/python3.12/site-packages/torch/cuda/init.py", line 476, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
using device: cuda:3
using device: cuda:1
using device: cuda:2
W1226 16:52:43.205000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 74 closing signal SIGTERM
W1226 16:52:43.216000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 75 closing signal SIGTERM
W1226 16:52:43.226000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 76 closing signal SIGTERM
W1226 16:52:43.234000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 77 closing signal SIGTERM
W1226 16:52:43.243000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 78 closing signal SIGTERM
W1226 16:52:43.250000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 79 closing signal SIGTERM
W1226 16:52:43.255000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 80 closing signal SIGTERM
E1226 16:52:45.454000 7 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 7 (pid: 81) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_gpt2.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-12-26_16:52:43
host : 30c961130fdb
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 81)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
===> The state of the machine.
nvidia-smi
Thu Dec 26 16:55:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:25:00.0 Off | Off |
| 32% 29C P8 6W / 450W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 31% 30C P8 5W / 450W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:A1:00.0 Off | Off |
| 32% 29C P8 4W / 450W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:C1:00.0 Off | Off |
| 31% 27C P8 4W / 450W | 4MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
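One more observation from the log above: the elastic agent terminated eight child processes (pids 74..81) and the failing rank is local_rank 7, even though the machine only has four GPUs. My guess (unverified; this is an assumption about run.sh / train_gpt2.py based on the traceback and the "using device: cuda:N" prints, not the verbatim code) is that the launcher spawns one worker per configured process and each worker pins itself to the device matching its LOCAL_RANK, roughly like this sketch, so ranks 4..7 would request ordinals that don't exist on a 4-GPU box:

```python
import os
import torch

# Sketch of what each torchrun worker effectively does (an assumption,
# not the actual modded-nanogpt code): torchrun sets LOCAL_RANK per
# process, and the script binds each rank to the matching CUDA device.
local_rank = int(os.environ["LOCAL_RANK"])   # 0..nproc_per_node-1
device = f"cuda:{local_rank}"
torch.cuda.set_device(device)  # raises "invalid device ordinal" once
                               # local_rank >= torch.cuda.device_count()
print("using device:", device)
```

If that reading is right, the process count the launcher was started with does not match the four GPUs this machine actually has.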