Commit 3cb6f6a
[train][checkpoint] Add ray.train.get_all_reported_checkpoints method (ray-project#54555)
# Summary
This PR adds a `ray.train.get_all_reported_checkpoints` method that
allows users to get all the checkpoints they have reported from within
their training function.
This is different from
[Result](https://docs.ray.io/en/latest/train/user-guides/results.html)
in two ways:
* It is called from the training function on the training worker instead
of from the driver
* It can be called while training is still in progress
# Implementation Notes
The main idea is to use a worker-side counter and controller-side
counter as follows:
* Train worker: `ray.train.report` increments a
`num_reported_checkpoints` counter and puts the training result into its
queue
* Train controller: polls the training results from all worker,
registers the checkpoint, increments `num_reported_checkpoints`, and
then creates an asyncio task to notify asyncio Condition. This works
because asyncio Ray actors should always have an event loop.
* Train worker: `get_all_reported_results` uses an asyncio.Condition to
wait until the worker-side `num_reported_checkpoints` counter matches
its controller-side counterpart before returning the checkpoints. This
ensures that we wait for all pending reports to finish. It has access to
the controller actor through `init_train_context`.
`get_checkpoint` should be unaffected because it uses the local
checkpoint; we can consider changing it to use the "centrally committed"
checkpoint in the future.
# Testing
I ran the [ray train pytorch
example](https://docs.ray.io/en/latest/train/getting-started-pytorch.html)
and called `ray.train.get_all_reported_checkpoints` at the end of each
epoch. The results are as expected; here are a few examples
`
epoch 4: [TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-34-52.538994),
metrics={'loss': 0.24510294198989868, 'epoch': 0}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-07.511694),
metrics={'loss': 0.23799467086791992, 'epoch': 1}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-24.355974),
metrics={'loss': 0.39628422260284424, 'epoch': 2}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-40.273211),
metrics={'loss': 0.15193207561969757, 'epoch': 3}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-56.178119),
metrics={'loss': 0.17416314780712128, 'epoch': 4})]
`
`
epoch 9: [TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-34-52.538994),
metrics={'loss': 0.24510294198989868, 'epoch': 0}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-07.511694),
metrics={'loss': 0.23799467086791992, 'epoch': 1}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-24.355974),
metrics={'loss': 0.39628422260284424, 'epoch': 2}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-40.273211),
metrics={'loss': 0.15193207561969757, 'epoch': 3}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-35-56.178119),
metrics={'loss': 0.17416314780712128, 'epoch': 4}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-36-12.547310),
metrics={'loss': 0.2924661934375763, 'epoch': 5}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-36-28.538090),
metrics={'loss': 0.18640762567520142, 'epoch': 6}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-36-44.583228),
metrics={'loss': 0.12567029893398285, 'epoch': 7}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-37-00.540405),
metrics={'loss': 0.1620682030916214, 'epoch': 8}),
TrainingResult(checkpoint=Checkpoint(filesystem=local,
path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-04_17-37-17.129973),
metrics={'loss': 0.07022886723279953, 'epoch': 9})]
`
I also modified all the Ray Train v2 unit tests that call
`ray.train.report`:
* `test_persistence` also verifies that `get_all_reported_checkpoints`
works on resumption
* `test_data_parallel_trainer` verifies that
`get_all_reported_checkpoints` stalls until all workers report.
I also verified that `get_all_reported_checkpoints` produced similar
output when called from Tune + Train.
I tried to test that `get_all_reported_checkpoints` finished even with
graceful abort but was unable to create such a scenario since
`get_all_reported_checkpoints` returns very quickly and each `report`
forms a barrier.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>1 parent 084598c commit 3cb6f6a
File tree
17 files changed
+252
-31
lines changed- ci/lint
- python/ray/train
- v2
- _internal/execution
- checkpoint
- controller
- worker_group
- api
- tests
17 files changed
+252
-31
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2006 | 2006 | | |
2007 | 2007 | | |
2008 | 2008 | | |
2009 | | - | |
2010 | | - | |
2011 | | - | |
2012 | | - | |
2013 | | - | |
2014 | | - | |
2015 | | - | |
2016 | | - | |
2017 | 2009 | | |
2018 | 2010 | | |
2019 | 2011 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
| 37 | + | |
37 | 38 | | |
38 | 39 | | |
| 40 | + | |
39 | 41 | | |
40 | 42 | | |
41 | 43 | | |
| |||
76 | 78 | | |
77 | 79 | | |
78 | 80 | | |
| 81 | + | |
79 | 82 | | |
80 | 83 | | |
81 | 84 | | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
82 | 89 | | |
83 | 90 | | |
84 | 91 | | |
Lines changed: 40 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
1 | 2 | | |
2 | 3 | | |
3 | 4 | | |
| |||
16 | 17 | | |
17 | 18 | | |
18 | 19 | | |
| 20 | + | |
19 | 21 | | |
20 | 22 | | |
21 | 23 | | |
| |||
81 | 83 | | |
82 | 84 | | |
83 | 85 | | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
84 | 92 | | |
85 | 93 | | |
86 | 94 | | |
| |||
139 | 147 | | |
140 | 148 | | |
141 | 149 | | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
142 | 158 | | |
143 | 159 | | |
144 | 160 | | |
| |||
267 | 283 | | |
268 | 284 | | |
269 | 285 | | |
| 286 | + | |
270 | 287 | | |
271 | 288 | | |
272 | 289 | | |
| |||
279 | 296 | | |
280 | 297 | | |
281 | 298 | | |
| 299 | + | |
282 | 300 | | |
283 | 301 | | |
284 | 302 | | |
285 | 303 | | |
286 | 304 | | |
287 | | - | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
| 314 | + | |
| 315 | + | |
| 316 | + | |
| 317 | + | |
| 318 | + | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | | - | |
12 | 12 | | |
13 | 13 | | |
14 | 14 | | |
| |||
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
| 20 | + | |
20 | 21 | | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
25 | 26 | | |
| 27 | + | |
26 | 28 | | |
27 | 29 | | |
28 | 30 | | |
| |||
45 | 47 | | |
46 | 48 | | |
47 | 49 | | |
48 | | - | |
| 50 | + | |
49 | 51 | | |
50 | 52 | | |
51 | 53 | | |
52 | 54 | | |
53 | 55 | | |
54 | | - | |
| 56 | + | |
55 | 57 | | |
56 | 58 | | |
57 | 59 | | |
| |||
96 | 98 | | |
97 | 99 | | |
98 | 100 | | |
| 101 | + | |
| 102 | + | |
99 | 103 | | |
100 | | - | |
| 104 | + | |
| 105 | + | |
101 | 106 | | |
102 | 107 | | |
103 | 108 | | |
| |||
137 | 142 | | |
138 | 143 | | |
139 | 144 | | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
140 | 152 | | |
141 | 153 | | |
142 | 154 | | |
| |||
189 | 201 | | |
190 | 202 | | |
191 | 203 | | |
192 | | - | |
| 204 | + | |
193 | 205 | | |
194 | 206 | | |
195 | 207 | | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
196 | 213 | | |
197 | 214 | | |
198 | 215 | | |
| |||
212 | 229 | | |
213 | 230 | | |
214 | 231 | | |
215 | | - | |
| 232 | + | |
216 | 233 | | |
217 | 234 | | |
218 | 235 | | |
| |||
265 | 282 | | |
266 | 283 | | |
267 | 284 | | |
| 285 | + | |
268 | 286 | | |
269 | 287 | | |
270 | 288 | | |
| |||
Lines changed: 16 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
6 | | - | |
| 6 | + | |
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| |||
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
70 | 74 | | |
71 | 75 | | |
72 | 76 | | |
| |||
275 | 279 | | |
276 | 280 | | |
277 | 281 | | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
278 | 286 | | |
279 | 287 | | |
280 | 288 | | |
| |||
537 | 545 | | |
538 | 546 | | |
539 | 547 | | |
540 | | - | |
541 | 548 | | |
542 | 549 | | |
543 | 550 | | |
| |||
553 | 560 | | |
554 | 561 | | |
555 | 562 | | |
| 563 | + | |
| 564 | + | |
| 565 | + | |
| 566 | + | |
| 567 | + | |
| 568 | + | |
| 569 | + | |
Lines changed: 15 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | | - | |
| 2 | + | |
3 | 3 | | |
4 | 4 | | |
5 | | - | |
6 | 5 | | |
7 | 6 | | |
8 | 7 | | |
9 | 8 | | |
10 | 9 | | |
11 | 10 | | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
12 | 15 | | |
13 | 16 | | |
14 | 17 | | |
| |||
21 | 24 | | |
22 | 25 | | |
23 | 26 | | |
24 | | - | |
| 27 | + | |
25 | 28 | | |
26 | 29 | | |
27 | 30 | | |
| |||
46 | 49 | | |
47 | 50 | | |
48 | 51 | | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
49 | 61 | | |
50 | 62 | | |
51 | 63 | | |
| |||
Lines changed: 2 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
194 | 194 | | |
195 | 195 | | |
196 | 196 | | |
| 197 | + | |
197 | 198 | | |
198 | 199 | | |
199 | 200 | | |
| |||
213 | 214 | | |
214 | 215 | | |
215 | 216 | | |
| 217 | + | |
216 | 218 | | |
217 | 219 | | |
218 | 220 | | |
| |||
Lines changed: 1 addition & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
437 | 437 | | |
438 | 438 | | |
439 | 439 | | |
| 440 | + | |
440 | 441 | | |
441 | 442 | | |
442 | 443 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
0 commit comments