Commit 15b82ae
[Data] Fix driver hang during streaming generator block metadata retrieval (#56451)
## Why are these changes needed?
This PR fixes a critical driver hang issue in Ray Data's streaming
generator. The problem occurs when computation completes and block data
is generated, but the worker crashes before the metadata object is
generated, causing the driver to hang completely until the task's
metadata is successfully rebuilt. This creates severe performance
issues, especially in cluster environments with significant resource
fluctuations.
## What was the problem?
**Specific scenario:**
1. Computation completes, block data is generated
2. Worker crashes before the metadata object is generated
3. Driver enters the
[physical_operator.on_data_ready()](https://github.com/ray-project/ray/blob/ray-2.46.0/python/ray/data/_internal/execution/interfaces/physical_operator.py#L124)
logic and waits indefinitely for the metadata until the task retry succeeds
and the metadata object becomes available
4. If cluster resources are insufficient, the task cannot be retried
successfully, causing the driver to hang for hours (12 hours in one observed case)
**Technical causes:**
- Using `ray.get(next(self._streaming_gen))` for metadata content
retrieval, which may hang indefinitely
- Lack of a timeout mechanism and of state tracking, so the driver cannot
recover once it is hung
- No proper handling of a worker crash between block generation and
metadata generation
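
The difference between an unbounded and a bounded wait can be illustrated without Ray. The sketch below is a stdlib-only analogy, not Ray code: a `concurrent.futures.Future` that is never resolved stands in for the metadata object reference lost in the worker crash, and `get_with_timeout` is a hypothetical helper name. Calling `.result()` on it with no timeout would block forever, much like the untimed `ray.get` described above; with a timeout, the caller gets control back.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout


def get_with_timeout(ref, timeout_s):
    """Return (True, value) if `ref` resolves within `timeout_s`, else (False, None).

    Hypothetical helper for illustration; `ref` is a stdlib Future standing in
    for an object reference.
    """
    try:
        return True, ref.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller regains control instead of blocking indefinitely.
        return False, None


# A reference whose producer "crashed": nothing will ever resolve it.
lost_meta = Future()
ok, value = get_with_timeout(lost_meta, timeout_s=0.05)

# A reference that is already available resolves immediately.
ready_meta = Future()
ready_meta.set_result({"num_rows": 3})
ok_ready, value_ready = get_with_timeout(ready_meta, timeout_s=0.05)
```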
## What does this fix do?
- Adds `_pending_block_ref` and `_pending_meta_ref` state tracking to
properly handle block/metadata pairs
- Uses `ray.get(meta_ref, timeout=1)` so that metadata content retrieval
is bounded by a 1-second timeout
- Adds error handling for `GetTimeoutError` with warning logs
- Prevents unnecessary re-fetching of already obtained block references
- **Key improvement: Prevents driver from hanging for extended periods
when worker crashes between block and metadata generation**
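
The state-tracking idea in the list above can be sketched as follows. This is a simplified, stdlib-only model and not the code in this PR: `PairReader` and its parameters are hypothetical names, stdlib `Future` objects stand in for Ray object references, and `Future.result(timeout=...)` stands in for `ray.get(meta_ref, timeout=1)`. The only names taken from the PR description are `_pending_block_ref` and `_pending_meta_ref`.

```python
import logging
from concurrent.futures import Future, TimeoutError as FutureTimeout

logger = logging.getLogger(__name__)


class PairReader:
    """Hypothetical driver-side reader of (block ref, metadata ref) pairs."""

    def __init__(self, ref_stream, meta_timeout_s=1.0):
        # The stream yields a block ref, then its metadata ref, alternately.
        self._stream = iter(ref_stream)
        self._meta_timeout_s = meta_timeout_s
        # Refs already pulled from the stream but not yet emitted as a pair.
        self._pending_block_ref = None
        self._pending_meta_ref = None
        self.outputs = []

    def on_data_ready(self):
        """Emit every pair whose metadata is available; return how many were emitted."""
        num_read = 0
        while True:
            try:
                # Pull a new ref only if one is not already pending, so a
                # retry after a timeout never re-fetches the block ref.
                if self._pending_block_ref is None:
                    self._pending_block_ref = next(self._stream)
                if self._pending_meta_ref is None:
                    self._pending_meta_ref = next(self._stream)
            except StopIteration:
                break
            try:
                meta = self._pending_meta_ref.result(timeout=self._meta_timeout_s)
            except FutureTimeout:
                # Keep both pending refs and give control back to the caller.
                logger.warning("Timed out waiting for block metadata; will retry.")
                break
            self.outputs.append((self._pending_block_ref, meta))
            self._pending_block_ref = None
            self._pending_meta_ref = None
            num_read += 1
        return num_read


# Usage: the block is ready, but its metadata arrives only later.
block, meta = Future(), Future()
block.set_result("block-0")
reader = PairReader([block, meta], meta_timeout_s=0.01)
first_call = reader.on_data_ready()    # metadata not ready: times out
meta.set_result({"num_rows": 3})       # e.g. the task retry finally succeeds
second_call = reader.on_data_ready()   # resumes from the pending refs
```

Because the pending references survive across calls, the second call resumes with the same block reference instead of advancing the stream again.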
## Related issue number
Fixes critical performance issue in streaming data processing that
causes driver to hang for extended periods (up to 12 hours) when workers
crash between block generation and metadata generation, especially in
cluster environments with significant resource fluctuations.
## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- **Testing Strategy**
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
---------
Signed-off-by: dragongu <andrewgu@vip.qq.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

1 parent 6320275, commit 15b82ae
4 files changed (+263, -124) under python/ray/data: _internal, execution/interfaces, tests, preprocessors.
The first changed file accounts for 75 additions and 23 deletions.